|
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces WaterSearch, a search-based LLM watermarking framework that generates multiple parallel candidates instead of a single one and then selects the one that best preserves coherence with the original text to enhance quality and detectability.
1. The idea of using multiple candidates is clear and effective.
2. The paper also includes a solid theoretical analysis of the proposed method.
3. Evaluations are comprehensive, on various models and tasks.
1. The overhead of this method seems to be significant.
2. A main weakness of the paper is the limited comparison against recent work. The paper mainly compares against the original KGW method. More recent and stronger baselines, including both token-level and semantic-level watermarking methods, should be included for a more comprehensive evaluation.
3. Robustness evaluation is also limited. Stronger modification and paraphrasing attacks, beyond deletion, insertion, and synonym substitution, need to be evaluated to show the superiority of the proposed method over SOTA.
See weaknesses. |
Fully human-written |
|
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces WaterSearch, a search-based watermarking framework that aims to improve the trade-off between watermark detectability and text quality in large language models (LLMs). Instead of modifying logits during generation as in the standard KGW framework, WaterSearch performs chunk-level parallel generation: it generates multiple candidate continuations (some watermarked, one unwatermarked) and selects the one that maximizes a joint score balancing semantic similarity to the unwatermarked text and detectability based on green-list token frequency. A chi-square–based detection procedure is proposed to test statistical significance across chunks. Experiments across 3 LLMs (Llama-2, Qwen-2.5, InternLM) and 10 datasets show consistent improvements in both generation quality and detection robustness, particularly under low-entropy and short-text conditions.
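To make the selection step described above concrete, here is a minimal sketch of choosing one chunk from several candidates by a joint score. It is an illustration under stated assumptions, not the paper's implementation: the weight `alpha`, the linear form of the score, the hash-based `is_green` test, the green-list fraction `GAMMA`, and the use of Jaccard token overlap as a stand-in for semantic similarity are all hypothetical choices made for this example.

```python
import hashlib

GAMMA = 0.5  # assumed green-list fraction (hypothetical value)


def is_green(prev_token: str, token: str, key: str = "demo-key") -> bool:
    """Toy green-list test: hash (key, previous token, token) to a pseudo-random value."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] / 256 < GAMMA


def green_fraction(tokens: list[str], key: str = "demo-key") -> float:
    """Fraction of tokens (after the first) that land in the toy green list."""
    if len(tokens) < 2:
        return 0.0
    hits = sum(is_green(p, t, key) for p, t in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)


def jaccard(a: list[str], b: list[str]) -> float:
    """Token-set overlap, used here only as a cheap stand-in for semantic similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def select_chunk(reference: list[str], candidates: list[list[str]], alpha: float = 0.5) -> list[str]:
    """Pick the candidate maximizing alpha * similarity + (1 - alpha) * green fraction.

    `reference` plays the role of the unwatermarked chunk; the linear
    weighting is an assumption made for this sketch.
    """
    def score(candidate: list[str]) -> float:
        return alpha * jaccard(reference, candidate) + (1 - alpha) * green_fraction(candidate)

    return max(candidates, key=score)
```

With `alpha=1.0` the selection reduces to picking the candidate closest to the unwatermarked chunk, and with `alpha=0.0` it reduces to picking the candidate with the most green tokens; intermediate values trade one off against the other.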
* Simple idea and framework: WaterSearch can be applied on top of existing KGW-style watermarking schemes with minimal modification.
* The method improves performance across all evaluated datasets, including difficult cases such as short-text or low-entropy settings, where KGW tends to fail.
* The paper uses WaterBench and additional benchmarks (e.g., RepoBench-P) and shows gains across multiple model families.
* Figures and algorithm descriptions make the approach easy to follow; the writing is concise and readable.
* Incremental conceptual novelty: The idea to generate several watermarked candidates and pick the best is intuitive, but very closely resembles beam search or rejection sampling. The contribution feels more engineering-oriented than conceptual, especially given that most of the theoretical development restates expected properties of the existing KGW trade-off.
* Computational inefficiency: WaterSearch performs parallel or beam-style generation of multiple watermarked candidates per chunk and selects the best one, which intuitively incurs substantial wall-time cost. While Table 4 discusses space complexity, runtime overhead or throughput (tokens/s) is not reported. Without this, it is hard to assess practical efficiency, but based on the runtime complexity reported in the paper, a ~5x slowdown in generation is fairly substantial and reduces the practical utility of the method.
* What is the actual computational overhead relative to vanilla KGW? Reporting wall-clock time or tokens/s for each configuration would clarify practical feasibility.
* How sensitive is the approach to the number of parallel candidates k? Does increasing k yield linear improvement in detectability, or diminishing returns?
* Could the same results be achieved by post-hoc reranking or constrained decoding (e.g., using logits rather than full re-generation)?
* Since quality is evaluated only relative to the unwatermarked model’s continuation, could semantic drift still occur if that baseline itself is low-quality or inconsistent?
* The claim that KGW “maintains text quality well from the perspective of perplexity or LLM-as-judge” is contradicted by results in WaterBench (Tu et al., 2024) and New Evaluation Metrics Capture Quality Degradation due to LLM Watermarking (Singh et al., 2024), which show measurable degradations in both perplexity and subjective fluency for KGW. The discussion should acknowledge these findings. |
Fully AI-generated |
|
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes WaterSearch, a novel search-based framework for watermarking Large Language Model (LLM) outputs. The core idea is to move beyond token-level watermark embedding by generating multiple candidate text chunks in parallel and selecting the one that best balances text quality (fidelity to the original, unwatermarked model distribution) and watermark detectability (statistical strength of the watermark signal). The method is presented as a solution to the fundamental trade-off between these two objectives in existing watermarking schemes like KGW. The authors also introduce a new detection algorithm based on hypothesis testing and provide extensive experimental results showing significant improvements over strong baselines.
1. The shift from a purely token-level manipulation to a chunk-level search-and-select paradigm is a significant and compelling contribution. It elegantly reframes the watermarking problem as a multi-criteria optimization, which directly addresses a well-known limitation of existing methods.
2. The paper provides a theoretical analysis (Theorem 1) linking the macroscopic (sentence-level) selection objective with the microscopic (token-level) watermarking trade-off. This strengthens the methodological foundation and justifies the proposed approach.
3. The experiments are thorough and well-designed.
* Comprehensive Benchmarking: Evaluation across 10 diverse tasks and 3 major LLMs (Llama-2, InternLM, Qwen) demonstrates generalizability.
* Significant Performance Gains: The reported average improvement of 51.01% in downstream task performance under fixed detectability is impressive and clearly highlights the method's value.
* Robustness in Challenging Scenarios: The strong results on short-text (+47.78%) and low-entropy (e.g., code generation, +36.47%) scenarios are particularly noteworthy, as these are known pain points for current watermarks.
* Exceptional Attack Resilience: Maintaining high detectability under 80% word-level perturbations (deletion, insertion, substitution) is a remarkable result that significantly outperforms baselines.
1. Algorithm 1 requires generating $k$ candidate chunks in parallel at each step. This implies the generation time (latency) will be roughly $k$ times that of a baseline method. The paper's claim of "low computational cost" is misleading as it primarily focuses on memory (KV cache).
2. Algorithm 2 (Detection) appears to require the detector to "Recover the seeds from generation". This suggests the detector must know the exact context $c$ and the random seed generator used during generation. This is a much stronger assumption than KGW (which only needs a secret key) and may be fragile in black-box detection or if the context is slightly modified.
3. The experiments fix $k$ (beam size) to 5 (1 vanilla + 4 watermarked). $k$ is a critical hyperparameter balancing quality, detectability, and latency, but the paper lacks a sensitivity analysis or ablation study on $k$.
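The seed-recovery concern in weakness 2 can be illustrated with a small sketch of one way seeds could be derived deterministically from a secret key and the context. This is a hypothetical construction, not the paper's procedure: the function `derive_seeds`, the HMAC-SHA256 derivation, and the separator bytes are all assumptions introduced for this example.

```python
import hashlib
import hmac


def derive_seeds(secret_key: bytes, context_tokens: list[str], n_seeds: int) -> list[int]:
    """Derive n_seeds 64-bit seeds from a secret key and the context (hypothetical scheme).

    Each seed is HMAC-SHA256(key, context || index), truncated to 8 bytes.
    Anyone holding the key and the exact context can recompute the seeds.
    """
    context = "\x1f".join(context_tokens).encode("utf-8")
    seeds = []
    for i in range(n_seeds):
        msg = context + b"|" + str(i).encode("ascii")
        digest = hmac.new(secret_key, msg, hashlib.sha256).digest()
        seeds.append(int.from_bytes(digest[:8], "big"))
    return seeds


key = b"secret-watermark-key"
context = ["the", "quick", "brown", "fox"]

at_generation = derive_seeds(key, context, n_seeds=4)
at_detection = derive_seeds(key, context, n_seeds=4)          # same key, same context
after_edit = derive_seeds(key, context[:-1] + ["dog"], n_seeds=4)  # one context token changed

print(at_generation == at_detection)  # seeds are recoverable when the context is intact
print(at_generation == after_edit)    # a single edited context token changes the seeds
```

Under a scheme of this kind the detector needs only the key, but it also needs the context to be unmodified, which is the fragility the reviewer asks the authors to address.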
1. How exactly are the k−1 watermark seeds generated from context and recovered at detection time? Is the seed generator deterministic and keyed? What are attack consequences if this procedure is partially known?
2. Can you provide wall-clock runtime and peak GPU memory numbers for representative settings (e.g., k=5, chunk m=32) on a standard GPU? The asymptotic KV discussion is useful but practitioners will want absolute numbers.
3. Have you tried stronger/adaptive attackers (e.g., paraphrase-based sentence rewriting engineered to minimize green-token counts) or defenses that specifically target chunk-final tokens? If so, what happens to detectability?
4. For Theorem 1, can you relax the token-independence assumption or empirically show how well the mapping (f) holds across tasks? |
Fully AI-generated |
|
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper generates multiple outputs with different random seeds, constrains them to stay as semantically consistent with the original sentence as possible, and combines them to find the optimal balance between semantic integrity and watermarking.
1. The design of alpha is rigorous overall, with theoretical proof of its effectiveness and ablation studies demonstrating optimal alpha values, reasonably extending the KGW method.
2. The time complexity is designed effectively, so computational resources increase only moderately.
3. Experiments demonstrate that sufficiently large differences between random seeds allow multiple watermarked outputs to be combined into text whose semantics are closer to the original meaning; they also validate other hyperparameters such as K.
1. Visualized examples are missing for most results; only an NBA example is given.
2. The score q is a linear sum of semantic similarity to the original output and watermarking quality. This is very straightforward, but whether a linear sum is effective is open to question; more theoretical support is needed.
3. The strategy for picking different random seeds is still not clear enough to me.
How will the method perform on bigger and SOTA language models? |
Fully human-written |