ICLR 2026 - Reviews


Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (67%) | 3.00 | 3.50 | 3948 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 1 (33%) | 4.00 | 4.00 | 4739 |
| Total | 3 (100%) | 3.33 | 3.67 | 4212 |
Reviews
---

### HoVer: Holistic Verification for Semantic-Aware Speculative Generation

**Ratings.** Soundness: 2 (fair) · Presentation: 2 (fair) · Contribution: 2 (fair) · Rating: 4 (marginally below the acceptance threshold) · Confidence: 3 (fairly confident; possible gaps in understanding parts of the submission or related work; math/other details not carefully checked)

**EditLens prediction.** Fully AI-generated

**Summary.** The paper proposes HoVer (Holistic Verification), a *semantic-level* speculative generation framework. A small "draft" model first generates a full answer. A large "base" model then (i) checks whether the full draft is correct using a prefill-only pass that emits a single classification token ("Correct"/"Incorrect") and, if incorrect, (ii) identifies the longest correct prefix in one additional prefill-only pass by attaching duplicated chat-template suffixes and applying a custom attention mask so that each suffix attends only to its paired prefix. The base model resumes decoding from that boundary, reusing the verified prefix. The method aims to be orthogonal to token-level speculative decoding and can be combined with it. Experiments with Qwen3-235B as the base model and Qwen3-1.7B/8B as drafts report ~1.2×–3.1× speedups with small accuracy deltas on MMLU-Redux, MMLU-Pro, SuperGPQA, GSM8K, MATH, and GPQA-Diamond; ablations indicate that the partial verification step is the main driver of the gains.

**Strengths.**
- Clear articulation of a practical goal: reuse *correct* parts of a draft and avoid full regeneration. The three-stage framework (Draft → Holistic Verify → Revise) is simple to implement on top of HF Transformers.
- The **custom attention mask** plus **suffix duplication** is a neat way to make multiple independent "Correct/Incorrect" decisions in a single forward pass; this reduces the otherwise linear cost of per-prefix checks.
- The paper shows that **partial verification** is the main contributor to speedups; without it, much of the gain disappears, which is an honest and informative ablation.
- The method is **compatible** with standard token-level speculation and can add speed on top of it when acceptance length is high (e.g., MATH).
- The implementation details and prompts in the appendix make replication plausible for readers with sufficient hardware.

**Weaknesses.**

## Novelty and missing related work

The paper positions HoVer as the first "holistic, parallel" semantic verifier for general text. However, there are highly relevant works that the paper does not engage with:

- **Packed/parallel masked verification:** *SPACE* ("Generation Meets Verification: Aligning LLMs with Auto-Correct Decoding") uses a specialized attention mask to generate and verify multiple tokens in parallel and is conceptually close to "one-pass parallel verification with masking." The paper should compare and clarify the differences (scope: token vs. span; training-time vs. inference-time; masking structure).
- **Reflective/tree-masked semantic verification:** *Think Before You Accept* introduces a *tree-based* verification with a sparse attention mask that packs multiple candidates into one forward pass; although focused on reasoning, the mechanism overlaps with HoVer's packed masking idea. A discussion is required.
- **Judge Decoding** shares the idea of accepting semantically correct draft tokens by training an additional verifier. The paper should compare the effectiveness and applications of both methods.

Given these, HoVer's engineering contribution is incremental rather than fundamentally new.
## Evaluation scope and baselines

All datasets are *short-form, multiple-choice, or math* (MMLU-Redux, MMLU-Pro, SuperGPQA, GSM8K, MATH, GPQA-Diamond). Claims of broad applicability to "general-purpose text generation" (open-ended QA, long-form writing, code) are not tested. There is no end-to-end **latency** analysis on long outputs where prefill dominates; reporting tokens/s averaged over benchmarks can hide latency tails and user-visible time-to-first-token (TTFT). The paper also avoids direct comparisons to recent semantic speculation methods, claiming "orthogonal settings," yet it *does* compare to token-level speculation; this is inconsistent.

## Baselines and combinations

The "HoVer + token-level speculation" combination uses *Hugging Face assistant-model speculative decoding* as the implementation, but there is no detailed control of acceptance lengths vs. overhead across datasets, and no direct comparison to stronger token-level accelerators (EAGLE-2/3, dynamic-depth methods). Without these, the synergy claims remain anecdotal.

**Questions.**
1. Provide a formal description (or pseudo-code) of the attention mask construction and prove that suffix tokens cannot attend to non-paired prefixes. What happens with rotary position embeddings or ALiBi when suffixes are duplicated? (A sketch of one plausible construction follows this review.)
2. Show results with at least one non-Qwen base model (e.g., Llama-3-70B/405B, Mixtral-8×22B). Does the "Correct"/"Incorrect" token heuristic transfer across chat templates and tokenizers?
3. Add long-form QA or summarization benchmarks with human or LLM-judge evaluation of semantic fidelity. For code, use HumanEval and MBPP. If HoVer is "general-purpose," it should hold up beyond multiple choice and math.
4. Where does the verifier *mis-locate* error boundaries? Include error heatmaps (by token position) and examples where the base model wrongly accepts a flawed prefix.
5. **If the target model rejects the entire draft, does HoVer slow down overall inference?**
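To make question 1 concrete, the sketch below shows one plausible construction of the packed mask and RoPE position ids, assuming the layout [prompt][draft][suffix_1]…[suffix_K] implied by the summary; the layout and position handling are my reconstruction, not code from the paper.

```python
import torch

def build_parallel_verify_inputs(prompt_len, draft_len, prefix_lens, suffix_len):
    """One plausible packed-verification mask (reviewer's reconstruction).

    Packed layout assumption: [prompt][draft][suffix_1]...[suffix_K], where
    suffix_i must see only the prompt and the first prefix_lens[i] draft tokens.
    """
    K = len(prefix_lens)
    total = prompt_len + draft_len + K * suffix_len
    # Start from a causal mask over the packed sequence (True = may attend).
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
    position_ids = torch.arange(total)
    for i, plen in enumerate(prefix_lens):
        s = prompt_len + draft_len + i * suffix_len  # start of suffix_i
        # Hide draft tokens beyond this suffix's paired prefix.
        mask[s:s + suffix_len, prompt_len + plen : prompt_len + draft_len] = False
        # Hide every earlier suffix copy.
        mask[s:s + suffix_len, prompt_len + draft_len : s] = False
        # With RoPE, re-index each suffix as if it immediately followed its
        # paired prefix; otherwise duplicated suffixes inherit wrong offsets.
        position_ids[s:s + suffix_len] = torch.arange(
            prompt_len + plen, prompt_len + plen + suffix_len)
    return mask, position_ids
```

By construction, each suffix row of `mask` is False on every draft position past its paired prefix and on all other suffix copies, which is the property question 1 asks the authors to prove; ALiBi would need an analogous relative-distance correction rather than re-indexed positions.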
---

### HoVer: Holistic Verification for Semantic-Aware Speculative Generation

**Ratings.** Soundness: 2 (fair) · Presentation: 2 (fair) · Contribution: 1 (poor) · Rating: 2 (reject) · Confidence: 4 (confident, but not absolutely certain; unlikely, though not impossible, that parts of the submission or related work were misunderstood)

**EditLens prediction.** Fully AI-generated

**Summary.** This paper proposes HoVer (Holistic Verification), a speculative generation framework for accelerating large language model inference. Instead of verifying each token as in traditional speculative decoding, HoVer performs semantic-level verification of full candidate outputs using a prefill-only forward pass of the larger "base" model. The base model identifies the longest verified prefix of the draft's output and resumes generation from there. Experiments using Qwen3-235B as the base model and smaller Qwen3-1.7B/8B draft models show latency speedups (1.2×–3.1×) with small degradation on benchmarks such as MMLU and GSM8K.

**Strengths.**
1. Accelerating LLM inference without retraining is an important and timely objective.
2. The idea of leveraging semantic-level information aligns better with human notions of correctness.

**Weaknesses.**
1. The framework relies on a binary correct/incorrect verification signal from the base model, which is often not reliable enough to capture true semantic correctness. Large models can mistakenly validate outputs that are syntactically fluent but logically or factually wrong, undermining the claimed benefits of "holistic" verification.
2. The paper does not offer a clear explanation of how segmentation and prefix reuse are implemented, particularly how special tokens are inserted to partition outputs for error detection and rewriting.
3. The evaluation is limited to a narrow range of tasks and model families.

**Questions.**
1. The paper mentions that error detection and output rewriting require partitioning the generated text into semantic segments and reusing the longest verified prefix, but it is unclear how these segments are defined or how special tokens are inserted to mark them. Are segment boundaries chosen heuristically, or is there an algorithmic rule (e.g., sentence boundaries, punctuation, or token count)? How do these boundaries affect verification accuracy? More detail on the tokenization or prompting process would be helpful.
2. Since HoVer uses the base model's binary correct/incorrect judgment, how reliable is this verification in practice? Have you measured the false acceptance and false rejection rates? Is the verifier's confidence calibrated in any way, or are decisions made purely from the model's next-token likelihoods? (A minimal sketch of the measurement I have in mind follows this review.)
3. How does HoVer handle partially correct answers, for example when only the middle part of a reasoning chain or sentence is valid?
4. Can the same verification prompts and segmentation strategy generalize to domains beyond the selected reasoning datasets?
5. The paper provides only limited baselines, which makes it difficult to position the contribution relative to existing work. A fair and convincing comparison with recent speculative decoding systems or with efficient reasoning frameworks is essential to demonstrate the true advantages of HoVer.
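Regarding question 2, a minimal sketch of the measurement I have in mind, assuming the verdict is read from the next-token probabilities of two label tokens (the exact label strings and verification prompt are assumptions, not details taken from the paper):

```python
import torch

@torch.no_grad()
def verdict_probability(model, tokenizer, verify_input_ids,
                        correct_label="Correct", incorrect_label="Incorrect"):
    """Renormalized probability that the verifier says 'Correct'.

    Hypothetical helper: runs one prefill pass and compares the next-token
    probabilities of the two label tokens.
    """
    logits = model(verify_input_ids).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    c = tokenizer.encode(correct_label, add_special_tokens=False)[0]
    i = tokenizer.encode(incorrect_label, add_special_tokens=False)[0]
    return (probs[c] / (probs[c] + probs[i])).item()
```

Sweeping a threshold on this probability over a set of drafts with ground-truth correctness labels would yield the false acceptance/rejection rates and a reliability diagram, directly answering the calibration question.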
---

### HoVer: Holistic Verification for Semantic-Aware Speculative Generation

**Ratings.** Soundness: 2 (fair) · Presentation: 3 (good) · Contribution: 2 (fair) · Rating: 4 (marginally below the acceptance threshold) · Confidence: 4 (confident, but not absolutely certain; unlikely, though not impossible, that parts of the submission or related work were misunderstood)

**EditLens prediction.** Fully human-written

**Summary.** HoVer is a semantic-level speculative generation method. Whereas in speculative decoding draft generations are accepted only if they match target model generations token by token, in HoVer draft generations are accepted if the target model deems them semantically correct. Given a draft generation, HoVer verifies all prefixes of the draft in parallel, and either accepts the full draft or uses the target model to continue generation from the last correct position. The authors evaluate HoVer on the Qwen3 model family, using the largest variant as the target model and two smaller variants as draft models. On a variety of general reasoning tasks, HoVer generates faster with little performance degradation compared against direct generation with the target model.

**Strengths.**
- The parallel prefix verification is a very interesting approach and is novel (to the best of my knowledge).
- The results on the Qwen3 model family are strong. The experiments indicate HoVer provides consistent speedups while incurring negligible performance loss.
- Examples and case studies clearly communicate how HoVer works.
- Extensive ablations demonstrate the importance of partial verification and orthogonality to speculative decoding.

**Weaknesses.**
- Evaluation is only on a single model family (Qwen3). Evaluations with other model families, or mixing model families between the draft and base model, would strengthen the claims.
- Despite drawing attention to the fact that HoVer is designed for general-purpose text generation while previous methods focus on reasoning, all of the evaluation benchmarks are reasoning benchmarks. To support the general-purpose claim, experiments on more general tasks are needed. For example, the task could be responding to prompts from WildChat or Chatbot Arena, with quality measured by human studies or LLM-as-a-judge.
- There is a lack of baseline methods to compare against, and I respectfully disagree with the authors' decision not to compare against previous work focusing on chain-of-thought reasoning. Those works also aimed to speed up generation while retaining accuracy, and evaluated on some of the same benchmarks as this work. The argument that previous works require segmentation into reasoning steps feels weak, as one can easily generalize "reasoning steps" to non-reasoning tasks (e.g., segmenting on sentences). I agree that previous works verify sequentially, so comparing against those baselines would better contextualize the efficacy of the proposed parallel semantic verification.
- Another possible baseline is a model-cascade setup where, rather than having the large model generate from scratch, it is allowed to specify the edits it wants to make to the small model's generation.
- If I understand correctly, the without-partial-verification ablation is a variant of a model cascade: the draft model generates a response that the base model either accepts or regenerates. It may make more sense to frame that as a baseline rather than an ablation.
- No confidence intervals are provided for any of the numbers in the results. This makes it hard to contextualize the significance of the speedups.
**Questions.**
- Is the "last correct position" (line 291) determined by the Error Detection stage the last position for which all previous prefixes are correct, or the longest correct prefix? If it is the latter, is there a risk that, for long draft generations, a late false positive becomes almost guaranteed? (A sketch contrasting the two rules follows this review.)
- It appears that both the parallel prefix verification prompt and the full-answer verification prompt expect the base model to respond directly with Incorrect/Correct. What was the reason behind this choice (i.e., why not allow the model to reason)?
- In lines 207-209 you discuss how setting higher thresholds improves the base model's coverage of incorrect draft generations. How does the false positive rate change with the threshold, and how does that affect efficiency?
- Could you explain in more detail what you mean by "Partial verification can correctly identify the position of errors, ensuring that performance does not degrade compared to the baseline" (lines 423-425)? Empirically, in Figure 7, performance does not appear to degrade significantly either with or without partial verification.

A writing note:
- Lines 425-427 ("The Mixed Wrong Rate … draft/base model alone") were hard for me to parse. As I understand it, you are describing how the base model rarely makes a mistake when continuing from a partial verification on a problem it would have answered correctly under direct prompting. It took me multiple re-reads to understand, so I suggest restructuring the sentence.
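To make the first question precise, the two acceptance rules it contrasts can be written out explicitly; this is a hypothetical illustration, not code from the paper:

```python
def accept_boundary(prefix_verdicts, rule):
    """Turn per-prefix Correct/Incorrect verdicts into a resume point.

    prefix_verdicts[i] is True iff the i-th (increasing-length) prefix was
    judged correct. Hypothetical helper illustrating the question above.
    """
    if rule == "first_failure":
        # Accept up to the first rejected prefix: a late false positive
        # cannot extend the accepted span past an earlier detected error.
        for i, ok in enumerate(prefix_verdicts):
            if not ok:
                return i
        return len(prefix_verdicts)
    if rule == "longest_correct":
        # Accept the longest prefix judged correct anywhere: a single late
        # false positive silently accepts everything before it.
        return max((i + 1 for i, ok in enumerate(prefix_verdicts) if ok), default=0)
    raise ValueError(f"unknown rule: {rule}")
```

Under "longest_correct", if the per-prefix false-positive rate is ε and there are K prefixes past the true error, the chance of wrongly extending the accepted span is roughly 1 − (1 − ε)^K assuming independent verdicts, which is the "almost guaranteed" failure mode the question refers to.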