ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (100%) | 2.00 | 3.67 | 4892 |
| Total | 3 (100%) | 2.00 | 3.67 | 4892 |
**Think First, Then Select and Verify with Query–Key Alignment**

Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 0
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

**Summary**

This paper proposes using internal QK-scores (attention scores without RoPE and before softmax) for:
1. MCQA evaluation (Section 4.3): selecting the model's choice during multiple-choice QA evaluation (instead of generating the answer token).
2. Self-verification (Section 4.4): verifying the model's own reasoning steps.
3. Candidate generation selection (Section 4.5): selecting the most promising generation (instead of majority voting).

While previous works (both cited in the paper) have explored the use of QK-scores for MCQA without CoT [1] and for self-verification [2], this paper aims to extend the setting to MCQA with CoT and to candidate generation selection. The extension from MCQA without CoT to MCQA with CoT is barely incremental. The paper claims novelty beyond [2], but it is unclear how its self-verification setting is different. The only novel contribution, if any, would be candidate generation selection. However, this setting can also be understood as a variation of MCQA (the question is "which generation is most promising" and each option is a candidate generation).

[1] Tulchinskii, Eduard, et al. "Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA." arXiv preprint arXiv:2410.02343 (2024).
[2] Tulchinskii, Eduard, et al. "Quantifying Logical Consistency in Transformers via Query-Key Alignment." arXiv preprint arXiv:2502.17017 (2025).

**Strengths**

I find no strength in this paper. Instead, I will elaborate on its weaknesses.

**Weaknesses**

# Lack of Novelty/Contribution

Using QK-scores to choose the best option in multiple-choice QA evaluation has already been proposed [1]. The only addition this paper proposes is the use of chain-of-thought (CoT), which is barely incremental. Furthermore, the paper later states that with CoT, the baseline approach (letting the model generate the answer token) catches up to the performance of the proposed QK-score-based method (line 193, page 4). This further downweights the significance of the results.

Using QK-scores to verify the model's own answer has already been proposed [2]. It is unclear how the setting proposed in this paper is any different from [2].

Selecting the most promising generation can be understood as multiple-choice QA, as currently written: the question would be "what is the most promising generation" and the list of options would be the candidate generations.

[1] Tulchinskii, Eduard, et al. "Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA." arXiv preprint arXiv:2410.02343 (2024).
[2] Tulchinskii, Eduard, et al. "Quantifying Logical Consistency in Transformers via Query-Key Alignment." arXiv preprint arXiv:2502.17017 (2025).

# Lack of Details of Experimental Setup

The paper is missing details that are fundamental to understanding the experimental setup:
1. The prompt examples for Section 4.4 and Section 4.5.
2. Whether the MCQA evaluation is done zero-shot (Section 4.3).
3. The decoding settings (temperature, top-p, etc.).
4. How exactly is the best QK head selected from the validation set?
5. Why do you need an external LLM-as-a-judge (Qwen3-70B) (line 247)? The answers for MATH-500 should be standardized, right? Is it just for HLE? What is the prompt for this external LLM-as-a-judge? Etc.
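To make item 4 concrete: the kind of procedure I would expect the paper to spell out is something along the lines of the sketch below (my reading of how [1] does it, not necessarily the authors' implementation; the synthetic scores stand in for the pre-RoPE, pre-softmax q·k products that would have to be hooked out of the model).

```python
# Hypothetical head-selection step on a calibration set (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_layers, n_heads, n_options = 200, 32, 32, 4

# qk[i, l, h, o]: QK-score that head (l, h) assigns to option o on calibration item i.
# In practice these would be the raw q.k dot products (no RoPE, no softmax) between the
# chosen "response-representing" query token and each option's key token.
qk = rng.normal(size=(n_cal, n_layers, n_heads, n_options))
gold = rng.integers(0, n_options, size=n_cal)        # gold option index per item

pred = qk.argmax(axis=-1)                            # each head's chosen option, shape (n_cal, n_layers, n_heads)
acc = (pred == gold[:, None, None]).mean(axis=0)     # calibration accuracy per (layer, head)
best_layer, best_head = np.unravel_index(acc.argmax(), acc.shape)
print(f"best head: layer {best_layer}, head {best_head}, calibration accuracy {acc.max():.3f}")
```

Whether the calibration subset is large enough for this argmax to be stable is exactly the kind of detail that is missing.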
Many of the terms are not used consistently throughout the paper, which adds to the confusion. For example, MCQA+CoT (line 16), MCQA with CoT (lines 67, 196, 216), MCQA with reasoning (lines 51, 118, 124), MCQA-with-CoT-reasoning (line 128), and MCQA with integrated CoT reasoning (line 181) all seem to refer to the same thing.

# Experimental Setup Copied From Previous Works?

Some parts of the experimental setup are either out of place or follow the two previous works [1, 2] too closely without properly acknowledging the resemblance.

The term "premise" in the paragraph under **QK-score and connection between reasoning parts.** (lines 100-1345) seems out of place, potentially except for the logic consistency verification (line 129). It is also unclear why "premise" would be abbreviated as "c" (line 102). Upon further inspection, it seems this term was directly copied from [2]. Relevant part from [2]: "In our setup, each input consists of a context c (which provides the premises), a statement s (a candidate conclusion), and a candidate answer a_i" (page 2).

The paper mentions: "When it is not stated otherwise, we do not aggregate predictions or QK-scores from multiple attention heads. Instead, in each experiment we use a separate calibration subset of the data from the same domain to select the single best performing head." (lines 137-139). This is weird, since the paper never mentions aggregating scores across attention heads. Upon inspection, this seems to be modified from a similar sentence in [1]: "We do not aggregate heads predictions. Instead, we use the scores from the single best head, which is selected by the accuracy on the validation set D_val." (page 4).

The paper defines Permutational Accuracy (PA) in Equation 1 and explains: "where I_i is the indicator value equals to 1 if the model answers question i correctly, while I^p_i equals to 1 iff the model answers question i correctly after answer options were permuted." (lines 174-175). This is almost word-for-word copied from [1]: "where I_i is the indicator value equals to 1 iff model's answer on question i is correct, while I^p_i equals to 1 iff model gives correct answer on question i after its options (their texts not letters) were permuted" (page 6).

None of these similarities are properly acknowledged.

[1] Tulchinskii, Eduard, et al. "Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA." arXiv preprint arXiv:2410.02343 (2024).
[2] Tulchinskii, Eduard, et al. "Quantifying Logical Consistency in Transformers via Query-Key Alignment." arXiv preprint arXiv:2502.17017 (2025).
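For reference, the quoted definition presumably amounts to

$$\mathrm{PA} = \frac{1}{N}\sum_{i=1}^{N} I_i \cdot I_i^{p},$$

i.e., credit is given only when question $i$ is answered correctly both before and after the option permutation. The paper should state whether Equation 1 matches this reading or averages over multiple permutations $p$.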
# Questionable Results

The baseline numbers for MCQA (just letting the model generate answer tokens) seem off. See Table 1, for example:
1. DeepSeek-R1-Distill models and Qwen3 models have lower or similar numbers compared to LLaMA-3.1-8B?
2. No benefit of model scale? For example, the 8B/14B/32B models do not show a smooth increase in performance.
3. The Qwen3 numbers are significantly lower than expected. For example, Qwen3-32B should reach 68% on MMLU-Pro (5-shot) [3]. Even if this paper used zero-shot and a different evaluation setting (which is why the paper should have provided more details on the setting used), the numbers should not change this much.

[3] Yang, An, et al. "Qwen3 technical report." arXiv preprint arXiv:2505.09388 (2025).

# Internally Inconsistent Content

Now this is getting into the interesting part. Upon manual inspection of the codebase provided by the authors, not only do they not provide the full codebase, but some of their results do not match the content shown in the paper. For example, in the 16th and 17th output cells of the `HLE_MCQA_qwen3_14b.ipynb` file in the codebase, they report "Baseline: 0.336 QK: 0.354" and "Baseline: 0.36 QK: 0.35", neither of which is consistent with Table 2. For another example, in Figure 1 the paper claims to use "Options:\n" in the prompt for MCQA; however, such text is not included in the provided `HLE_MCQA_qwen3_14b.ipynb` file.

# Numerous Mistakes in the References Section

There are many mistakes in the References section that raise suspicion. These are not typical mistakes made by humans.
1. The paper writes "DoLa: Decoding by contrasting layers improves **factuality and faithfulness**" (lines 334-335). The correct title of this paper is "DoLa: Decoding by Contrasting Layers Improves **Factuality in Large Language Models**".
2. The paper includes "Decoding-time baseline." (line 335) at the end of an entry under Chuang et al., and "Introduces the GSM8K benchmark." (line 338) at the end of an entry under Cobbe et al. These seem to be comments that humans generally would not include.
3. For the entry under Ren et al. (lines 367-368), the name of the journal is simply "Proceedings on".

# Nitpicky Details

1. The paper changed the margin settings. The left margin is 1 inch? This is interesting, because the paper would still fit under the 9-page limit with the permitted margin settings.
2. There is no description of the green color in Table 3.
3. The 16% in Table 3 is not colored green. It also should be 14% instead.

**Questions**

1. Can you provide more details on the experimental setup? Specifically, the prompts for Section 4.4 and Section 4.5 would be useful. Also, please provide the decoding budget and the temperature/top-p settings for all experiments.
2. Can you explain how the baseline numbers in Tables 1 and 2 were derived? Why are the numbers so different from what one might expect of these models?
3. If the baseline approach with CoT performs as well as your QK-value-based approach, what is the significance?
4. In your abstract, you say "By leveraging this signal, we surpass the performance of full-scale, preference-optimized LLMs on two fundamental reasoning tasks: multiple-choice question answering and solution correctness validation." What are the "full-scale preference-optimized LLMs"?
5. In your abstract, you say "Our method achieves performance gains of up to ≈ 22% across various benchmarks and models." Where do you get a gain of "22%"?

EditLens Prediction: Fully human-written
**Think First, Then Select and Verify with Query–Key Alignment**

Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

**Summary**

This paper is not written well enough for me to understand it clearly and appreciate its contributions. Based on what little I could make out, the paper tries to solve the MCQA task by forcing the model to do CoT before selecting the final answer choice and, in that process, aims to use QK scores, which are part of the self-attention calculation anyway. It is unclear to me what the novel idea in this paper is and why the standard self-attention mechanism cannot handle what is being proposed here. I am neither clear about the research gap being addressed nor about the novelty of the contributions. At the least, this paper needs a thorough rewrite for me to understand and appreciate its contributions.

**Strengths**

- None that I could identify.

**Weaknesses**

- The writing style of the paper is quite informal and unclear. A thorough rewrite of Section 3 is needed to convey the ideas clearly.
- The paper lacks novelty in both the problem definition and the solution approach. The proposed ideas are well known in the literature.
- It is unclear why standard self-attention would not already take care of the QK-score mentioned here.
- It is unclear what reasoning is happening in the MCQA dataset and how. No illustrative example is provided.
- There are ill-formed sentences in multiple places; for example, see lines 123-124.

**Questions**

- It is unclear why standard self-attention would not already take care of the QK-score mentioned here.
- It is unclear what reasoning is happening in the MCQA dataset and how. No illustrative example is provided.

EditLens Prediction: Fully human-written
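The question raised above, why standard self-attention does not already provide this signal, comes down to where the score is read off: the QK-score, as described in the paper, is the raw query-key dot product taken before RoPE is applied and before the softmax, whereas the attention weights the forward pass actually uses are rotated and normalized. A toy sketch of the distinction (illustrative numbers only, not the authors' code):

```python
# Contrast between a raw QK-score (no RoPE, no softmax) and a standard attention weight.
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Minimal rotary embedding: rotate dimension pairs by position-dependent angles."""
    half = x.shape[-1] // 2
    ang = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)                    # query at the answer-selection position
keys = rng.normal(size=(3, d))            # keys at three candidate-option positions

qk_scores = keys @ q                      # what the paper reportedly reads out

q_rot = rope(q, pos=10)
k_rot = np.stack([rope(k, pos=p) for k, p in zip(keys, [3, 5, 7])])
attn = np.exp(k_rot @ q_rot / np.sqrt(d))
attn /= attn.sum()                        # what the model actually uses downstream

print("raw QK-scores    :", np.round(qk_scores, 3))
print("attention weights:", np.round(attn, 3))
```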
**Think First, Then Select and Verify with Query–Key Alignment**

Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

**Summary**

This paper presents a white-box method for answer selection and verification in LLMs. The main idea is to use the raw Query-Key (QK) dot-product score from the transformer's attention mechanism as an "internal signal" from the model. The authors claim that a "think-first" phase via CoT prompting strengthens internal QK alignment, allowing for more reliable selection of answers directly from model activations. The method is evaluated in several settings: MCQA, verification, and hypothesis selection. The MCQA setting is tested on the MMLU-Pro and HLE-1/4 datasets; the verification and hypothesis selection settings are tested on MATH-500 and HLE-1/4. HLE-1/4 is the authors' adaptation of the Humanity's Last Exam benchmark. Additionally, the MATH-500 dataset is used for testing open-ended reasoning. The authors report significant performance gains of up to ~22% for this method.

**Strengths**

- The proposed method is novel. While prior work used QK-scores for probing, this paper uses them as a decision rule for selection and verification.
- Some of the reported results are impressive. 1) In the MCQA setup on MMLU-Pro, the QK-score method outperforms the baseline for several models, such as Qwen-14B: 17.72% -> 44.42% or Qwen-32B: 16.6% -> 49.32%. 2) In hypothesis selection, the QK-score method (tested with LLaMA-3.1 8B on the MATH-500 dataset) outperforms the baseline by almost 22 pp.
- Figure 2 shows a high correlation of head performance between MATH-500 and HLE-1/4 for hypothesis selection. This provides evidence that the QK-score method captures a more generalizable signal.
- I appreciate the usage of the PA metric. Positional bias is often forgotten when using MCQA tasks.

**Weaknesses**

- The entire method relies on a crucial and potentially fragile step: selecting a single best-performing head from a calibration dataset. The paper provides no analysis of how stable this selection is. How large does the calibration set need to be? What is the performance distribution across heads? Is there only one good head, or are there many? Is head selection done separately in each setting/experiment? If yes, it weakens the claim that the method generalizes. The lack of these analyses is a major limitation of this work.
- The method heavily depends on the choice of "premise-representing" and "response-representing" tokens. The paper lacks an ablation study showing how sensitive the results are to this choice.
- While some of the presented results are strong, others contradict the paper's main narrative. For example, in the MCQA with CoT setting, the QK-score method underperforms the baseline for several models (on HLE-1/4: DeepSeek-R1-Distill-Qwen-14B, 33.25% accuracy for the baseline vs. 31.56% for the QK-score; Qwen3-14B, 33.06% for the baseline vs. 29.06% for the QK-score). These cases are not discussed or explained, weakening the claim that the QK-score is a systematically better selection mechanism when CoT is used.
- The MCQA baseline in Tables 1 and 2 is not explained.
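To make the single-head concern concrete, one simple robustness check would be a top-k head ensemble along the following lines (a sketch of the suggestion, not something the paper implements):

```python
# Hypothetical top-k head ensemble: average the QK-scores of the k heads with the
# highest calibration accuracy, then pick the highest-scoring option per test item.
import numpy as np

def topk_head_ensemble(qk_cal: np.ndarray, gold_cal: np.ndarray,
                       qk_test: np.ndarray, k: int = 5) -> np.ndarray:
    """qk_*: [n_items, n_heads, n_options] QK-scores, with heads flattened across layers."""
    cal_acc = (qk_cal.argmax(-1) == gold_cal[:, None]).mean(axis=0)  # accuracy per head
    top_heads = np.argsort(cal_acc)[-k:]                             # k best calibration heads
    return qk_test[:, top_heads, :].mean(axis=1).argmax(-1)          # ensemble prediction

# Synthetic stand-in data, only to show the shapes involved.
rng = np.random.default_rng(0)
qk_cal, qk_test = rng.normal(size=(200, 1024, 4)), rng.normal(size=(50, 1024, 4))
gold_cal = rng.integers(0, 4, size=200)
print(topk_head_ensemble(qk_cal, gold_cal, qk_test, k=5)[:10])
```

Reporting how the accuracy of such an ensemble changes with k (and with the calibration-set size) would directly address the stability question.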
**Questions**

- How stable is the "golden head" selection? Is the same head selected across different tasks and datasets for a given model?
- Please provide results for a top-k head ensemble to check robustness, rather than relying on a single head.
- Why does the QK-score method underperform the baseline after CoT is applied in several cases in Table 2? This seems to contradict the main hypothesis.
- Please provide an ablation study on the choice of "premise-representing" and "response-representing" tokens to justify using punctuation.
- How was the "MCQA Baseline" in Tables 1 and 2 calculated?

I am willing to increase my score if these concerns are resolved (especially regarding the head selection).

EditLens Prediction: Fully human-written