ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 1 (25%)  | 6.00       | 4.00           | 3108               |
| Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A                |
| Lightly AI-edited    | 1 (25%)  | 2.00       | 4.00           | 1869               |
| Fully human-written  | 2 (50%)  | 2.00       | 3.50           | 2858               |
| Total                | 4 (100%) | 3.00       | 3.75           | 2674               |
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads

Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The authors use model uncertainty as process supervision to guide the model's reasoning steps. To perform uncertainty estimation, they train a lightweight value head in a supervised manner to predict the model's uncertainty. The training data are labeled either by the model itself serving as the supervisory model or by a third-party supervisory model. The authors conduct experiments on mathematics, planning, and QA datasets, comparing their approach with several unsupervised uncertainty estimation methods and third-party process reward models.

Strengths:
1. The method proposed by the authors is lightweight: it only requires training a value head, which makes it highly efficient.
2. The authors propose an automated training data construction scheme.
3. The authors conduct extensive experiments, comparing their approach across datasets from three different domains.

Weaknesses:
1. The proposed method lacks novelty, as many prior works have already trained process reward models (PRMs) or used uncertainty estimation as a supervision signal. For example, the baseline methods cited by the authors employ similar ideas. The main contribution of this paper is merely implementing such supervision through a lightweight value head. Moreover, UHead is itself existing work.
2. The authors' definition of uncertainty lacks rigor. Generally speaking, a metric trained directly from accuracy should not be regarded as a measure of uncertainty. For example, when a model produces a particular wrong answer very frequently during random sampling, its uncertainty about that answer should be very low; however, under the training method proposed by the authors, such a case would yield a high uncertainty value (see the sketch after this review). For the definition of uncertainty, I recommend reading this paper: https://arxiv.org/pdf/1802.10501

Questions:
Have you compared the results between full-parameter fine-tuning and UHead?

EditLens Prediction: Lightly AI-edited
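A minimal sketch of the confidently-wrong case raised in weakness 2 above. The `samples` and `correct_answer` values are illustrative stand-ins, not from the paper; the point is only that sampling-based predictive uncertainty and an accuracy-derived label can disagree:

```python
from collections import Counter
import math

# Hypothetical example: ten sampled answers to one question, where the
# model almost always produces the same *wrong* answer.
samples = ["42"] * 9 + ["41"]   # 9/10 identical answers (illustrative values)
correct_answer = "7"            # illustrative ground truth

# Sampling-based (predictive) uncertainty: entropy of the answer distribution.
# Concentrated samples => low entropy => the model is very *certain*.
counts = Counter(samples)
probs = [c / len(samples) for c in counts.values()]
entropy = -sum(p * math.log(p) for p in probs)
print(f"predictive entropy: {entropy:.3f} nats (low = confident)")

# Accuracy-derived label, as the reviewer reads the paper's training scheme:
# the majority answer is wrong, so the step is labeled maximally "uncertain".
accuracy_label = float(samples[0] != correct_answer)
print(f"accuracy-derived 'uncertainty' label: {accuracy_label} (high)")

# The two signals disagree: the model is confidently wrong, so a head trained
# on correctness labels estimates expected error, not predictive uncertainty.
```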
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper addresses the challenge of verifying the correctness of intermediate reasoning steps in LLMs' multi-step reasoning and proposes a lightweight Uncertainty Quantification Head (UHead) to replace computationally expensive Process Reward Models (PRMs).

Strengths:
- The proposed method is compared with comprehensive baselines.
- Process reward is important for the development of large reasoning models.

Weaknesses:
1. The proposed method is not clearly described: how is U(r_t^(j) | r_{<t}^(j), x) estimated, and what is the architecture of the U-heads?
2. This paper appears to use the U-head to learn uncertainty for process-reward estimation. Since the U-head comes from another work, what is the contribution of this paper?
3. The method should be evaluated on the latest PRM benchmarks, such as PRMBench.

Questions:
The U-head contains few parameters compared with LLMs, but does it rely on the embeddings or hidden states of the LLM? In that case, we cannot say that the U-head is a more efficient method than simple baselines like LLM-as-judge, since the full LLM forward pass is still required (see the sketch after this review).

EditLens Prediction: Fully human-written
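An illustrative sketch of the efficiency point in the question above, assuming a generic small classification head over frozen hidden states; the architecture and hidden size are my assumptions, not the paper's actual UHead:

```python
import torch
import torch.nn as nn

HIDDEN = 4096  # assumed hidden size of a hypothetical 7B-class base LLM

class TinyUncertaintyHead(nn.Module):
    """Generic small head over frozen LLM states (illustrative, not the paper's)."""
    def __init__(self, hidden_size: int = HIDDEN):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.GELU(),
            nn.Linear(256, 2),  # correct / incorrect verdict
        )

    def forward(self, step_hidden_state: torch.Tensor) -> torch.Tensor:
        return self.mlp(step_hidden_state).softmax(dim=-1)

head = TinyUncertaintyHead()
print(f"head parameters: {sum(p.numel() for p in head.parameters()):,}")  # ~1M

# `step_hidden_state` must come from running the frozen base LLM over the
# prompt and reasoning step, so per-step inference cost is dominated by the
# base model, not the head -- which is the reviewer's concern.
dummy_state = torch.randn(1, HIDDEN)  # stand-in for the LLM's hidden state
print(head(dummy_state))
```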
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces UHeads (Uncertainty quantification Heads), a lightweight alternative to Process Reward Models (PRMs) for verifying step-level correctness in LLM reasoning chains. UHeads are small transformer modules (<10M parameters) trained on frozen LLM internal states to predict step-level uncertainty, with labels generated either by larger models (DeepSeek-R1) or through self-supervision. The authors demonstrate that despite being 750-810× smaller than PRMs, UHeads achieve competitive performance across mathematics, planning, and QA tasks, particularly excelling in out-of-domain scenarios, suggesting that LLMs' internal states encode meaningful uncertainty signals for reasoning verification.

Strengths:
1. The proposed UHeads achieve comparable or superior performance to PRMs while using 750-810× fewer parameters (9.8M vs 7-8B), offering a highly efficient alternative for step-level reasoning verification that significantly reduces inference costs and memory requirements.
2. UHeads demonstrate superior generalization capabilities, particularly on OOD tasks, where they consistently outperform much larger PRMs, suggesting they capture more transferable uncertainty signals rather than overfitting to domain-specific patterns.
3. The automatic annotation pipeline eliminates requirements for human labels, verifiable final answers, or costly Monte Carlo rollouts, supporting both external supervision (via DeepSeek-R1) and self-supervision approaches with comparable performance.

Weaknesses:
1. Tables 2-4 and 6 consistently show UHeads underperforming strong PRM baselines on in-domain mathematical tasks (MATH, GSM8K), with gaps of 5-10% in PR-AUC, raising questions about whether the computational savings justify the accuracy trade-off for domain-specific applications.
2. The 256-token generation limit during training data creation may constrain the method's applicability to more complex reasoning tasks, such as AIME problems that require tens of thousands of tokens, potentially limiting the approach's generalizability.
3. Given that UHeads require training on specific LLM internal states while PRMs can be used off-the-shelf across different models, and considering the performance gaps on certain tasks (e.g., ScienceQA, where RLHFlow-PRM-DeepSeek significantly outperforms), the overall value proposition compared to training a single general-purpose PRM remains unclear.

Questions:
1. How does the approach handle step boundary definition in complex reasoning chains that include self-verification, backtracking, or recursive refinement? The paper's reliance on structured prompts may not generalize to more naturalistic reasoning patterns.
2. In Section 2.3, the notation P(y|x,D) appears problematic, since training on data D fundamentally changes the model parameters θ rather than just conditioning the distribution. Should this be reformulated as P_θ'(y|x), where θ' represents the post-training parameters? (See the sketch after this review.)
3. What training factors contribute to UHeads' underperformance on in-domain tasks compared to PRMs? A deeper analysis of failure modes and potential improvements would strengthen the paper's contribution.

EditLens Prediction: Fully AI-generated
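One way to make question 2 above precise, under the standard Bayesian-vs-point-estimate reading (my formalization, not the paper's notation):

```latex
% Conditioning on D denotes Bayesian marginalization over parameters,
\[
P(y \mid x, D) = \int P(y \mid x, \theta)\, p(\theta \mid D)\, d\theta ,
\]
% whereas supervised training selects a single point estimate
\[
\theta' = \operatorname*{arg\,min}_{\theta} \;
          \mathbb{E}_{(x,y) \sim D} \big[ \mathcal{L}(f_{\theta}(x), y) \big],
\]
% so the trained verifier defines $P_{\theta'}(y \mid x)$, which coincides
% with $P(y \mid x, D)$ only if $p(\theta \mid D)$ collapses to a point
% mass at $\theta'$.
```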
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads

Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces a method for step-wise verification based on quantifying the uncertainty involved in the reward predictions for each reasoning step. It implements a UHead, a classification module on top of the LLM's hidden states, and uses its predictions for scoring verification rewards. Empirically, the paper provides experiments on step-level correctness and offline/online best-of-N using verifier-guided search, claiming on-par performance with 7B-8B PRM models and strong OOD generalization.

Strengths:
- The idea of using UQ methods for unsupervised/self-supervised verification is well motivated and justified.
- The paper brings several baselines, particularly UQ methods for verification, which is good benchmarking for UQ-based, model-based verification research.
- The paper brings an interesting OOD evaluation setting, which is very relevant but overlooked in PRM research.

Weaknesses:
- The major concern with the paper is the clarity of its presentation when describing the proposed methodology. The background section brings a subsection about UQ but does not follow up on it when describing the method in Section 3. The core technique in the paper is the UHead, but it is not formally described, making the paper not self-contained. There are no details on how uncertainty is estimated, on how/whether the terms in the equation of Section 2.3 are computed, or on the nature of the uncertainty estimated (e.g., predictive, epistemic).
- From the description provided in Section 3, the UHead seems to be a classification network on top of a base LLM hidden state, and the uncertainty here relates to the softmax entropy over the Yes/No classes. If this understanding is correct, then there are a few points to consider:
  - It would be important to compare against the predictive entropy from the LLM itself, i.e., consider the (re-normalized) Yes/No distribution conditioned on the reasoning step and compute its entropy as the score (see the sketch after this review); this baseline would validate whether the classification training is indeed needed.
  - The claims about comparing against models that are 150× or 810× larger sound misleading, since the method still requires inference over a base LLM to extract features, so all the parameters of the base LLM are also activated in the process of generating a verification. These should be counted as well if the goal is to compare model sizes.
- There are also strong claims about the UHeads being general and plug-and-play, and about them "generalizing across tasks, languages and domains". These claims are unclear and unjustified. From the paper's description, these are classification models trained on top of self-supervision or even DeepSeek-R1 labels, so we need proper evidence to support these claims; otherwise, I would expect them to behave similarly to other adapter models in the literature.
- The Related Work section is very superficial. It provides little contextualization and does not contrast with other similar works. Recent work on uncertainty-aware step-wise verification is also missing, e.g., [1, 2].
- The paper does not report confidence intervals to assess the statistical significance of the results. In fact, the paper does not mention how many experimental seeds were used (I assume a single one). Prior literature has shown how sensitive math reasoning benchmarking is to small changes [3], so more statistical grounding is required to evaluate whether the reported takeaways are meaningful or just observation noise.
  - As an illustrative example, Figure 3 (left) is used as evidence of scaling improvements for the proposed method. The reported gap in performance is less than 1% accuracy (over Qwen2.5-Math-PRM-7B), which diminishes as N increases. There is no way to assess statistical significance here, yet the paper claims "consistently better results". The same lack of statistical rigor extends to all reported experiments, which makes it hard to evaluate the scientific claims.

Overall, I believe the paper requires a substantial rewrite of the methodological section to improve clarity about the proposed method. Some of the claims (as described above) need to be calibrated, and the experiments should report performance across different seeds with proper confidence intervals. The Related Work section should also be polished to better contextualize the paper within the literature and contrast it with similar methods.

References:
[1] Cao et al. More bang for the buck: Process reward modeling with entropy-driven uncertainty, 2025.
[2] Ye et al. Uncertainty-Aware Step-wise Verification with Generative Reward Models, 2025.
[3] Hochlehnert et al. A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility. COLM, 2025.

Questions:
N/A

EditLens Prediction: Fully human-written
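A minimal sketch of the training-free baseline requested in the second weakness above: scoring a step by the entropy of the LLM's re-normalized Yes/No distribution. The token ids, vocabulary size, and logit vector are hypothetical stand-ins; in practice they would come from the base LLM given a verification prompt such as "Is this step correct? Answer Yes or No."

```python
import math
import torch

YES_ID, NO_ID = 9642, 2822            # hypothetical tokenizer ids for "Yes"/"No"

logits = torch.randn(152_064)         # stand-in for one next-token logit vector
yes_no = torch.stack([logits[YES_ID], logits[NO_ID]])
p = torch.softmax(yes_no, dim=-1)     # re-normalize over just {Yes, No}

entropy = -(p * p.log()).sum().item() # in [0, ln 2]; high = uncertain verdict
score = 1.0 - entropy / math.log(2)   # optional: map to a [0, 1] confidence
print(f"P(Yes)={p[0]:.3f}  entropy={entropy:.3f}  confidence={score:.3f}")
```

If this no-training baseline matched the trained head, it would suggest the classification training adds little over the LLM's own verdict distribution, which is the comparison the reviewer asks for.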