ICLR 2026 - Reviews

Reviews

Summary Statistics

EditLens Prediction     Count      Avg Rating   Avg Confidence   Avg Length (chars)
Fully AI-generated      1 (25%)    2.00         3.00             4421
Heavily AI-edited       0 (0%)     N/A          N/A              N/A
Moderately AI-edited    0 (0%)     N/A          N/A              N/A
Lightly AI-edited       3 (75%)    4.00         4.00             3287
Fully human-written     0 (0%)     N/A          N/A              N/A
Total                   4 (100%)   3.50         3.75             3570
Review 1: Intra-Trajectory Consistency for Reward Modeling

Ratings: Soundness: 1 (poor). Presentation: 1 (poor). Contribution: 2 (fair). Rating: 2 (reject). Confidence: 4 (You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.)

Summary: This paper introduces an intra-trajectory consistency regularization method that refines reward models by propagating coarse-grained, response-level supervision into fine-grained learning signals using a Bayesian-inspired principle. The approach improves performance on RewardBench and enhances DPO-aligned policies.

Strengths:
1. The theoretical analysis effectively supports the arguments.
2. Introducing token-wise information during the reward model training phase is innovative.

Weaknesses:
1. The additional computational overhead introduced during training is non-negligible. Training a 2B reward model alongside a 2B generator should be compared, via an ablation, against training a standalone 3B–4B reward model for a fairer evaluation.
2. Generator mismatch is a common issue. During RLHF, the reward model may be significantly smaller than the actor model, and the actor's distribution can shift considerably as training progresses. Although the authors report related experiments in the Appendix, I believe these are insufficient and that more ablation studies are needed across different model sizes, model series, and training steps.
3. The baselines are overly simplistic. Many methods exist for enhancing reward models, and GRM is already over a year old. I suggest including more baselines for comparison.

Questions:
1. Why do larger training sample sizes lead to worse performance in Table 1 and Table 2? I also checked the original GRM paper, and the numbers reported there differ from those in your paper. Is there an explanation for this discrepancy?
2. Why use DPO? The main purpose of a BT reward model is to address reward modeling for PPO. For algorithms like DPO, which only require preference information, training directly on the original preference dataset would be more straightforward; introducing an intermediate reward model seems redundant and may add noise, which I find confusing. I recommend adding experiments with PPO. Moreover, your baseline GRM has been evaluated on PPO and BoN, not DPO.

EditLens Prediction: Lightly AI-edited
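For reference on the reviewer's second question, the two pairwise objectives at issue are the standard Bradley-Terry (BT) reward-modeling loss and the DPO loss; both consume the same preference triples (x, y_w, y_l). These are the textbook formulations, not equations taken from the submission:

  $\mathcal{L}_{\mathrm{BT}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]$

  $\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]$

Since DPO optimizes the policy directly from the preference pairs, the reviewer's concern is that training an intermediate BT reward model and then deriving DPO data from it adds a stage that DPO was designed to avoid.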
Review 2: Intra-Trajectory Consistency for Reward Modeling

Ratings: Soundness: 2 (fair). Presentation: 3 (good). Contribution: 2 (fair). Rating: 4 (marginally below the acceptance threshold). Confidence: 4 (You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.)

Summary: This paper addresses a key limitation in traditional reward modeling: the reliance on coarse, response-level preference labels. This hinders the model's ability to identify specific high-quality segments within a response and often leads to poor generalization. To mitigate this, the authors introduce Intra-Trajectory Consistency Regularization (ICRM), a method designed to propagate response-level supervision to a more fine-grained, process level. The core mechanism enforces consistency between the reward values of adjacent generation steps, weighted by the next-token generation probability from a separate, frozen generator model. This encourages the reward model to learn smoother and more meaningful reward landscapes without the high cost of manual, process-level annotations. The empirical validation is comprehensive: ICRM achieves statistically significant improvements on RewardBench, and these gains translate into superior downstream performance, including guiding DPO policy optimization and improving selection accuracy in Best-of-N inference-time verification. However, discussions of reward modeling in this context typically center on PPO-like or GRPO-like RL methods, so the paper would be strengthened by more extensive results in that setting. Furthermore, a key motivation for DPO is its relative simplicity compared to traditional RL-based methods; this paper instead trains a separate reward model to generate improved preference data, arguably re-introducing complexity and overhead. To improve the paper's soundness, the authors should provide direct comparisons with established RL-based methods such as PPO, GRPO, and RLVR, covering both results and time and resource costs. In summary, while the proposed method is interesting, the paper requires significant revision to address these limitations before it can be considered for acceptance at ICLR 2026.

Strengths: The primary strength of this paper is its novel, intuitive, and highly practical regularization method. By linking reward consistency to generator probabilities, it offers a way to inject fine-grained learning signals from coarse-grained data, a significant practical advantage over methods that rely on labor-intensive, step-wise human annotations. This core contribution is supported by rigorous experimental evaluation: the authors convincingly demonstrate that improvements on a standard benchmark like RewardBench are not merely superficial but yield tangible benefits in critical downstream tasks, such as RLHF and inference-time verification, with results further corroborated by human evaluations. Moreover, the authors provide in-depth analysis, including extensive ablation studies that substantiate the design choices of the proposed loss function, an investigation of length bias, and robustness checks against generator mismatch. The experimental results demonstrate the method's effectiveness.

Weaknesses: Despite its strengths, the paper has several weaknesses that should be addressed:
1) The evaluation lacks sufficient comparison to RL-based methods. Discussions of reward modeling usually concern PPO-like or GRPO-like algorithms, so the paper would be strengthened by comparing ICRM-enhanced DPO against these methods on more benchmark datasets. Furthermore, a key motivation for DPO is its simplicity relative to complex RL-based pipelines; the proposed method re-introduces a separate, trained reward model, which adds complexity and overhead. To establish the paper's soundness, a direct comparison with methods like PPO, GRPO, and RLVR-based approaches is essential, evaluating not only downstream performance but also the associated time and resource costs.
2) The theoretical motivation in Section 3.1 is presented as a formal derivation from a Bayesian framework, yet it appears overstated. The step equating a scalar reward value with a conditional probability is a significant conceptual leap rather than a mathematically rigorous step, which undermines the credibility of the stated theoretical foundation. The work would be more defensible if this section were reframed as providing intuition and motivation rather than a formal proof.
3) The methodology introduces several complex mechanisms, such as the mean-centered calibration technique and the mutually weighted binary cross-entropy loss, without adequate justification in the main text. The authors do not sufficiently explain why these formulations were chosen over simpler alternatives, such as a standard L1 or L2 loss. While these justifications are provided in the appendix, their absence from the core methodology section makes the design feel arbitrary and less compelling. Integrating these design rationales into the main paper is recommended.
4) The introduction and related work section are insufficient. The authors should more explicitly differentiate their work from process-supervised models like PRMs, emphasizing the primary advantage of achieving fine-grained supervision without requiring fine-grained labels. A sharper distinction should also be drawn between ICRM, which regularizes the final reward values based on generation dynamics, and other methods (e.g., GRM) that regularize the model's hidden states.

Questions: See weaknesses.

EditLens Prediction: Lightly AI-edited
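To make the mechanism under discussion concrete, below is a minimal sketch of an intra-trajectory consistency regularizer of the kind described above, written as a simple probability-weighted L2 penalty between rewards of adjacent prefixes. This is an illustrative simplification only: the submission reportedly uses a mean-centered calibration and a mutually weighted binary cross-entropy loss rather than the L2 form shown here, and all function and variable names are hypothetical.

# Minimal sketch (illustrative, not the paper's exact loss): probability-weighted
# L2 consistency between rewards of adjacent prefixes within one response.
import torch

def intra_trajectory_consistency(prefix_rewards: torch.Tensor,
                                 next_token_probs: torch.Tensor) -> torch.Tensor:
    # prefix_rewards:   shape (T,), reward-model scores for prefixes y_{<=1} ... y_{<=T}
    # next_token_probs: shape (T-1,), frozen generator's probability of token t+1 given prefix t
    diffs = prefix_rewards[1:] - prefix_rewards[:-1]   # reward change across each transition
    weights = next_token_probs.detach()                # generator is frozen, no gradient flows to it
    return (weights * diffs.pow(2)).mean()

# Hypothetical use alongside the standard pairwise (BT) loss:
# total_loss = bt_loss + lambda_consistency * intra_trajectory_consistency(r_prefix, p_next)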
Review 3: Intra-Trajectory Consistency for Reward Modeling

Ratings: Soundness: 3 (good). Presentation: 2 (fair). Contribution: 3 (good). Rating: 6 (marginally above the acceptance threshold). Confidence: 4 (You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.)

Summary: Reward model training usually relies on coarse, response-level supervision, which can miss which parts of a trajectory actually drive the final score and can overfit to spurious cues (e.g., length). The paper proposes propagating this coarse signal to intermediate steps by adding a regularizer so that, within the same response, adjacent prefixes receive more similar rewards when the next-token probability is higher.

Strengths:
- It supplements the standard BT outcome loss; no extra labels are needed, just next-token probabilities from a (frozen) generator.
- With only response-level labels, ICRM approaches process-supervised PRMs and even boosts a process reward model when combined with it.
- Improvements hold across reward model benchmarks, RLHF policies, and inference-time verification, and extend to code generation.

Minor weaknesses:
- Table 2, reasoning section: the bold is placed incorrectly (Classifier + label smooth shows higher performance).
- Figure 2: the two methods are not distinguishable; different colors would make the comparison clearer.
- Figure 3: the heatmap is missing its color bar, and a denser color scale would show the differences more clearly.

Weaknesses:
- You weight consistency by the model's next-token probability, but the generator is not guaranteed to be calibrated. How sensitive is the method to miscalibration?
- Why choose the sampled token's probability instead of distributional uncertainty metrics such as entropy or a margin score?
- Because the LM is conditioned on the prefix, next-token probabilities can vary with token position in the text. Do you observe any position-dependent trends in your regularization term?
- Your tokenizer is BPE, so tokens do not necessarily align with words; however, in the Figure 3 example the terms appear to be split exactly along word boundaries. Did you distinguish within-word subtoken transitions from cross-word transitions when applying consistency?
- If you randomly choose adjacent tokens, do you still see gains? This would isolate how much of the effect comes from probability-based selection itself versus smoothing any adjacent pair.

Questions: See above.

EditLens Prediction: Lightly AI-edited
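As a concrete framing of the second bullet above, the sketch below (hypothetical names, not from the submission) shows how the sampled-token probability reportedly used by the paper compares, computationally, with the entropy- and margin-based alternatives the reviewer suggests; all three can be read off the frozen generator's logits at essentially no extra cost.

# Hypothetical comparison of per-transition confidence weights (not from the submission).
import torch
import torch.nn.functional as F

def candidate_weights(logits: torch.Tensor, sampled_ids: torch.Tensor):
    # logits:      (T-1, V) frozen-generator logits at each prefix position
    # sampled_ids: (T-1,)   ids of the tokens actually generated next (int64)
    probs = F.softmax(logits, dim=-1)                                    # (T-1, V)
    p_sampled = probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)  # sampled token's probability
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)            # lower = more confident
    top2 = probs.topk(2, dim=-1).values
    margin = top2[:, 0] - top2[:, 1]                                     # gap between top-1 and top-2
    return p_sampled, entropy, margin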
Review 4: Intra-Trajectory Consistency for Reward Modeling

Ratings: Soundness: 2 (fair). Presentation: 2 (fair). Contribution: 2 (fair). Rating: 2 (reject). Confidence: 3 (You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.)

Summary: The paper proposes an Intra-Trajectory Consistency Regularization (ICRM) method for reward modeling, aiming to improve credit assignment along reasoning trajectories without requiring explicit step-level labels. The key idea is to enforce consistency between neighboring reasoning steps by introducing a probabilistic regularization term, which supposedly yields more coherent and fine-grained process rewards. The authors claim that ICRM outperforms generalized reward models (GRMs) and enhances downstream reasoning tasks such as best-of-N reranking and RLHF-style fine-tuning.

Strengths:
1. The work addresses a meaningful problem in reward modeling, namely how to approximate process-level supervision from outcome-level signals, which is timely and important for reasoning-heavy LLM tasks.
2. The idea of linking neighboring steps via a Bayesian decomposition is conceptually interesting and could potentially bridge ORM and PRM methods.
3. The authors evaluate across multiple tasks and models, using RewardBench and RLHF-style setups.
4. The paper provides ablations and partial visualizations (heatmaps, Section D) that help interpret the proposed regularization.

Weaknesses:
1. Main claim not empirically supported: The paper claims that ICRM improves credit assignment along trajectories, but the presented heatmaps primarily reflect accumulative error effects (probabilities naturally decay toward later tokens in autoregressive models). Without disentangling this from the regularizer's effect, it is unclear whether credit assignment actually improved.
2. Visualization confounded by probability decay: Figure 3's token-level heatmap shows reward decline mainly in later segments, which likely results from accumulated log-probability decay rather than effective intra-step credit redistribution. No analysis isolates the effect of the regularizer from this autoregressive bias.
3. Selective reporting and overstatements: Claims such as "ICRM surpasses all GRMs" are overstated. The tables show categories where GRM still performs better (e.g., the Chat domain); the improvements hold on average, not universally.
4. Lack of statistical rigor: Standard deviations and multiple random seeds are missing for most key metrics. RLHF evaluations rely solely on a single automatic judge (QRM-Llama3.1) rather than human ratings, making the results less reliable.
5. Credit signal concentrated at sequence end: Section D.10 shows that ICRM primarily improves error detection in later parts of the trajectory, with little gain in early steps. This suggests the method strengthens terminal penalties rather than distributing credit more effectively throughout the process.
6. No control for probability influence: Since the regularization explicitly uses next-token probability as its weighting factor, the correlation between probability and reward consistency should be verified empirically. The authors provide no statistical analysis (e.g., correlation plots or regression residuals) to support the assumed relationship.
7. Weak comparison baselines: The paper only compares against GRMs, but recent implicit PRM and DPO-Q works (e.g., Yuan et al., 2024; Rafailov et al., 2024) already achieve process-level supervision from outcome labels. Without including those baselines, it is unclear whether ICRM offers a genuine advantage.
8. Interpretation inconsistencies: The authors attribute posterior reward decay to "better credit distribution," yet the data could equally reflect probability mass shrinkage or end-of-sequence effects. This ambiguity is never addressed.

Questions:
1. Can you provide an analysis separating the influence of next-token probability from the effect of consistency regularization? For instance, show residual rewards after regressing out token probabilities.
2. How does ICRM perform if the regularizer is disabled but probabilities are still normalized? Does the improvement persist?
3. Why are no standard deviations or multiple random seeds reported for the main tables?
4. Have you compared ICRM directly against implicit PRM or DPO-based Q-value approximations, which also derive process-level feedback from outcome labels?
5. Can you visualize cases with early or mid-sequence errors, rather than only late errors, to test whether ICRM truly improves step-level credit assignment?
6. Does the observed performance improvement hold when the generator and reward model come from different distributions or architectures?

EditLens Prediction: Fully AI-generated
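The residual analysis requested in Question 1 could look something like the sketch below (hypothetical names and a plain least-squares fit; any regression or correlation measure would serve): regress each prefix reward on the generator's cumulative log-probability and inspect what structure remains.

# Hypothetical diagnostic: regress per-prefix rewards on cumulative log-probability
# and inspect residuals. If the reward decay largely vanishes in the residuals, the
# heatmap pattern is plausibly explained by autoregressive probability decay rather
# than improved credit assignment.
import numpy as np

def reward_residuals(prefix_rewards: np.ndarray, cum_logprobs: np.ndarray) -> np.ndarray:
    # Fit r_t ~ a * cumlogp_t + b by least squares, return r_t minus the fitted values.
    X = np.stack([cum_logprobs, np.ones_like(cum_logprobs)], axis=1)  # (T, 2) design matrix
    coef, *_ = np.linalg.lstsq(X, prefix_rewards, rcond=None)
    return prefix_rewards - X @ coef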