ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 6.00 | 2.00 | 1896 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 6.00 | 3.00 | 4296 |
| Lightly AI-edited | 1 (25%) | 2.00 | 5.00 | 5353 |
| Fully human-written | 1 (25%) | 6.00 | 3.00 | 1794 |
| Total | 4 (100%) | 5.00 | 3.25 | 3335 |
Title: Principled Policy Optimization for LLMs via Self-Normalized Importance Sampling

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: This paper introduces Self-Normalized Importance Sampling with a Baseline (SNIB), a critic-free policy optimization algorithm for aligning LLMs. The authors observe that GRPO uses a theoretically unsound arithmetic mean of token-level importance ratios, which has high variance and scales poorly with sequence length, while GSPO uses a geometric mean of token ratios but introduces a non-vanishing bias that distorts the reward-KL trade-off. The authors show that SNIB is both stable and asymptotically correct. The key idea is to use the true sequence-level importance weight but to stabilize it with self-normalized importance sampling (see the sketch following this review). Experiments show that the proposed method outperforms GSPO on several math and code benchmarks and is more robust to reward noise.

Strengths: The paper clearly articulates the stability-vs-correctness dilemma at the heart of current critic-free RLHF. The analysis of why GRPO and GSPO are flawed provides strong motivation for the research. The ablation in Fig. 2 is reasonable and convincingly shows that SNIB effectively reduces the high variance of vanilla IS.

Weaknesses: The motivation for SNIB and its class of algorithms is to replace standard PPO, which requires a critic. However, a PPO-with-critic baseline is absent from all comparisons. It would be valuable to see whether SNIB matches the performance of standard PPO with a critic, or at least closes the gap where GSPO and GRPO do not. The paper claims that GSPO is biased while SNIB is principled and asymptotically unbiased. However, SNIB is also biased at any finite group size G; the claim is only that this bias is asymptotic and vanishes as O(1/G). PPO-style clipping, as the authors note, is a further source of bias, and one that does not vanish with group size.

Questions: Please see the weaknesses section.

EditLens Prediction: Fully human-written
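For concreteness, here is a minimal sketch of the idea summarized in the review above, assuming sequence-level log-probabilities as PyTorch tensors. This is an illustration of the general construction (sequence-level IS weight, self-normalization, mean-reward baseline), not the paper's implementation; the function name and signature are hypothetical, and clipping and the KL term are omitted.

```python
import torch

def snib_surrogate(logp_new, logp_old, rewards):
    """Illustrative sketch of a self-normalized, sequence-level surrogate
    with a mean-reward baseline (not the paper's exact objective).

    logp_new, logp_old: (G,) summed token log-probs of each sampled response
    under the current and behaviour policies; rewards: (G,) scalar rewards.
    """
    # True sequence-level IS weight = product of token ratios
    #                               = exp(sum of token log-ratios).
    w = torch.exp(logp_new - logp_old.detach())
    # Self-normalization: divide by the batch-average weight. This dampens
    # outlier weights at the cost of a bias that vanishes as O(1/G).
    w_norm = w / w.mean()
    # Mean batch reward as a baseline for variance reduction.
    adv = rewards - rewards.mean()
    # PPO-style surrogate; clipping and KL regularization omitted here.
    return -(w_norm * adv).mean()
```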
Title: Principled Policy Optimization for LLMs via Self-Normalized Importance Sampling

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: This paper identifies a critical dilemma in critic-free RLHF: existing methods are either theoretically unsound or biased. Group Relative Policy Optimization (GRPO) suffers from high variance by improperly mixing sequence-level advantages with token-level importance sampling (IS) ratios. Group Sequence Policy Optimization (GSPO) achieves stability by using a geometric mean of token ratios, but this is a biased estimator that optimizes a "perturbed objective" and distorts the crucial reward-KL trade-off. The authors propose SNIB (Self-Normalized Importance Sampling with a Baseline), a novel algorithm that resolves this dilemma. SNIB is both stable and asymptotically correct. It uses the theoretically correct sequence-level IS weight (the product of token ratios) and achieves stability by applying two principled techniques:
- Self-normalization: it normalizes each sample's IS weight by the average weight of the entire batch, which provably dampens outliers and reduces variance.
- Baseline: it uses the mean batch reward as a baseline to further reduce variance in the advantage estimates.
The paper provides strong theoretical backing, proving that SNIB's gradient estimator is consistent and asymptotically unbiased (with bias vanishing at O(1/G)). This principled design is shown to be more robust to reward model uncertainty and, unlike GSPO, preserves the principled KKT conditions of the constrained reward-KL optimization problem. Empirically, SNIB outperforms GRPO and GSPO on challenging mathematical reasoning and code generation benchmarks.

Strengths:
- Principled and Novel Solution: The proposed solution, SNIB, is elegant. It correctly insists on using the true sequence-level IS weight, and then intelligently applies self-normalization, a statistically grounded variance-reduction technique, to solve the exact stability problem that plagues naive IS (shown in Fig. 2).
- Strong Theoretical Guarantees: The method is built on a solid theoretical foundation. The paper proves that SNIB is asymptotically unbiased (Proposition 1) and that this unbiasedness preserves the underlying KKT conditions of the constrained optimization problem, which biased methods like GSPO do not (Proposition 2).

Weaknesses:
- Missing PPO Baseline: The entire motivation for critic-free methods is to replace the expensive, critic-based PPO. However, PPO is not included as a baseline in the main results (Table 2). This makes it impossible to assess the full picture. We can see that SNIB is better than other critic-free methods, but how much performance (if any) is sacrificed compared to PPO for the gain in efficiency?
- Limited Task Domain and Reward Type: The experiments are conducted exclusively on math and code generation tasks, using ground-truth correctness as the reward signal. While this is a very clean and sound way to test the algorithm, it does not demonstrate the method's performance in the more common (and noisy) RLHF setting with a learned reward model on subjective tasks like "helpfulness" or "harmlessness". The reward-noise experiment (Fig. 1a) is a good simulation, but not a substitute for a real-world test.
- Sensitivity to Group Size G: The theory states that SNIB's bias is on the order of O(1/G). The experiments use a group size of G=8, which seems small and may imply a non-trivial bias in practice. The paper does not include an ablation study on G, which is a key hyperparameter for both performance and efficiency.

Questions:
- Given that the primary motivation is to find an efficient alternative to critic-based PPO, why was PPO omitted from the main performance comparison in Table 2? A direct comparison is needed to understand the full performance-vs-efficiency trade-off.
- The experiments are on math/code tasks with ground-truth rewards. How do you expect SNIB's performance to translate to the more common RLHF setting using a noisy, learned reward model for a subjective task like "helpfulness"? Your analysis in Figure 1a is promising, but is additive Gaussian noise a sufficient proxy for the complex, correlated noise from a learned RM?
- The theoretical bias of SNIB is O(1/G), and a small group size of G=8 was used. Have you performed a sensitivity analysis on G? How do the performance and stability of SNIB change with a smaller (e.g., G=4) or larger (e.g., G=16) group size? (A toy illustration of the O(1/G) scaling follows this review.)

EditLens Prediction: Moderately AI-edited
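To illustrate the group-size question raised above, here is a toy numerical check of how the bias of a self-normalized IS estimate shrinks with the number of samples per group. The Gaussian target/proposal pair is an assumption for illustration only and is unrelated to the paper's models or rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

def snis_bias(G, trials=100_000):
    """Toy check (assumed Gaussian setup, not tied to the paper) of how the
    bias of a self-normalized IS estimate shrinks with group size G.

    Target p = N(0.5, 1), proposal q = N(0, 1); estimate E_p[x], which is 0.5.
    """
    x = rng.normal(0.0, 1.0, size=(trials, G))   # G samples per group from q
    log_w = 0.5 * x - 0.125                      # log p(x) - log q(x) for this pair
    w = np.exp(log_w)
    est = (w * x).sum(axis=1) / w.sum(axis=1)    # self-normalized estimate per group
    return est.mean() - 0.5                      # average bias over many groups

for G in (4, 8, 16, 32):
    print(f"G={G:2d}  bias={snis_bias(G):+.5f}")  # roughly halves as G doubles
```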
Title: Principled Policy Optimization for LLMs via Self-Normalized Importance Sampling

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: This paper proposes SNIB (Self-Normalized Importance Sampling with Baseline), a critic-free RLHF algorithm that unifies the stability of GSPO with the theoretical correctness of unbiased policy gradients. SNIB uses self-normalized importance sampling to reduce variance while remaining asymptotically unbiased. Theoretical analysis proves its consistency, robustness to reward noise, and preservation of the KL-reward trade-off. Experiments on math and coding reasoning benchmarks show moderate but consistent gains over GRPO and GSPO.

Strengths:
1. Clear theoretical contribution: a principled, asymptotically unbiased estimator for critic-free RLHF.
2. Well-presented mathematical analysis, including variance, convergence, and KKT proofs. The paper is well-written, well-structured, and effectively connects theory with practical implications for RLHF pipelines.
3. Empirical results demonstrate improved stability and robustness to reward noise.

Weaknesses:
1. Experiments are limited to math/coding tasks; generalization to dialogue or multimodal alignment is unclear.
2. Lack of comparison with more recent RLHF baselines.
3. Performance improvements are modest given the theoretical complexity. Although SNIB improves theoretical soundness, its empirical advantage over GSPO is relatively small (1-2% absolute on most benchmarks). The improvements, while consistent, may not justify the added conceptual and implementation complexity.
4. Key design components (self-normalization, baseline, stop-gradient) are not individually ablated, making it difficult to attribute improvements precisely. The paper would benefit from comparing fully differentiable vs. stop-gradient SNIS (the distinction is sketched after this review).

Questions:
1. Does the stop-gradient version compromise the theoretical unbiasedness claimed?
2. Can SNIB be integrated with existing GRPO/GSPO infrastructures in practice?
3. How does SNIB behave under severe reward-model bias rather than random noise?

EditLens Prediction: Fully AI-generated
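The stop-gradient question above can be made concrete with a short sketch. This is one plausible reading of the design choice under discussion, with a hypothetical function name and arguments; it is not taken from the paper's code.

```python
import torch

def self_normalized_weights(logp_new, logp_old, stop_grad_denom: bool = True):
    """One plausible reading (not the paper's code) of the stop-gradient
    design choice questioned above.

    logp_new, logp_old: (G,) summed sequence log-probs for a group of responses.
    """
    w = torch.exp(logp_new - logp_old)
    # With .detach(), the normalizing constant is treated as fixed, so gradients
    # flow only through the numerator; without it, the denominator also
    # contributes gradient terms, giving a different (still consistent) estimator.
    denom = w.mean().detach() if stop_grad_denom else w.mean()
    return w / denom
```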
Title: Principled Policy Optimization for LLMs via Self-Normalized Importance Sampling

Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary: This paper proposes Self-Normalized Importance Sampling with a Baseline (SNIB) to address the bias and high-variance issues that accumulate over long sequences when estimating the importance ratio $\pi_\theta(y|x)/\pi_{\text{old}}(y|x)$ in GRPO and GSPO. SNIB is consistent and asymptotically unbiased, ensuring convergence to the correct policy objective. Experimental results demonstrate that SNIB is more robust to adversarial reward perturbations and achieves a better reward-KL trade-off than prior methods.

Strengths:
- The theoretical analysis of self-normalized importance sampling is rigorous and well-justified.

Weaknesses:
- The paper is poorly written; many key contributions and analyses, particularly those analyzing the high-variance issues in prior methods, are either missing or insufficiently explained in the main text.
- Proposition 1, which shows that the SNIB estimator is consistent and asymptotically unbiased, is an important theoretical property. However, this result is not novel and is well known [1].
- Unrealistic reward model uncertainty experiment: the noisy-reward experiment is unrealistic, as the authors simply add random Gaussian noise $\epsilon\sim\mathcal N(0,\sigma^2)$ to the rewards during training. In the context of LLM alignment, such noise cannot capture the complexity of modeling uncertainty and context dependence in human preferences (see [2, 3]). Moreover, in mathematical reasoning tasks we typically have access to a ground-truth reward function (as also used in the paper), which can reliably provide learning signals for LLMs [4, 5]. Therefore, the results presented in Fig. 1 do not convincingly demonstrate SNIB's robustness under realistic scenarios.
- The main results in Table 2 show that GRPO achieves substantially better performance than SNIB on three out of four mathematical reasoning benchmarks, which raises questions about the effectiveness of SNIB.

Questions:
- Since SNIB remains a biased (but consistent) estimator, it is important to analyze the bias-variance trade-off and compare the performance of SNIB, GRPO, and GSPO as the number of responses per prompt increases.
- The variance analysis in the paper focuses primarily on the importance weights. It is also necessary to evaluate SNIB's ability to stabilize training by measuring gradient variance as sequence length increases, compared to GRPO and GSPO. Without length normalization, the gradient variance can grow proportionally with response length [9].
- From my experience, the normalized weights can be computed using a softmax operation: $\text{Softmax}(\{\log\pi_\theta(y_i|x)-\log\pi_{\text{old}}(y_i|x)\}_{i=1}^G)$ (see the sketch following this review). However, the exponential in the softmax can amplify discrepancies between responses in a group, allowing longer sequences to dominate the learning signal due to their higher variance. Could this lead to a long-sequence bias, where SNIB favors longer responses? If so, is this effect detrimental or beneficial for exploration? It would be valuable to visualize the distribution and entropy of the normalized importance weights $w$ to better understand this phenomenon and to help tune the clipping parameter $\epsilon$. Additionally, does SNIB tend to produce longer sequences or higher-entropy outputs (more explorative) compared to GRPO and GSPO?
- While GSPO indeed suffers from variance accumulation over sequence length, GRPO, when formulated under a token-level MDP, employs *token-level importance sampling* with per-token clipping, which effectively mitigates high variance at each timestep. The paper should contrast SNIB with GRPO, explicitly showing failure modes of token-level importance sampling and explaining why GRPO achieves better empirical performance than SNIB.
- The paper claims that SNIB provides a more predictable and principled trade-off between reward maximization and KL regularization compared to other estimators. However, in Fig. 1(b) all three estimators exhibit a similar trend in which increasing $\beta$ lowers the average reward. The authors should clarify what makes SNIB "more predictable" than GRPO and GSPO given this observation.
- KL regularization is commonly used to prevent the model from deviating too far from the initial policy distribution, thereby reducing reward hacking [6, 7]. However, in mathematical reasoning, where a ground-truth reward exists, KL regularization can actually hinder learning [5]. Could SNIB's predictability help identify an optimal $\beta$ when we only have access to a proxy reward for training, without access to a golden reward function (as in the reward-hacking setting of [6, 8])?

References:
[1] Monte Carlo Theory, Methods and Examples. Art Owen.
[2] Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF. ICLR 2024.
[3] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. ICML 2024 (Oral).
[4] DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning. Nature 2025.
[5] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. NeurIPS 2025.
[6] Scaling Laws for Reward Model Overoptimization. ICML 2023.
[7] Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.
[8] Understanding the Performance Gap between Online and Offline Alignment Algorithms. CoRR 2024.
[9] VL Norm: Rethink Loss Aggregation in RLVR. arXiv:2509.07558.

EditLens Prediction: Lightly AI-edited
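The softmax formulation in the question above corresponds to the following log-space computation (an illustration; the paper's exact implementation may differ). Multiplying the softmax output by G turns weights that sum to one into weights that average to one, i.e. $w_i / \frac{1}{G}\sum_j w_j$, without ever exponentiating raw sequence-level ratios.

```python
import torch
import torch.nn.functional as F

def normalized_weights_logspace(logp_new, logp_old):
    """Log-space computation of self-normalized sequence weights
    (illustrative; not necessarily the paper's implementation).

    softmax over summed log-ratios equals w_i / sum_j w_j; multiplying by the
    group size G recovers w_i / mean_j(w_j).
    """
    log_ratio = logp_new - logp_old          # (G,) summed token log-ratios
    G = log_ratio.shape[0]
    # Longer responses accumulate larger |log_ratio|, so after the softmax a
    # single long sequence can dominate the group, as the question notes.
    return G * F.softmax(log_ratio, dim=0)
```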