ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
| --- | --- | --- | --- | --- |
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 6.00 | 4.00 | 5196 |
| Fully human-written | 3 (75%) | 0.67 | 4.00 | 1944 |
| Total | 4 (100%) | 2.00 | 4.00 | 2757 |
Title: Are complicated loss functions necessary for teaching LLMs to reason?

Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper studies the role of different components of the GRPO (Group Relative Policy Optimization) loss in improving reasoning in LLMs. The authors break the loss down into its main components (negative feedback, PPO-style clipping, and advantage estimation) and test simplified versions such as positive-only GRPO, their proposed REINFORCE with Group Relative Advantage (RGRA), and direct REINFORCE. Their experiments on small models (Qwen2.5-0.5B/1.5B and Llama3.2-1B) show that removing negative feedback leads to collapse and reward stagnation, while PPO clipping can be dropped without hurting performance.

Strengths:
* Systematic studies like the one this paper conducts are generally important for the community, especially for understanding RL post-training for LLMs.
* The authors test on two different model families and take care to evaluate on a comprehensive set of benchmarks split across Chinese/English and math/other subject domains.

Weaknesses:
* The model scale and setting (<=1.5B-parameter models with LoRA fine-tuning) are limited, and it is unclear whether the findings extrapolate to larger model scales and full fine-tuning.
* In particular, prior work [1] seems to show a different result: positive-only reinforcement can be competitive with GRPO/PPO provided verifiable rewards are used and poor prompts are filtered. The findings from Xiong et al. are from larger models (7B-70B), which underscores the potential limitations of the model size and setting studied in this work.
* The reported results lack confidence intervals, which seem necessary to draw strong conclusions like the ones made in this work (in particular, how significant is the performance delta between RGRA and GRPO?). I'm sympathetic to the authors' limited compute constraining their training setup, but multiple seeds and further performance analysis (e.g., pass/majority@k) would strengthen their results.

While the paper's goal of simplifying GRPO is well motivated, the evidence feels too narrow and limited to support its strong claims about the necessity of negative feedback.

Minor: Some areas in the manuscript need `\citep` (Lines 169 and 263, to name a few).

[1] Xiong, Wei, et al. "A minimalist approach to LLM reasoning: from rejection sampling to reinforce." arXiv preprint arXiv:2504.11343 (2025).

Questions:
* The training dataset appears quite small (around 1.8k examples). Could the authors clarify why this size was chosen, and whether they observed any sensitivity to data scale?
* In the positive-only advantage setup, do the authors ensure that each batch contains enough positively rewarded samples for stable gradient estimation?
* In Figure 1(a), both REINFORCE and RAFT collapse only for the Qwen 0.5B model. Do you have an explanation for why this smaller model is unstable compared to the 1.5B and 1B variants? Could this have been mitigated with a cold-start stage?
* Regarding clipping, what $\epsilon$ value was used for the GRPO runs, and roughly what fraction of updates were actually clipped during training?
* The authors' results seem to differ from Xiong et al., who find that positive-only RAFT remains competitive with GRPO. Could you comment on the key differences in setup (e.g., model scale, filtering, or reward structure) or clarify the discrepancy?

EditLens Prediction: Fully human-written
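For reference, the objectives discussed in this review can be written schematically as follows; this uses standard GRPO notation and may differ in details from the paper's exact formulation. With $G$ rollouts $o_1, \dots, o_G$ sampled for a prompt $q$, rewards $R_i$, and group-relative advantage $\hat{A}_i = (R_i - \mathrm{mean}(R_1,\dots,R_G)) / \mathrm{std}(R_1,\dots,R_G)$, GRPO maximizes

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \qquad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})}.$$

The variants compared above then differ only in which pieces they keep: positive-only GRPO replaces $\hat{A}_i$ with $\max(\hat{A}_i, 0)$; RGRA, as described, drops the importance ratio and clipping, giving a REINFORCE-style update $\nabla_\theta \mathcal{J} \propto \sum_i \hat{A}_i \,\nabla_\theta \log \pi_\theta(o_i \mid q)$; and direct REINFORCE further substitutes the raw reward $R_i$ for $\hat{A}_i$.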
Title: Are complicated loss functions necessary for teaching LLMs to reason?

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 0
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper does an ablation over the components of the GRPO loss function, namely advantage clipping, negative examples, and KL regularization.

This paper studies an important question in RL post-training, namely which components of the loss function are required to get the models to perform well. Based on their findings, the authors propose RGRA for LLM post-training.

There are several major weaknesses with this paper. To begin, the paper is framed as an ablation over the main components of the GRPO loss, yet several key components are missing from this ablation:
- As far as I understand, the authors do not sweep over the hyperparameters of any of the baselines they run. Critically, for an ablation over components of GRPO, they do not sweep over the number of rollouts, nor over the number of off-policy steps taken by the algorithm (i.e., multiple gradient steps over a given batch of rollouts).
- The authors start from pretrained models, which can confound the results. Namely, [1] shows that the KL regularizer can have different effects depending on the pretraining data.
- The baselines are quite trivial, especially with respect to the length of the chain of thought required to solve the prompts. In general, the terms in the loss start to matter for long chains or for a large number of off-policy steps taken by the algorithm.

Based on the above comments, I believe proposing RGRA is not currently backed by the empirical results.

[1] Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining

No questions.

EditLens Prediction: Fully human-written
Title: Are complicated loss functions necessary for teaching LLMs to reason?

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper investigates the necessity of complex loss functions, specifically Group Relative Policy Optimization (GRPO), for enhancing the reasoning capabilities of Large Language Models (LLMs). The authors conduct a systematic analysis of GRPO, an algorithm that combines group-relative advantage estimation, PPO-style clipping, and KL regularization. The paper identifies two key findings:
* Negative feedback is essential. Training solely on actions that outperform a baseline (i.e., positive-only advantages) or using simpler rejection sampling (RAFT) leads to training instability, performance collapse, and a failure to elicit reasoning behaviors.
* PPO-style constraints are unnecessary. The analysis demonstrates that PPO-style components, such as policy ratio clipping, are not required to improve mathematical reasoning performance or maintain training stability.

Experiments across standard mathematical benchmarks indicate that RGRA achieves stable training dynamics and stronger performance than the more complex GRPO, surpassing it in 17 out of 27 task comparisons.

Strengths: This paper's originality is strong. The authors challenge the assumed necessity of all components within the successful GRPO framework. This "less is more" approach, which rigorously questions the utility of established components like PPO-style clipping, represents an original and valuable methodological contribution. The claims are substantiated by a comprehensive and robust body of empirical evidence, including extensive quantitative benchmarks across multiple model families and languages (Tables 1-3), crucial analysis of training dynamics and stability (Figure 1), and insightful qualitative analysis of emergent reasoning behaviors (Figure 2). The experimental setup is sound and provides convincing validation for the paper's conclusions.

Weaknesses: The paper's primary weakness lies in its significant overgeneralization of claims from a narrow and limited experimental setup. The central conclusion that PPO-style constraints are "unnecessary" for teaching reasoning is drawn exclusively from experiments on small-scale models, ranging from 0.5B to 1.5B parameters. PPO's clipping mechanism was precisely designed to ensure stability during large, high-variance policy updates, which are a far greater concern in state-of-the-art 70B+ models. The paper provides no evidence that its findings would hold in a large-scale setting, and thus fails to adequately support its ambitious and broad claims.

Furthermore, the dismissal of baseline methods like RAFT and positive-only GRPO as inherently unstable is unconvincing. Their catastrophic collapse (shown in Figure 1) is observed on a minuscule training dataset of only 1,800 instances, which is highly susceptible to reward hacking. More importantly, the paper fails to provide evidence of rigorous hyperparameter tuning for these baselines. Their collapse could simply be an artifact of a poorly chosen learning rate or insufficient KL regularization, rather than a fundamental flaw in the methods themselves. Without a proper hyperparameter sweep to find the most stable configuration for these baselines, the paper's conclusion is not fully substantiated.

Questions:
* The paper's central claim that PPO-style constraints are "unnecessary" is derived from experiments on relatively small-scale models (0.5B to 1.5B). Given that PPO's clipping mechanism was specifically designed to ensure stability for large-scale policies with high-variance updates, what justification or evidence can you provide that this finding will generalize to the 70B+ or 100B+ models where such stability constraints are traditionally considered critical?
* The experiments are confined entirely to mathematical reasoning, a domain characterized by sparse and verifiable binary reward signals (i.e., correct/incorrect). How do you anticipate the stability of RGRA (which lacks clipping) will hold up in standard alignment scenarios (e.g., helpfulness, safety) that rely on dense, non-stationary, and often noisy rewards from learned preference models?
* The paper concludes that methods ignoring negative feedback (RAFT, GRPO-pos) are fundamentally unstable, citing their rapid collapse on a small 1,800-instance training set. Could you elaborate on the extent of the hyperparameter search (e.g., learning rate) conducted for these baselines? How can you be certain this collapse is an inherent flaw of the methods, rather than an artifact of sub-optimal tuning or a simple case of reward hacking on a dataset small enough to be easily exploited?
* The results suggest RGRA does not just match, but often outperforms, GRPO. Since the key difference is the removal of the PPO clipping and policy ratio terms, what is the mechanism for this performance improvement?
* The paper shows that standard REINFORCE with direct rewards collapses, even on the 1.5B model, which you use to underscore the necessity of advantage estimation. Could you clarify the tuning process for this specific baseline? Does this result definitively prove that any REINFORCE-style method without advantage estimation is doomed to fail in this setting, or could this collapse also be sensitive to hyperparameter choices?

EditLens Prediction: Lightly AI-edited
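To make the clipping question above concrete, the sketch below contrasts a GRPO-style clipped update with the RGRA-style unclipped one as the reviews describe them. This is an illustrative sketch only, assuming a PyTorch API with invented tensor names and shapes; it is not the paper's implementation.

```python
# Illustrative sketch (assumed names/shapes): one gradient step's loss for the update
# rules the reviews contrast. `logp` and `logp_old` are per-token log-probabilities of
# shape (G, T) for G sampled completions of one prompt; `rewards` has shape (G,).
import torch

def group_relative_advantage(rewards: torch.Tensor, positive_only: bool = False) -> torch.Tensor:
    """Normalize rewards within the group of G rollouts for the same prompt."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    if positive_only:
        # The positive-only ablation: negative feedback is discarded entirely.
        adv = adv.clamp(min=0.0)
    return adv

def policy_loss(logp: torch.Tensor, logp_old: torch.Tensor, adv: torch.Tensor,
                variant: str = "grpo", eps: float = 0.2) -> torch.Tensor:
    """Token-averaged surrogate loss for the variants discussed in the reviews."""
    adv = adv.unsqueeze(-1)  # broadcast one advantage per completion across its tokens
    if variant == "grpo":
        # PPO-style clipped surrogate with an importance ratio against the rollout policy.
        ratio = torch.exp(logp - logp_old.detach())
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        return -torch.min(ratio * adv, clipped * adv).mean()
    if variant == "rgra":
        # RGRA as the reviews describe it: plain REINFORCE on the group-relative
        # advantage, with no importance ratio and no clipping.
        return -(adv * logp).mean()
    raise ValueError(f"unknown variant: {variant}")
```

Note that when training is fully on-policy (a single gradient step per batch of rollouts), the ratio equals 1 and clipping is inactive, so the two variants yield the same gradient; this is why the reviewers' questions about the fraction of clipped updates and the number of off-policy steps bear directly on how informative the comparison is.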
Title: Are complicated loss functions necessary for teaching LLMs to reason?

Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 0
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary: The paper looks into GRPO training for reasoning models. It tests three variants of GRPO:
1. positive-only reward
2. GRPO without importance sampling
3. naive REINFORCE

Strengths: N/A

Weaknesses:
* There is a severe lack of novelty in the paper. The proposed RGRA is essentially GRPO without importance sampling.
* The paper is poorly composed and the results are not well organized. Figure 1 occupies an entire page without any accompanying analysis in the caption. Tables 1 and 2 are also poorly formatted, lacking proper bolding and explanations for abbreviations. From this standpoint alone, the paper feels far from complete.
* There is almost no discussion of the differences between RGRA and GRPO. Why does removing importance sampling and the clipping objective lead to better training?
* There is no comparison with related works at all (e.g., Dr. GRPO, DAPO).
* Overall, the paper lacks a clear motivation, shows limited novelty, and provides insufficient analysis of the results. Major revisions are needed.

EditLens Prediction: Fully human-written