ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated | 1 (25%) | 4.00 | 4.00 | 2964
Heavily AI-edited | 0 (0%) | N/A | N/A | N/A
Moderately AI-edited | 0 (0%) | N/A | N/A | N/A
Lightly AI-edited | 1 (25%) | 2.00 | 4.00 | 1542
Fully human-written | 2 (50%) | 4.00 | 3.00 | 1714
Total | 4 (100%) | 3.50 | 3.50 | 1984
Title Ratings Review Text EditLens Prediction
Title: SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

Ratings: Soundness: 2: fair | Presentation: 3: good | Contribution: 4: excellent | Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes SIRI, a training regime for Large Reasoning Models (LRMs) that aims to resolve the performance-efficiency trade-off. The core idea is to iteratively alternate between "compressing" and "expanding" the model's reasoning budget by dynamically adjusting the maximum rollout length $L$ during reinforcement learning. The authors use a "length scheduler" (e.g., a cosine scheduler) to manage this alternation.

Strengths:
The empirical results on benchmarks like AIME24 are strong, achieving high accuracy with reduced token counts.

Weaknesses:
- Lack of Theoretical Guarantee or Novel Algorithm:
  - Entirely Heuristic: The method is motivated by a hypothesis based on a visual inspection of a previous paper's training curve. There is no formal analysis of why this compression-expansion cycle is beneficial.
  - No New Algorithm: The paper does not introduce a new loss function or make any fundamental modifications to the RL algorithm. It builds on existing methods (GRPO/DAPO) and primarily proposes a novel scheduling strategy for one of the training hyperparameters, the maximum length $L$. The proposed schedule resembles a cosine learning rate schedule.
- Weak Post-Hoc Analysis: The analysis provided is purely observational. For example, the entropy analysis in Appendix A.2 is interesting, but it merely observes that entropy oscillates. Worse, it even notes that the baseline model also shows periodic entropy fluctuations, which undermines the claim that this oscillation is the unique driving factor behind SIRI's success.

Questions:
Please see the weaknesses.

EditLens Prediction: Lightly AI-edited
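
For concreteness, the cyclic length budget described in the summary above could look like the minimal sketch below: a cosine scheduler that oscillates the maximum rollout length $L$ between an expansion ceiling and a compression floor. The function name, the default values, and the phase convention (starting in an expansion phase) are illustrative assumptions rather than details taken from the paper.

```python
import math

def rollout_length(step: int, l_min: int = 2048, l_max: int = 8192,
                   period: int = 200) -> int:
    """Cosine schedule for the maximum rollout length L.

    The budget starts at l_max (expansion), falls to l_min (compression)
    at the midpoint of each cycle, and returns to l_max at the end.
    """
    phase = 2.0 * math.pi * (step % period) / period
    # cos(phase) is +1 at the start/end of a cycle and -1 at the midpoint.
    return int(l_min + 0.5 * (l_max - l_min) * (1.0 + math.cos(phase)))

if __name__ == "__main__":
    # One full expansion-compression-expansion cycle of the assumed schedule.
    for t in range(0, 201, 50):
        print(t, rollout_length(t))  # 8192, 5120, 2048, 5120, 8192
```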

Title: SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

Ratings: Soundness: 3: good | Presentation: 2: fair | Contribution: 2: fair | Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces SIRI, a training method for Large Reasoning Models. The core idea is to improve both reasoning accuracy and token efficiency through an iterative process. During training, the model alternates between a compression phase with a short generation limit and an expansion phase with a longer limit. The compression phase forces the model to generate more concise and dense reasoning, while the expansion phase allows for exploration and planning. The authors show empirically that this cyclical training pushes the model towards better performance and efficiency, outperforming existing methods on mathematical reasoning benchmarks.

Strengths:
1. The proposed method is simple, intuitive, and easy to implement. It introduces a novel training curriculum by dynamically adjusting the maximum generation length using a scheduler.
2. The paper provides strong empirical evidence to support its claims. The experiments are conducted on two different model sizes and evaluated on multiple standard mathematical reasoning benchmarks. The results in Table 1 clearly demonstrate that SIRI improves both accuracy and token efficiency, surpassing strong baselines.
3. The paper presents a thorough analysis of the method's behavior. It goes beyond final performance numbers and investigates the training dynamics, the effect of different schedulers, and the changes in the model's output patterns.

Weaknesses:
1. The conceptual novelty of the method appears to be an incremental extension of prior work. The idea of a compression phase followed by an expansion phase was already present in the DeepScaleR baseline. The primary contribution here is making this process iterative, which feels more like a refinement than a fundamentally new approach.
2. The comparison with baseline methods could be more robust. The results for several key baselines are incomplete in Table 1, which weakens the comparative claims. Additionally, the main competitive baseline, DAPO-DeepScaleR, is the authors' own implementation, which raises questions about whether it was optimally tuned for a fair comparison.
3. The explanation for why the method works is based on indirect evidence. The analysis linking performance gains to the frequency of specific keywords like "wait" is correlational. It does not provide a deep, causal understanding of how the model's reasoning structure is fundamentally improved by the iterative training.

Questions:
1. The method can be viewed as a form of curriculum learning. How does the proposed iterative oscillation compare to a more standard curriculum that simply starts with a short generation length and gradually increases it over time without the compression cycles?
2. Regarding the DAPO-DeepScaleR baseline, could you provide more detail on its implementation? Specifically, was its single expansion phase given a comparable number of training steps to one of your expansion phases to ensure that the performance difference is due to the iterative nature of SIRI?

EditLens Prediction: Fully AI-generated
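
Question 1 above contrasts SIRI's oscillation with a monotone curriculum. Purely as a point of reference, such a baseline schedule might look like the following sketch (the function name and default values are assumptions, not details from the paper):

```python
def curriculum_length(step: int, l_min: int = 2048, l_max: int = 8192,
                      warmup_steps: int = 600) -> int:
    """Monotone curriculum: start with a short budget and grow it linearly
    to the full budget, with no compression cycles."""
    frac = min(step / warmup_steps, 1.0)
    return int(l_min + frac * (l_max - l_min))
```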

Title: SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

Ratings: Soundness: 3: good | Presentation: 3: good | Contribution: 3: good | Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper introduces SIRI, a framework for training LRMs that enhances reasoning accuracy while reducing token usage by dynamically alternating between compression and expansion phases during RL. SIRI employs a scheduler to adjust maximum rollout lengths, forcing concise decision-making in compression phases to eliminate redundancy and enabling exploration in expansion phases for long-horizon planning. SIRI improves performance on math benchmarks, outperforming baselines such as DeepScaleR and AdaptThink by pushing the Pareto frontier of efficiency and accuracy.

Strengths:
- The paper is well presented and easy to read.
- Detailed empirical analyses validate the ideas behind the proposed method.
- The proposed method is easy to implement and achieves SOTA performance in terms of the Pareto frontier.
- Multiple scheduler choices are compared.

Weaknesses:
- The main idea of the paper does not seem very novel, as it extends DeepScaleR's compression-expansion approach into an iterative training framework.
- Introducing a new scheduler adds a number of hyperparameters, e.g., scheduler type, L_max, L_min, T. Tuning such hyperparameters can significantly increase the computational burden.
- The method is only validated on DeepSeek R1 Distill Qwen models.

Questions:
- Some recent papers on RLVR report that compressing the number of thinking tokens can degrade the model's general performance on tasks other than the one being trained. Is there any severe degradation in general-task performance when the compression is applied repeatedly?
- While the ablations show that the scheduler with the longest cycles performs best, DeepScaleR, which can itself be seen as a form of SIRI with extremely long cycles, is outperformed by the proposed method. What are the potential reasons that the proposed method outperforms DeepScaleR? How would the performance trend evolve if the cycle lengths were increased beyond those experimented with in the paper?

EditLens Prediction: Fully human-written

Title: SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

Ratings: Soundness: 2: fair | Presentation: 3: good | Contribution: 3: good | Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
Oscillating the response length helps iterative on-policy RL mechanisms such as GRPO.

Strengths:
* A simply designed length scheduler helps improve accuracy as iterations go by, as illustrated in Figures 1 and 4.
* If it is indeed true that accuracy increases when training with GRPO on responses generated through compression and expansion cycles, this is a quite intriguing discovery.

Weaknesses:
* While the paper provides empirical validation (Figures 4, 5, and 7), the core mechanism in Figure 2(b) is a hypothesis without theoretical justification.
* It is unclear whether the findings are statistically valid: standard errors or confidence intervals are not reported for any experiments.
* The stylized dynamics in Figure 2(b) do not align well with the actual training curves in Figures 4 and 5. The paper appears to be drawing connections between the hypothesis and the results that may not genuinely exist.

Questions:
* The way the length scheduler is used is not clear. Did you include a length term in the GRPO objective and add an additional parameter to control its effect?
* How can we determine the causal relationship between compression and efficiency gains, and between expansion and exploration capability?
* Recently, I read the paper below [1], which showed that GRPO leads to longer responses; the authors removed two terms to make training more stable and to control the response length. This might be relevant to your analysis of response length.

[1] Understanding R1-Zero-Like Training: A Critical Perspective, COLM 2025

EditLens Prediction: Fully human-written
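
The first question above asks how the length scheduler enters training. One plausible reading, sketched below purely for illustration, is that the scheduled budget L(t) only caps rollout generation and the GRPO objective itself carries no extra length term; the generate, reward_fn, and grpo_update callables are hypothetical placeholders, and this is not necessarily how the paper implements it.

```python
from typing import Callable, List

def train_with_length_schedule(
    generate: Callable[[str, int], str],       # (prompt, max_new_tokens) -> completion
    reward_fn: Callable[[str, str], float],    # (prompt, completion) -> scalar reward
    grpo_update: Callable[[str, List[str], List[float]], None],
    schedule: Callable[[int], int],            # step -> max rollout length L(t)
    prompts: List[str],
    num_steps: int,
    group_size: int = 8,
) -> None:
    """Hypothetical GRPO-style loop: the scheduled budget only truncates
    rollouts; the group-relative update itself is left unchanged."""
    for step in range(num_steps):
        max_len = schedule(step)  # e.g. the cosine rollout_length sketch above
        for prompt in prompts:
            # Sample a group of rollouts, each capped at the current budget.
            group = [generate(prompt, max_len) for _ in range(group_size)]
            rewards = [reward_fn(prompt, c) for c in group]
            grpo_update(prompt, group, rewards)  # standard GRPO update
```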