ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction     Count      Avg Rating   Avg Confidence   Avg Length (chars)
Fully AI-generated      1 (33%)    6.00         3.00             2489
Heavily AI-edited       0 (0%)     N/A          N/A              N/A
Moderately AI-edited    0 (0%)     N/A          N/A              N/A
Lightly AI-edited       1 (33%)    8.00         3.00             1407
Fully human-written     1 (33%)    4.00         2.00             1181
Total                   3 (100%)   6.00         2.67             1692
Review 1

Title: How reinforcement learning after next-token prediction facilitates learning
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper studies why RL after next-token prediction helps large language models learn better reasoning. The authors build a simple setting in which the training data mixes short and long "chain-of-thought" examples. They show, theoretically and experimentally, that next-token prediction alone often fails when long samples make up less than 1/3 of the pretraining dataset, whereas RL can quickly improve learning by focusing on longer samples, which leads to longer and more correct responses. (A toy sketch of this data-mixture setting follows this review.)

Strengths:
1. Strong theoretical results combined with solid experiments, showing when and why RL succeeds.
2. The paper gives a simple and convincing explanation of why RL helps reasoning, providing clear insights.
3. The answers to the two core questions, "why RL works" and "why length increases", could offer a new design direction for optimizing LLM reasoning.

Weaknesses:
1. The theory assumes fully correct CoT examples. However, real pretraining data usually contains noisy or wrong samples, and even positive RL trajectories can be noisy or false positives. The paper lacks a discussion of these factors.
2. The theoretical proof could be written in a more organized manner, including the formulas; this would make that part easier to understand.

Questions:
1. How robust are the theoretical conclusions if long CoT samples contain noise?
2. Can you still observe "length-driven generalization" when the RL reward contains a certain proportion of length penalty?

EditLens Prediction: Lightly AI-edited
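
To make the data-mixture setting described in Review 1 concrete, here is a minimal, hypothetical sketch of how a d-bit parity dataset mixing short answers and long chain-of-thought demonstrations could be generated. The function names, the token format, and the use of `p_cot` as the mixture probability are illustrative assumptions, not the paper's actual code.

```python
import random

def parity_example(d: int, long_cot: bool) -> str:
    """Build one training sequence for the d-bit parity task.

    Short examples state the answer directly; long examples spell out the
    running parity after each bit as an explicit chain of thought.
    """
    bits = [random.randint(0, 1) for _ in range(d)]
    answer = sum(bits) % 2
    prompt = "bits: " + " ".join(map(str, bits))
    if not long_cot:
        return f"{prompt} -> answer: {answer}"
    running, steps = 0, []
    for b in bits:
        running ^= b                      # running parity seen so far
        steps.append(str(running))
    return f"{prompt} -> steps: {' '.join(steps)} -> answer: {answer}"

def build_dataset(n: int, d: int, p_cot: float) -> list[str]:
    """Mix long-CoT examples with probability p_cot and short ones otherwise."""
    return [parity_example(d, random.random() < p_cot) for _ in range(n)]

# Example: the regime the reviews highlight, with long CoT rarer than 1/3.
data = build_dataset(n=10_000, d=16, p_cot=0.2)
print(data[0])
```

In this sketch, lowering `p_cot` below the 1/3 threshold the reviewers cite simply makes the long demonstrations rarer; whether that reproduces the reported failure of next-token prediction depends on the paper's exact setup.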
Review 2

Title: How reinforcement learning after next-token prediction facilitates learning
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes a framework to theoretically understand the success of the popular LLM training paradigm of RL post-training after SFT. It uses parity-check experiments to show that RL enables the model to generalize to difficult tasks while SFT cannot, and it also demonstrates this phenomenon on math problems.

Strengths:
- The paper is well motivated: it tries to understand the reason behind the currently successful LLM training paradigm.
- The paper proposes a relatively comprehensive theoretical analysis framework.
- The paper empirically validates its perspective through a cleverly designed and controlled parity-check experiment.

Weaknesses:
1. The theoretical proof relies on a simplified linear autoregressive model.
2. Some of the ideas in this paper were already pointed out in papers such as DeepSeek-R1.
3. The paper lacks practical guidance for future LLM training and new algorithms.

Questions:
- Could similar phenomena be observed in a worded version of the parity task, e.g., giving the task description as natural-language input to an LLM? (A sketch of such a prompt follows this review.)
- The theoretical analysis relies on a simple linear framework; how would it extend to non-linear architectures such as Transformers?

EditLens Prediction: Fully human-written
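
Regarding Review 2's first question, a worded version of the parity task could be phrased as a natural-language prompt along the following lines; the wording below is a hypothetical example, not taken from the paper.

```python
def worded_parity_prompt(bits: list[int]) -> str:
    """Wrap a parity instance in a natural-language task description,
    as Review 2's first question suggests (hypothetical wording)."""
    bit_str = ", ".join(str(b) for b in bits)
    return (
        f"You are given the following bits: {bit_str}. "
        "Count how many of them are 1 and decide whether that count is even or odd, "
        "reasoning step by step. Finish with 'answer: 0' for even or 'answer: 1' for odd."
    )

print(worded_parity_prompt([1, 0, 1, 1]))
```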
Review 3

Title: How reinforcement learning after next-token prediction facilitates learning
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper explains why large language models improve dramatically when reinforcement learning (RL) follows next-token prediction training. The authors prove that next-token prediction alone cannot generalize on hard tasks if long reasoning chains are rare, whereas RL amplifies these rare sequences and enables efficient learning. Experiments on both synthetic and real-world reasoning tasks confirm that RL boosts accuracy and increases response length. The results highlight RL's role in unlocking generalization from limited but valuable reasoning data. (An illustrative sketch of an RL selection step with an optional length penalty follows this review.)

Strengths:
* The paper is the first to provide a formal separation between next-token prediction and next-token prediction plus RL in autoregressive models, giving a novel theoretical account of why the widely used "pre-train then RL" recipe succeeds.
* By showing that RL can turn a sample-hard, exponentially data-hungry problem into a polynomial-time solvable one whenever long demonstrations are merely polynomially rare, the work directly informs how scarce reasoning data should be leveraged in large-scale model development.

Weaknesses:
* All conclusions of the paper are drawn around the task of predicting the parity of d bits given access to a source of sequences, and their generalization to more common reasoning tasks (scientific or open-ended) is questionable.
* There is a lack of experiments on a broader range of more recent LLMs, e.g., Qwen3.

Questions:
* Theorem 1 gives $p_{cot} < 1/3$ as the exact point where greedy decoding stays short. Is this an artifact of the two-step linear decision model, or does it survive richer embeddings (e.g., transformers with non-linear MLPs)? An ablation that keeps the data distribution but increases model expressivity would clarify whether the threshold is distribution- or architecture-specific.
* The post-training bound hides the per-round sample size inside $\tilde{O}(\cdot)$ and assumes fresh data every round. How many total unique prompts does the algorithm really need?
* Parity has a single deterministic "long path". Real reasoning data often contain many valid chains of varying length and quality. Does the two-component mixture still capture the dynamics, or does the presence of noisy/partial chains shift the critical $p_{cot}$ or require a different RL objective?
* Theorem 2 requires $p_{cot} \in \Omega(d^{-\kappa})$. For $d \approx 1,000$ (typical for LLM prompts) this seems prohibitive. Is the polynomial dependence tight, or do transformers empirically succeed with much smaller constants?

EditLens Prediction: Fully AI-generated
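
Several of the questions above (the noisy-chain point, and Review 1's length-penalty question) revolve around how the post-training reward selects trajectories. Below is a minimal, generic sketch of such a selection step with an optional length penalty; the reward definition and the rejection-sampling-style selection rule are assumptions for illustration and are not claimed to be the paper's algorithm.

```python
def parity_reward(response: list[int], target: int, length_penalty: float = 0.0) -> float:
    """Reward = answer correctness minus an optional per-token length penalty.

    The length-penalty term is an illustrative addition echoing Review 1's
    question; the paper's actual reward may be correctness only.
    """
    correct = 1.0 if response and response[-1] == target else 0.0
    return correct - length_penalty * len(response)

def select_for_next_round(samples: list[list[int]], target: int,
                          length_penalty: float = 0.0) -> list[list[int]]:
    """Keep sampled responses with positive reward, so that rare long, correct
    chains are up-weighted in the next round of next-token training
    (rejection-sampling style; a generic stand-in, not the paper's procedure)."""
    return [r for r in samples if parity_reward(r, target, length_penalty) > 0]

# Toy illustration for target parity 1: a short wrong answer vs. a long correct chain.
short_wrong = [0]                    # direct answer, incorrect
long_correct = [1, 1, 0, 1, 1]       # step-by-step running parities ending in the answer
print(select_for_next_round([short_wrong, long_correct], target=1))
# With no penalty only the long correct chain survives; a large enough
# length_penalty would start discarding long chains as well.
```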