ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 4.00 | 4.00 | 3424 |
| Lightly AI-edited | 1 (25%) | 4.00 | 3.00 | 2016 |
| Fully human-written | 2 (50%) | 7.00 | 3.00 | 1566 |
| Total | 4 (100%) | 5.50 | 3.25 | 2143 |
**Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective**

Soundness: 2: fair | Presentation: 3: good | Contribution: 1: poor | Rating: 4: marginally below the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper proposes a reinforcement learning framework for training diffusion language models. The approach builds upon group sequence-level importance sampling ratios for diffusion language models and incorporates an additional KL regularization term (referred to as K2-type regularization by the authors). The method is evaluated empirically on several tasks: Countdown, mathematical reasoning, and coding problems, demonstrating performance improvements over baseline approaches.

Strengths:
- The paper is well-written, with a logical structure that makes the technical content accessible. The progression from problem formulation to methodology to experimental validation is easy to follow.
- The experimental evaluation demonstrates notable improvements on both the Countdown and the math/coding tasks, suggesting the proposed approach is effective for the target applications.

Weaknesses:
- The proposed method largely combines existing techniques from prior work (Zheng et al., 2025; Tang & Munos, 2025b) without introducing new algorithmic components or theoretical insights. The contribution appears primarily incremental, adapting established methods to the diffusion language model setting rather than developing new approaches tailored to the unique characteristics of these models.
- While the empirical improvements are encouraging, the paper does not address the fundamental question of what constitutes an appropriate RL formulation for diffusion language models. The current approach treats the problem as one of better approximating log-probabilities, but a more principled direction would be to re-derive the policy gradient theorem specifically for diffusion language models from first principles. Critical questions remain unanswered: What is the proper form of policy gradients for diffusion language models? Does the "log-probability" term even appear in such gradients, or should the formulation take a fundamentally different form? Without addressing these foundational issues, the work risks building on potentially unsuitable assumptions.

Questions:

**Justification for sequence-level formulation:** The paper does not provide clear insight into why the proposed sequence-level formulation outperforms the token-level formulation. Is the advantage due to reduced bias, reduced variance in the policy gradient estimates, or both? A quantitative analysis comparing the bias-variance tradeoffs of both formulations would strengthen the paper's claims.

**Hyperparameter selection:** How was the clipping parameter $\epsilon$ in Proximal Policy Optimization chosen? Were separate hyperparameter searches conducted for the token-level and sequence-level formulations, or was the same value used for both? This is important for ensuring a fair comparison between the two approaches.

**Role of KL regularization:** The results in Figure 2 raise several questions about KL regularization:
- The figure suggests that KL regularization enables a significant performance breakthrough in later training stages, which is unexpected. Typically, KL regularization is understood to improve training stability rather than drive performance gains. Can the authors explain this phenomenon?
- In standard language model fine-tuning on reasoning tasks, KL regularization is often minimal or omitted entirely. Why is it critical in this setting?
- For clarity: which type of KL regularization (forward, reverse, or the K2-type mentioned) is used in Figure 1? (A toy sketch of the $k_1$/$k_2$/$k_3$ estimators follows this review.)

EditLens Prediction: Moderately AI-edited
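Editorial aside on the estimator naming raised in the last question: the code below is a toy illustration, not code from the paper, of the standard $k_1$/$k_2$/$k_3$ Monte Carlo approximations of KL$(\pi_\theta \,\|\, \pi_{\text{ref}})$. The $k_3$ form exponentiates the log-ratio, which is the instability the paper reportedly avoids by switching to the quadratic $k_2$ form. All variable names are assumptions for illustration only.

```python
# Toy illustration of the k1/k2/k3 KL estimators; not code from the paper.
import torch

torch.manual_seed(0)

# Simulated per-sample log-ratios log pi_theta(y|x) - log pi_ref(y|x), with y ~ pi_theta.
# In the dLLM setting these would themselves be ELBO-based proxies, not exact log-likelihoods.
log_ratio = 0.1 + 0.5 * torch.randn(10_000)

k1 = log_ratio                              # unbiased, high variance, can go negative
k2 = 0.5 * log_ratio ** 2                   # biased, low variance, always non-negative
k3 = torch.expm1(-log_ratio) + log_ratio    # unbiased, but the exp term can blow up for large ratios

print(k1.mean().item(), k2.mean().item(), k3.mean().item())
```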
**Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective**

Soundness: 3: good | Presentation: 2: fair | Contribution: 3: good | Rating: 6: marginally above the acceptance threshold | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: The paper proposes ESPO, an RL framework tailored to diffusion LLMs which treats the whole completion as a single action. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct across math (GSM8K, MATH), coding (HumanEval/MBPP and EvalPlus), and planning (Countdown, Sudoku) show consistent gains versus token-level d1/wd1 baselines.

Strengths:
- Using a sequence-level action space makes the method very simple.
- Experiments show strong improvements over the baselines, which do not treat the entire sequence as an action.

Weaknesses:
- In terms of novelty, this seems like a straightforward application of GSPO to diffusion models.
- It is not clear why per-token evaluation is necessarily bad for diffusion LLMs specifically: is this true for all LLMs (as GSPO claims) or just diffusion LLMs? (A toy contrast of token-level vs. sequence-level importance ratios follows this review.)

Questions: See weaknesses.

EditLens Prediction: Fully human-written
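To make the GSPO comparison in the weaknesses above concrete, here is a minimal, hypothetical contrast between per-token importance ratios (GRPO/PPO style) and a length-normalized sequence-level ratio (GSPO style). The tensors are simulated stand-ins; for a dLLM the per-token log-probabilities would only be surrogates, which is the mismatch the review alludes to.

```python
# Illustrative contrast between token-level and sequence-level importance ratios.
# Hypothetical values; in practice these would be (proxy) log-probs from the model.
import torch

torch.manual_seed(0)
T = 12                                        # completion length
logp_new = torch.randn(T) * 0.1 - 2.0         # per-token log-probs under the current policy
logp_old = logp_new + torch.randn(T) * 0.05   # per-token log-probs under the behavior policy

# Token-level ratios (GRPO/PPO style): one ratio per token, clipped independently.
token_ratios = torch.exp(logp_new - logp_old)             # shape (T,)

# Sequence-level ratio (GSPO style): geometric mean of token ratios,
# i.e. exp of the length-normalized sum of log-ratio differences.
seq_ratio = torch.exp((logp_new - logp_old).sum() / T)    # scalar

eps = 0.2                                     # illustrative clipping range
clipped_seq_ratio = seq_ratio.clamp(1 - eps, 1 + eps)
print(token_ratios)
print(seq_ratio.item(), clipped_seq_ratio.item())
```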
**Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective**

Soundness: 3: good | Presentation: 3: good | Contribution: 3: good | Rating: 8: accept, good paper | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: The paper studies the problem of applying reinforcement learning (RL) to diffusion LLMs. Token-level objectives for autoregressive models (e.g., GRPO) require token log-likelihoods that are intractable to compute for diffusion models. The paper proposes ESPO, a sequence-level policy optimization method that treats generating the entire completion as one action and replaces the intractable sequence log-likelihood with an ELBO proxy. ESPO stabilizes training by (i) normalizing the ELBO log-ratio by sequence length and (ii) using a k2 (quadratic) KL estimator instead of the unstable exponential k3 estimator, alongside variance-reduction tricks (antithetic/coupled masking). ESPO is empirically evaluated with LLaDA-8B-Instruct and Dream-7B-Instruct on math (GSM8K, MATH), coding (HumanEval/MBPP), and planning (Countdown, Sudoku). ESPO consistently beats diffu-GRPO and wd1. Ablations show that sequence-level + ELBO is the only stable formulation among the tested variants, and a training-cost analysis shows FLOPs/time grow mildly with the number of Monte Carlo samples because generation dominates compute.

Strengths:
* The paper explains the problem setup and why token-level importance ratios lack a valid probabilistic interpretation for dLLMs fairly well. The shortcomings of existing methods also make the motivations quite clear.
* I like the idea of moving to a sequence-level objective and using an ELBO-based ratio, avoiding heuristic token surrogates.
* The performance of the method is tested across a variety of tasks and two different base models and shows consistent improvement over the baselines.
* The paper is also generally quite well written.

Weaknesses:
* ESPO optimizes an ELBO difference, not the true sequence likelihood ratio, but the paper does not quantify how ELBO tightness affects policy improvement or bias across tasks. (A toy Monte Carlo ELBO sketch follows this review.)
* The paper misses some closely related prior work on RL fine-tuning of diffusion language models [1, 2]. I believe a comparison to these baselines would be critical.

[1] Venkatraman et al., 2024. Amortizing intractable inference in diffusion models for vision, language, and control.
[2] Zekri and Boullé, 2025. Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods.

Questions:
* Length normalization is shown to have unintended consequences on reasoning performance for AR LLMs. I am curious whether it impacts the long reasoning behaviors in your experiments?

EditLens Prediction: Fully human-written
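As a rough illustration of the quantity under discussion (a length-normalized, Monte Carlo-estimated ELBO standing in for the sequence log-likelihood), the sketch below is an assumption-laden mock-up rather than the authors' implementation: `model_token_logprobs` is a hypothetical stand-in for the denoiser's per-token predictions, the uniform-time masking schedule mirrors LLaDA-style training, and the paper's antithetic/coupled masking variance reduction is not shown.

```python
# Hedged sketch: per-sequence Monte Carlo ELBO estimate for a masked-diffusion LM,
# with length normalization. `model_token_logprobs` is a hypothetical placeholder.
import torch

def model_token_logprobs(masked_tokens: torch.Tensor, clean_tokens: torch.Tensor) -> torch.Tensor:
    """Placeholder: log p_theta(clean token at position j | partially masked sequence), for all j."""
    return torch.full_like(clean_tokens, -2.0, dtype=torch.float)

def mc_elbo(clean_tokens: torch.Tensor, num_samples: int = 4) -> torch.Tensor:
    """Length-normalized Monte Carlo ELBO estimate (nats per token), averaged over mask draws."""
    L = clean_tokens.numel()
    estimates = []
    for _ in range(num_samples):
        t = torch.rand(())                              # masking level t ~ Uniform(0, 1)
        mask = torch.rand(L) < t                        # mask each position independently w.p. t
        masked = clean_tokens.masked_fill(mask, -1)     # -1 stands in for the [MASK] token id
        logps = model_token_logprobs(masked, clean_tokens)
        # Sum log-probs over masked positions, reweight by 1/t, normalize by sequence length.
        estimates.append((logps * mask).sum() / t.clamp_min(1e-3) / L)
    return torch.stack(estimates).mean()

seq = torch.randint(0, 1000, (32,))                     # dummy token ids for a 32-token completion
print(mc_elbo(seq).item())
```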
**Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective**

Soundness: 4: excellent | Presentation: 4: excellent | Contribution: 2: fair | Rating: 4: marginally below the acceptance threshold | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: This work proposes ELBO-based Sequence-level Policy Optimization (ESPO), a principled framework tailored for dLLMs. ESPO defines the entire sequence generation as an atomic action (avoiding token-level decomposition), employs the evidence lower bound (ELBO) as a tractable proxy for the sequence-level likelihood, and utilizes a robust $k_2$ estimator for the KL divergence (free from exponential instability). Extensive experiments on models such as LLaDA-8B-Instruct and Dream-7B-Instruct across tasks (mathematical reasoning, coding, planning) demonstrate ESPO's superiority: it achieves up to 60-point absolute improvements in planning tasks (e.g., Sudoku), stable gains in knowledge-intensive tasks (e.g., GSM8K, HumanEval), and superior training efficiency (only a 47% cost increase when the number of MC samples doubles).

Strengths:
1. Clearly identifies the mismatch issue under the dLLM framework, highlighting that existing RL algorithms struggle to compute token-level importance sampling ratios, and proposes a sequence-level alternative.
2. The experiments are thorough, including extensive validation and ablation studies across both math reasoning and agent tasks, demonstrating the effectiveness of the proposed method.
3. The paper is well-organized, with clear presentation and coherent argumentation.

Weaknesses:
1. For the math reasoning experiments, could the authors provide training curves? The improvement over the baseline appears limited.
2. Why are there no experiments on Dream-7B-Instruct with the d1 baseline?
3. Why are the results on coding benchmarks such as HumanEval and MBPP missing baseline comparisons?
4. The performance on Sudoku shows a sudden change. Could the authors provide a deeper explanation or analysis?
5. From Figure 2, the observed improvements seem mainly driven by the $k_2$ estimator. Could the authors include wd1 and d1 with the $k_2$ estimator on Sudoku as additional comparisons?

I am willing to raise my score should the authors provide satisfactory responses and clarifications.

Questions: Please see the questions in Weaknesses.

EditLens Prediction: Lightly AI-edited