SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper introduces SPG for dLLMs. Because dLLMs have intractable log-likelihoods, standard policy gradients cannot be applied directly. SPG sandwiches the log-likelihood by maximizing an ELBO for positive-advantage samples and minimizing an EUBO for negative-advantage samples, paired with a blockwise masking strategy for Monte Carlo estimation. SPG achieves state-of-the-art accuracy on GSM8K, MATH500, Countdown, and Sudoku among RL methods for dLLMs.
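For concreteness, a schematic of the sandwiched surrogate as I understand it (the exact weighting, mixture coefficient, and normalization are as in the paper; this is only my reading):
$$
\hat{\mathcal{J}}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\Big(\mathbf{1}[\hat{A}_i > 0]\,\hat{A}_i\,\mathcal{L}_{\mathrm{ELBO}}(x_i \mid c;\theta) + \mathbf{1}[\hat{A}_i < 0]\,\hat{A}_i\,\mathcal{U}_{\mathrm{EUBO}}(x_i \mid c;\theta)\Big).
$$
Since $\mathcal{L}_{\mathrm{ELBO}} \le \log \pi_\theta \le \mathcal{U}_{\mathrm{EUBO}}$, each term lower-bounds the corresponding intractable $\hat{A}_i \log \pi_\theta(x_i \mid c)$, so maximizing the surrogate pushes up a lower bound where the advantage is positive and pushes down an upper bound where it is negative.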
1. The paper articulates the RL bottleneck for dLLMs and motivates a natural sandwiched approach.
2. Experimental results show consistent gains across benchmarks, sequence lengths, and inference strategies; ablations on key design choices further support the robustness of the method.
3. The paper is easy to follow and uses a compact structure that makes the method clear.
1. LoRA-only evidence without full fine-tuning controls: all experiments rely on LoRA, so findings may hinge on LoRA’s inductive biases; it remains unclear whether the gains and stability persist under full fine-tuning.
2. The authors mention that masking 15% of the prompt can improve performance. Although they state that this choice follows d1, they do not explain the motivation behind it or why it leads to better results.
1. In all main experiments, the paper sets the number of Monte Carlo samples to $m=2$. Ignoring computational constraints, how does increasing $m$ affect optimization variance and end-task performance?
2. Figure 9 depicts the training dynamics of the effective generation length. What do these trajectories imply about how SPG allocates its reasoning budget over training? For tasks such as Countdown and Sudoku, the effective length shows large-amplitude fluctuations; what factors drive these oscillations?
Lightly AI-edited
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces a policy gradient approach for RL fine-tuning of diffusion LLMs (dLLMs). The authors note that the policy gradient involves the gradient of the log-likelihood of the policy, and this log-likelihood is intractable in the case of dLLMs. Existing approaches rely on an evidence lower bound (ELBO), which yields a valid surrogate only when the return is positive. The authors build an evidence upper bound (EUBO) as a corollary of the Rényi variational bound, and use it for negative returns (the ELBO being used for positive rewards). The authors also provide two additional improvements: one is the mixing of ELBO and EUBO for negative returns, with a theoretical argument about decreasing variance (though the resulting bias is not discussed); the other is a block-wise masking strategy for Monte Carlo estimation. The authors then present a quite thorough experimental study, comparing to a number of baselines and ablating the different components of the proposed approach.
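To make the core argument explicit (my paraphrase of the paper's motivation): with $\mathcal{L}_{\mathrm{ELBO}}(x) \le \log \pi_\theta(x \mid c)$, the ELBO-weighted surrogate bounds the true policy-gradient objective term by term only when the weight is non-negative,
$$
R\,\mathcal{L}_{\mathrm{ELBO}}(x) \;\le\; R\,\log \pi_\theta(x \mid c) \quad \text{for } R \ge 0,
$$
and the inequality flips for $R < 0$, so maximizing the ELBO surrogate on negatively rewarded samples no longer maximizes a lower bound of the intended objective. Replacing the ELBO with an upper bound $\mathcal{U}_{\mathrm{EUBO}} \ge \log \pi_\theta$ restores the inequality on that side, since $R\,\mathcal{U}_{\mathrm{EUBO}} \le R\,\log \pi_\theta$ for $R \le 0$.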
* The paper is clearly written and structured, globally easy to follow, and well argued
* The point that the ELBO is not valid for negative rewards is a very good one; the proposed approach makes a lot of sense and provides good experimental results
* The empirical study is quite thorough with interesting ablations and relevant baselines
* From a reinforcement learning perspective, several aspects of the proposed approach lack clarity. See questions for more details.
* The experiments may be presented in an overly favorable light for the proposed approach. This may also be related to the baselines not being discussed enough (especially their key differences with SPG). See questions for more details.
### Clarity on the reinforcement learning aspect
The reinforcement learning part is presented in a quite limited way, raising a number of questions.
* The underlying Markov decision process is not even defined; it could at least be stated briefly (states, actions, transitions, reward).
* The fundamental argument of the paper is that the ELBO argument does not hold for a negative reward. This is true. However, in RL the optimal policy is invariant to a reward shift, so in principle one can assume the reward to be positive without loss of generality. This may not hold empirically, for various reasons (one being the variance of the policy gradient, for example), but it calls for at least some discussion, and ideally some baseline/ablation. The closest ablation is SPG w/ neg, which is something fundamentally different (it is closer to SFT on positive examples, ignoring negative ones, i.e., a smooth version of best-of-n). Such a baseline could be Reinforce, but one could also imagine a GRPO variation with a normalization leading to positive advantages (using the fact that a state-dependent baseline does not bias the gradient); a schematic comparison of the relevant advantage estimators is sketched after this list. Please discuss this aspect as thoroughly as possible.
* The objective in Eq. (4) is biased, because the GRPO objective is biased (its baseline depends on the sampled action). The correct way would be to use a leave-one-out empirical expectation of the return, known as RLOO [A, B] (see the sketch after this list). It is not a big deal and probably does not change much empirically, but given that the whole point is to better sandwich the policy objective, it is worth starting from an unbiased one.
* The objective in Eq. (4), and then the proxy in Eq. (5), do not consider importance sampling at all, whereas it seems that most of the baselines do. Is the proposed approach purely on-policy, like Reinforce, or does it allow some off-policyness, like GRPO, that is simply not taken into account in the loss, implying some additional bias in what is sandwiched? (As a side remark, RLOO without importance sampling is not biased in the off-policy case, as a corollary of [C], but the overall discussion point remains.)
* Many approaches for LLMs, and also for dLLMs, consider regularized RL (typically KL-regularization towards the initial model). It is not discussed here. Is it that no regularization is considered (which is perfectly fine, but worth discussing), or that it is skipped to lighten notation (in which case it should not be, as it could have implications)?
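For concreteness on the two points above about advantage estimation (these are the standard forms from the literature, not taken from the paper): the GRPO advantage uses a baseline that depends on the sampled return itself, while the leave-one-out (RLOO) baseline excludes it,
$$
\hat{A}^{\mathrm{GRPO}}_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)},
\qquad
\hat{A}^{\mathrm{RLOO}}_i = r_i - \frac{1}{G-1}\sum_{j \neq i} r_j .
$$
The reward-shifting point amounts to noting that replacing $r_i$ by $r_i + c$ for a constant $c$ large enough to make all rewards positive leaves the optimal policy unchanged, so a Reinforce-style run on shifted positive rewards (with the ELBO surrogate applied to every sample) would be a natural control for the contribution of the EUBO.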
### Possibly too favorable presentation of the experiments
This may be a wrong impression, and I am happy to be corrected, but it relies on the following points.
* As an initial side note, it would make sense to consider some of the baselines/ablations suggested above (related to the invariance of the optimal policy to reward shifting).
* The main point is that the considered baselines (D1, WD1, UniGRPO) are not described enough (even considering the related-work section in the appendix), so we do not know what the key differences with the proposed approach are. The current ablation shows well that the different components are important to the proposed approach, but what is missing is that the considered baselines are probably themselves a form of ablation. For example, SPG w/ ELBO seems pretty close to some of the baselines (like UniGRPO, but without IS/clipping?); how does it compare? Another example: if we look at the results of Table 1 for the baselines and compare them to the results of SPG with the random masking strategy, the results are much closer (e.g., UniGRPO vs. SPG w/ EUBO and random masking). So one may wonder whether simply applying the block-wise masking to UniGRPO would not also lead to very good results (weakening the EUBO contribution). Maybe it is not the case thanks to the mixing, but maybe the results are also more nuanced than how they are presented (or not, but this would then call for a better discussion of the baselines and their key differences, notably with respect to SPG).
* Another point that presents the proposed approach too favorably is that the SPG-specific hyperparameters ($\beta$ and $\omega$) are basically tuned on the test set, which biases the comparison in favor of SPG.
[A] Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, Ahmadian et al., 2024
[B] Buy 4 REINFORCE Samples, Get a Baseline for Free! Kool et al., 2019
[C] Contrastive Policy Gradient: Aligning LLMs on Sequence-Level Scores in a Supervised-Friendly Fashion, Flet-Berliac et al., 2024
Fully human-written
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper proposes a new reinforcement learning (RL) algorithm for diffusion large language models (dLLMs), which have intractable likelihoods that make standard policy gradient methods infeasible. Prior approaches approximate the log-likelihood using the ELBO, but the lower-bound approximation introduces gradient bias and limits learning from negative rewards. To overcome this, the authors introduce the Sandwiched Policy Gradient (SPG) method, which optimizes a “sandwiched” objective combining both a lower bound (ELBO) for positive-reward samples and an upper bound (EUBO) for negative-reward samples, thereby reducing bias in the policy gradient. SPG further employs a block-wise masking strategy to stabilize Monte Carlo estimation and a mixture formulation that adaptively blends upper and lower bounds to reduce gradient variance. Experiments on four reasoning benchmarks—GSM8K, MATH500, Countdown, and Sudoku—show that SPG achieves consistent improvements.
- Originality:
A novel Sandwiched Policy Gradient (SPG) framework is proposed to leverage both lower and upper bounds of the log-likelihood for diffusion LLMs, which is a clear conceptual advance over prior ELBO-only RL approaches.
- Quality:
Theoretical development is coherent and well-motivated, with a solid integration of Rényi-based upper bounds and a mixture formulation that balances bias and variance.
- Clarity:
The paper is well-written and logically structured. Figures and equations clearly illustrate the SPG process.
1. Non-standard evaluation protocol.
The paper selects checkpoints every 100 steps based on the highest test accuracy, which risks test set overfitting. While this follows d1 for consistency, it is methodologically questionable. The model should instead be selected via a validation set or by reporting the final checkpoint performance, as adopted in [1]. Revising the evaluation protocol would improve the experimental rigor.
2. Absence of standard RL stabilization techniques.
SPG appears to use a naive policy gradient update without employing the clipping or KL-regularization mechanisms used in PPO or GRPO. Without importance sampling ratio corrections, it is unclear whether SPG remains stable for off-policy updates ($\mu > 1$); the standard clipped surrogate is recalled in the sketch after the references below. This raises concerns about potential instability or divergence during training.
3. Lack of comparison with key related work.
The paper omits comparison with [2], which is the first to successfully apply trajectory-level RL to diffusion LLMs and reports state-of-the-art results on similar reasoning benchmarks. Including this baseline is essential for contextualizing SPG’s improvements and validating its claimed advantage.
[1] DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation, arXiv:2506.20639
[2] Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models, NeurIPS 2025.
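For reference on Weakness 2 (this is the standard PPO/GRPO formulation, not something taken from the paper): the clipped surrogate requires the likelihood ratio
$$
\rho_i(\theta) = \frac{\pi_\theta(x_i \mid c)}{\pi_{\theta_{\mathrm{old}}}(x_i \mid c)},
\qquad
\mathcal{L}^{\mathrm{clip}} = \mathbb{E}\Big[\min\big(\rho_i \hat{A}_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\big)\Big],
$$
and for a dLLM both the numerator and the denominator are exactly the intractable likelihoods that SPG is designed to avoid, which is presumably why forming the ratio is problematic (see the question on Weakness 2 below).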
- Regarding Weakness 1: Would the authors consider adopting a more standard evaluation setup (e.g., validation-based checkpoint selection or reporting final-step results) to reduce overfitting concerns?
- Regarding Weakness 2: Could the authors clarify why SPG does not adopt clipping or KL regularization? Is this omission due to the intractability of computing importance ratios caused by the constant $C(T)$ term in EUBO, or was it omitted for simplicity?
- Regarding the tightness of the EUBO: As discussed in [3] and [4], the ELBO of dLLMs equals the AO-ARM loss, and becomes tight when the joint distribution $p(x|\sigma)$ is consistent across different orders of $\sigma$ (i.e., the inequality in Eq. (2) of [3] holds with equality). However, as shown in Appendix C.3 of this paper, the EUBO does not appear to be tight even under this condition. Could the authors explain why this phenomenon occurs, and whether it implies a fundamental looseness in the Rényi-based upper bound?
[3] Autoregressive Diffusion Models, ICLR 2022
[4] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data, ICLR 2025
Moderately AI-edited
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper addresses policy gradients for diffusion language models, where $\log \pi_{\theta}(x \mid c)$ is intractable, by optimizing a lower bound for positive-advantage traces and an upper bound for negative-advantage traces with a practical block-wise masking estimator. Because dLLMs do not expose a tractable log-likelihood, prior works optimize the ELBO or one-step proxies, which biases gradients (the ELBO is only a lower bound). Sandwiched Policy Gradient proposes to "sandwich" the objective: maximize a tractable lower bound $L_{ELBO}$ on positive-advantage samples while minimizing a tractable upper bound $U_{EUBO}$ on negative-advantage samples (with group-relative advantages as in GRPO, for instance). The problem is well defined, with clear mathematical analysis and extensive experiments against RL baselines.
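As a reading aid, a minimal sketch of the update in PyTorch-style pseudocode (the `elbo` and `eubo` tensors stand in for the paper's block-wise masking Monte Carlo estimates, and the mixture with weight $\omega$ is omitted; this is only my schematic reading, not the authors' implementation):

```python
import torch

def spg_surrogate_loss(advantages: torch.Tensor,
                       elbo: torch.Tensor,
                       eubo: torch.Tensor) -> torch.Tensor:
    """Schematic sandwiched surrogate (my reading, not the authors' code).

    advantages: (G,) group-relative advantages (e.g., GRPO-style).
    elbo:       (G,) Monte Carlo lower-bound estimates of log pi_theta(x_i | c).
    eubo:       (G,) Monte Carlo upper-bound estimates of log pi_theta(x_i | c).
    """
    pos = (advantages > 0).float()
    neg = 1.0 - pos
    # Positive-advantage samples: maximize the lower bound (ELBO).
    # Negative-advantage samples: minimize the upper bound (EUBO),
    # which the negative advantage weight accomplishes automatically.
    surrogate = advantages * (pos * elbo + neg * eubo)
    # Return a loss to minimize (i.e., gradient ascent on the surrogate).
    return -surrogate.mean()
```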
1. Very clear problem: ELBO-only training biases gradients when rewards can be negative; upper bounds make the algorithm penalize low-reward traces without relying on true $\log \pi$.
2. Strong experimental results: significant performance gains across the four reasoning tasks.
3. The theory is reasonable. The EUBO, derived from a Rényi variational inequality, gives a tractable surrogate. Mixing the lower/upper bounds reduces variance both intuitively and theoretically (Prop. 1) and avoids gradient clipping/vanishing; a schematic of the mixture as I read it is given below.
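On point 3, my reading of the mixture (the exact form is given in the paper; this is only schematic): for negative-advantage samples, the surrogate blends the two bounds,
$$
\mathcal{B}_\omega(x) = \omega\,U_{EUBO}(x) + (1-\omega)\,L_{ELBO}(x), \qquad \omega \in [0,1],
$$
which reduces the variance of the negative-sample gradient (Prop. 1), at the price that $\mathcal{B}_\omega$ is no longer a guaranteed upper bound on $\log \pi_\theta$ when $\omega < 1$.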
1. Baselines. Comparisons focus on GRPO-like/diffusion-RL methods. Missing: preference optimization tailored to diffusion (DPO-style for dLLMs), and score-function or pathwise estimators based on learned surrogates or likelihood ratios via learned decoder controls.
2. The datasets are limited to reasoning tasks. Though the performance gains are significant, the benchmarks cover only math/logic. There are no tool-use or multi-turn agent settings, and no natural-language preference tasks.
3. Training cost versus baselines and wall-clock time are not clearly examined. Block-wise MC might add computational overhead. Can the authors report the computational overhead of each method?
What is the wall-clock and token-throughput overhead of block-wise MC vs. random masking, and vs. GRPO/D1/WD1, at equal budgets?
Could the authors add (even small-scale) comparisons to (a) DPO-style for diffusion, (b) score-matching/pathwise surrogates with a learned decoder, (c) GRPO + stronger variance reduction?
Can the authors add more datasets (non-reasoning, multi-turn, or tool-use benchmarks)?
Fully human-written |