|
Deterministic Discrete Denoising |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a deterministic algorithm for sampling from a uniform discrete diffusion model, in the sense that the only source of randomness is the initial random noise. The algorithm is based on a variant of the herding algorithm. The authors demonstrate the performance of the proposed method on text and image generation tasks.
To the best of my knowledge, the herding algorithm has not been applied to sampling from discrete diffusion models before, which makes this work original. As shown in the experiments, the proposed method achieves competitive performance compared to existing stochastic sampling algorithms. The paper is well-written and clearly structured, and the mathematical notation is mostly clean and consistent.
The largest weakness of the paper lies in its limited scope of applicability. The proposed method is only effective for **uniform** discrete diffusion models, and the experiments are all limited to this setting. However, the current trend in discrete diffusion is toward **masked** discrete diffusion models, which achieve better performance on various tasks, especially text generation, and scale to larger models and datasets. Due to the nature of masked discrete diffusion models, the proposed method cannot be directly applied to them, so I feel its impact is limited.
Moreover, as uniform discrete diffusion models do not perform as well as masked discrete diffusion models or other baselines on many modalities, the improvements shown in the experiments may not be very meaningful in practice. The paper only demonstrates quantitative improvements in terms of generative perplexity, entropy, FID and IS. These seemingly significant improvements do not necessarily translate to better quality of the generated samples, which can be as bad as those produced by stochastic sampling methods on the same models. For instance, a generative perplexity of 74.95 on GPT-2-level text may still correspond to low-quality text, and an FID of 19.20 on CIFAR-10 is much worse than most state-of-the-art generative models nowadays, which achieve FID scores below 10. It would be more convincing if the authors could provide qualitative results of the generated samples.
In addition, I understand it is an interesting theoretical problem, but it would be beneficial if the authors could provide more convincing motivation for studying deterministic samplers. For continuous diffusion models, deterministic ODE samplers are preferred because their generated trajectories are straighter ($C^1$ continuity of the ODE vs. $C^{\frac12-}$ continuity of the SDE), which leads to smaller discretization errors at the same step size. Moreover, deterministic ODE samplers enable distillation (e.g., consistency models). However, I don't see similar motivations for discrete diffusion models.
Overall, I think the paper provides an interesting algorithm, but its limited scope of applicability makes it less impactful in practice. While the work contains some novel ideas, the combination of limited applicability, modest experimental results, and lack of strong theoretical foundations suggests it doesn't meet the standards for top-tier ML/AI conferences.
Could the authors provide some theoretical guarantees on the convergence of Eqs. (7,8)? Can we prove that $||\mathbf{w} _ t||$ is bounded under some mild assumptions in this simple case? As the authors have mentioned in Sec. 5, the initialization of $\mathbf{w} _ T$ is $\text{Unif}([0,1]^K)$. How does this choice affect the convergence, and is there any reason why this specific initialization was chosen, instead of, say, $\text{Unif}([-1,1]^K)$?
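To make the boundedness question concrete, below is a minimal sketch of the herding recursion as I read Eqs. (7,8), for a fixed target distribution over $K$ categories. The $\text{Unif}([0,1]^K)$ initialization follows Sec. 5, but the variable names and the rest of the snippet are my own reconstruction, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
p = rng.dirichlet(np.ones(K))        # fixed target distribution over K categories
w = rng.uniform(0.0, 1.0, size=K)    # w_T ~ Unif([0,1]^K), as in Sec. 5

counts = np.zeros(K)
T = 10_000
for _ in range(T):
    k = int(np.argmax(w))            # x_t = argmax_x w_t^T x, with x one-hot
    counts[k] += 1
    w += p                           # accumulate the target probabilities ...
    w[k] -= 1.0                      # ... and subtract the selected one-hot x_t

# Empirically the visit frequencies track p at roughly an O(1/T) rate,
# which is why a bound on ||w_t|| would be informative.
print(np.abs(counts / T - p).max())
```

If I recall correctly, the classical herding analysis yields bounded weights when the target lies in the interior of the marginal polytope; it is unclear to me whether a similar argument carries over to the time-inhomogeneous case of Eqs. (9,10).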
Comparing Eqs. (7,8) and Eqs. (9,10), why is there $(\mathbf{w} _ t+\mathbf{p} _ {t-1})^\top\mathbf{x}$ instead of $\mathbf{w} _ t^\top\mathbf{x}$? Is there any intuition about the convergence and boundedness of the weights in this time-inhomogeneous case? Also, since in practice we consider sequences of $L$ tokens, i.e., $\mathcal{V}^L$ instead of $\mathcal{V}$, the notation in Eqs. (9,10) needs to be modified accordingly. Pseudo-code would also help present the algorithm more clearly; a sketch of my reading is given below.
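For concreteness, here is how I currently read the sequence-level version of Eqs. (9,10); `denoiser` is a hypothetical stand-in for the learned reverse model, and this is my reconstruction rather than the authors' implementation, so please correct it if it is wrong.

```python
import numpy as np

def deterministic_denoise(denoiser, L, K, T, rng):
    """My reading of Eqs. (9,10) extended to sequences in V^L.

    `denoiser(x, t)` is a hypothetical stand-in for the learned reverse model;
    it should return an (L, K) array of per-position probabilities p_{t-1}.
    """
    w = rng.uniform(0.0, 1.0, size=(L, K))   # w_T ~ Unif([0,1]^K) per position
    x = rng.integers(0, K, size=L)           # x_T: uniform noise over the vocabulary
    for t in range(T, 0, -1):
        p = denoiser(x, t)                   # p_{t-1}, shape (L, K)
        scores = w + p                       # the (w_t + p_{t-1})^T x term, per position
        x = scores.argmax(axis=1)            # deterministic token selection
        w = w + p - np.eye(K)[x]             # w_{t-1} = w_t + p_{t-1} - x_{t-1}
    return x
```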
As pointed out in the weaknesses, I suggest the authors provide some qualitative results of the experiments. Also, a more detailed discussion of related work on herding algorithms would be helpful to position the proposed method in the literature. In particular, more explanation of UDLM is needed for readers unfamiliar with it. |
Lightly AI-edited |
|
Deterministic Discrete Denoising |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a deterministic algorithm for the denoising procedure in discrete diffusion models, which can be thought of as an analog of Denoising Diffusion Implicit Models (DDIM) for the setting of discrete diffusion models. Specifically, instead of using a categorical sampler or the Gumbel-max trick at each time step of the denoising procedure, the authors borrow ideas from the herding algorithm and chaotic dynamical systems to introduce a “DDIM-like” deterministic reverse process for discrete state spaces. Numerical experiments on text and image generation tasks are provided to justify the effectiveness of the proposed methodology.
(1) This paper studies an important topic with clear motivation, positioning and presentation. To the best of the reviewer's knowledge, there has been no prior work that investigates how deterministic samplers like DDIM can be generalized to the setting of discrete diffusion models.
(2) The reviewer finds the perspective of chaotic dynamical systems/herding algorithms introduced in this paper particularly interesting, as it has the potential to motivate further studies.
(1) One possible drawback of the paper is that a theoretical analysis seems to be missing, especially given that the authors only present two sets of experiments (text and image generation) for the empirical part. Though the reviewer understands that the focus of the manuscript is on the algorithm and methodology, the authors are encouraged to at least include a literature review of existing work on the theoretical properties of discrete diffusion models (an incomplete list of related literature: [1,2,3,4,5,6,7]) and briefly comment on how the proposed algorithm could be analyzed theoretically by combining [1-7] with results from previous work on the theoretical properties of the herding algorithm (an incomplete list of related references: [8,9,27]).
(2) Though the experiments conducted here cover both text and image generation, it seems to the reviewer that the quality of the manuscript could be further improved by adding extra baselines. Most importantly, the authors did not include [28] as a baseline for the MDLM case, which definitely needs to be added. Moreover, the authors should also consider adding a review of existing methods for accelerating the inference speed/improving the generation quality of discrete diffusion models. An incomplete list of related work includes, but is not limited to, distillation-based methods [10,11,12,13], design of sampling schedules [14,15], high-order numerical solvers [16], and other techniques [17]. Especially for training-free methods like [14,15,16,17], the authors are encouraged to include (a subset of) them as extra baselines for the experiments here.
(1) Given that large language diffusion models have emerged as a hot topic recently, could the authors comment on whether the methodology proposed here can be generalized to LLaDA [18,19]? If so, how does the algorithm relate to standard techniques like parallel/speculative decoding in LLMs?
(2) Since the time-dependent scheme introduced in this paper augments every discrete token with a continuous weight and then performs an argmax-style update at each timestep, the reviewer would like to ask how the dynamics in this paper relate to the inference-time scaling framework for diffusion models (an incomplete list of related references: [20,21,22,23,24,25,26]), which essentially uses the Sequential Monte Carlo algorithm to simulate a gradient flow of Fisher–Rao type. Does this actually point to an inherent connection between Fisher–Rao dynamics and the dynamical system associated with the herding algorithm?
Overall, I think the paper might be considered for top ML venues like ICLR, but the authors should address all the questions above, add the papers listed below as extra references, and discuss them appropriately.
References:
[1] Srikanth, Aadithya, Mudit Gaur, and Vaneet Aggarwal. "Discrete State Diffusion Models: A Sample Complexity Perspective." arXiv preprint arXiv:2510.10854 (2025).
[2] Huang, Xunpeng, Yingyu Lin, Nishant Jain, Kaibo Wang, Difan Zou, Yian Ma, and Tong Zhang. "On the Complexity Theory of Masked Discrete Diffusion: From $\mathrm{poly}(1/\epsilon)$ to Nearly $\epsilon$-Free." arXiv preprint arXiv:2509.21835 (2025).
[3] Liang, Yuchen, Yingbin Liang, Lifeng Lai, and Ness Shroff. "Discrete Diffusion Models: Novel Analysis and New Sampler Guarantees." arXiv preprint arXiv:2509.16756 (2025).
[4] Liang, Yuchen, Renxiang Huang, Lifeng Lai, Ness Shroff, and Yingbin Liang. "Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models." arXiv preprint arXiv:2506.02318 (2025).
[5] Ren, Yinuo, Haoxuan Chen, Grant M. Rotskoff, and Lexing Ying. "How discrete and continuous diffusion meet: Comprehensive analysis of discrete diffusion models via a stochastic integral framework." arXiv preprint arXiv:2410.03601 (2024).
[6] Zhang, Zikun, Zixiang Chen, and Quanquan Gu. "Convergence of score-based discrete diffusion models: A discrete-time analysis." arXiv preprint arXiv:2410.02321 (2024).
[7] Chen, Hongrui, and Lexing Ying. "Convergence analysis of discrete diffusion model: Exact implementation through uniformization." arXiv preprint arXiv:2402.08095 (2024).
[8] Harvey, Nick, and Samira Samadi. "Near-optimal herding." In Conference on Learning Theory, pp. 1165-1182. PMLR, 2014.
[9] Bach, Francis, Simon Lacoste-Julien, and Guillaume Obozinski. "On the equivalence between herding and conditional gradient algorithms." arXiv preprint arXiv:1203.4523 (2012).
[10] Hayakawa, Satoshi, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, and Yuki Mitsufuji. "Distillation of discrete diffusion through dimensional correlations." arXiv preprint arXiv:2410.08709 (2024).
[11] Fu, Feiyang, Tongxian Guo, and Zhaoqiang Liu. "Learnable Sampler Distillation for Discrete Diffusion Models." arXiv preprint arXiv:2509.19962 (2025).
[12] Zhu, Yuanzhi, Xi Wang, Stéphane Lathuilière, and Vicky Kalogeiton. "Di[M]O: Distilling masked diffusion models into one-step generator." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18606-18618. 2025.
[13] Zhu, Yuanzhi, Xi Wang, Stéphane Lathuilière, and Vicky Kalogeiton. "Soft-Di[M]O: Improving one-step discrete image generation with soft embeddings." arXiv preprint arXiv:2509.22925 (2025).
[14] Amin, Alan N., Nate Gruver, and Andrew Gordon Wilson. "Why Masking Diffusion Works: Condition on the Jump Schedule for Improved Discrete Diffusion." arXiv preprint arXiv:2506.08316 (2025).
[15] Park, Yong-Hyun, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, and Yuki Mitsufuji. "Jump your steps: Optimizing sampling schedule of discrete diffusion models." In The Thirteenth International Conference on Learning Representations. 2024.
[16] Ren, Yinuo, Haoxuan Chen, Yuchen Zhu, Wei Guo, Yongxin Chen, Grant M. Rotskoff, Molei Tao, and Lexing Ying. "Fast solvers for discrete diffusion models: Theory and applications of high-order algorithms." arXiv preprint arXiv:2502.00234 (2025).
[17] Ben-Hamu, Heli, Itai Gat, Daniel Severo, Niklas Nolte, and Brian Karrer. "Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking." arXiv preprint arXiv:2505.24857 (2025).
[18] Nie, Shen, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. "Large language diffusion models." arXiv preprint arXiv:2502.09992 (2025).
[19] Zhu, Fengqi, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou et al. "LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models." arXiv preprint arXiv:2505.19223 (2025).
[20] Singhal, Raghav, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. "A general framework for inference-time scaling and steering of diffusion models." arXiv preprint arXiv:2501.06848 (2025).
[21] Chen, Haoxuan, Yinuo Ren, Martin Renqiang Min, Lexing Ying, and Zachary Izzo. "Solving inverse problems via diffusion-based priors: An approximation-free ensemble sampling approach." arXiv preprint arXiv:2506.03979 (2025).
[22] Skreta, Marta, Tara Akhound-Sadegh, Viktor Ohanesian, Roberto Bondesan, Alán Aspuru-Guzik, Arnaud Doucet, Rob Brekelmans, Alexander Tong, and Kirill Neklyudov. "Feynman-Kac correctors in diffusion: Annealing, guidance, and product of experts." arXiv preprint arXiv:2503.02819 (2025).
[23] Pani, Chinmay, Zijing Ou, and Yingzhen Li. "Test-Time Alignment of Discrete Diffusion Models with Sequential Monte Carlo." arXiv preprint arXiv:2505.22524 (2025).
[24] Ma, Nanye, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang et al. "Inference-time scaling for diffusion models beyond scaling denoising steps." arXiv preprint arXiv:2501.09732 (2025).
[25] Ren, Yinuo, Wenhao Gao, Lexing Ying, Grant M. Rotskoff, and Jiequn Han. "Driftlite: Lightweight drift control for inference-time scaling of diffusion models." arXiv preprint arXiv:2509.21655 (2025).
[26] Lee, Cheuk Kit, Paul Jeha, Jes Frellsen, Pietro Lio, Michael Samuel Albergo, and Francisco Vargas. "Debiasing guidance for discrete diffusion with sequential Monte Carlo." arXiv preprint arXiv:2502.06079 (2025).
[27] Lacoste-Julien, Simon, Fredrik Lindsten, and Francis Bach. "Sequential kernel herding: Frank-Wolfe optimization for particle filtering." In Artificial Intelligence and Statistics, pp. 544-552. PMLR, 2015.
[28] Chen, Zixiang, Huizhuo Yuan, Yongqian Li, Yiwen Kou, Junkai Zhang, and Quanquan Gu. "Fast sampling via discrete non-Markov diffusion models with predetermined transition time." Advances in Neural Information Processing Systems 37 (2024): 106870-106905. |
Fully human-written |