Variational Masked Diffusion Models
Soundness: 4: excellent
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper proposes VMD, which adds a global latent variable z to masked diffusion models (MDMs) so that token predictions are jointly conditioned on the partially masked sequence and on the latent variable, which absorbs the dependencies among concurrently predicted tokens. Addressing token dependency is crucial for MDMs because their parallel sampling can introduce errors when the model itself does not capture dependencies among simultaneously unmasked tokens. This work provides (mainly synthetic) experimental results to support its claim about token dependency.
The idea of tackling token dependency through a latent variable seems sound, as the latent provides information about the posterior distribution that cannot be recovered from the partially masked sequence alone. The ELBO objective is theoretically grounded, resulting in a reasonable training loss. Moreover, the experiments are well designed to show that VMD indeed captures token dependency much better.
The experimental support for the paper's claims is weak.
1. In the synthetic dataset, the experiment is well-designed and clearly supports the claim, but it remains a small-scale toy setting and cannot by itself establish the practical benefit of the method.
2. On the text data, although I appreciate the authors' effort in pretraining VMD from scratch, the small difference in generative perplexity (Table 5) is not enough to show that VMD captures token dependency much better than MDM.
3. On the Sudoku puzzles, VMD's accuracy is also only marginally better than the baseline (e.g., the Top-Prob margin). Given that this is a small-scale experiment (not done with a large-scale MDM), such a small difference in accuracy cannot firmly support the claim. Moreover, I don't fully see why VMD should work better than MDM in the Sudoku setup, where the answer to a given puzzle is often deterministic: even though z is provided as an additional conditioning variable, if the answer is fixed given the partially filled board, then the latent variable plays no meaningful role, i.e., $p_{\text{true}}(x_0^i \mid x_t) = p_{\text{true}}(x_0^i \mid x_t, z)$. I believe this is probably the fundamental reason why VMD failed to clearly outperform the MDM baselines.
Given this insight, I believe there certainly exist experimental setups where (1) the answer is not uniquely determined, so that providing z as context is meaningful, and (2) modeling token dependence is crucial for good results, e.g., certain logic puzzles. I understand that the main contribution could be read as formulating the VMD framework; still, ideally the authors would have shown a convincing case in which VMD clearly outperforms MDM, which would further strengthen its applicability.
I wonder what the authors think about the points raised in the Weaknesses section!
Fully human-written |
Variational Masked Diffusion Models
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes variational masked diffusion (VMD), which introduces a latent variable into masked diffusion models in order to capture the dependencies among the predicted positions given a partially masked sequence during multi-token generation. The training loss is derived from the ELBO of the data likelihood, similar to VAEs. The authors also propose a blockwise formulation that augments the block masked diffusion model. The experiments demonstrate that VMD outperforms existing model structures on a range of tasks, including synthetic data modeling and language modeling.
The paper is well-written and easy to follow, with clean mathematical derivations. The experiments cover diverse tasks and demonstrate the effectiveness of the proposed method in a convincing manner.
The main contribution of this paper is to introduce a latent variable into masked diffusion models and to train it in a VAE-like manner, with only the small overhead of the encoder and decoder networks. This is an interesting idea that has not been widely explored in the masked diffusion literature. However, a prior work, **VADD (arXiv:2505.17384)**, has already explored similar concepts, and the only novelty of this paper compared with VADD seems to be the blockwise formulation, which is a relatively minor extension. While I acknowledge that, according to the review policy, this is not by itself grounds for rejection, I strongly encourage the authors to discuss this prior work and clearly articulate the differences and contributions of their paper.
As mentioned above, I suggest that the authors incorporate a more comprehensive review of related work.
I appreciate the authors for providing experimental results on various modalities. However, these are all relatively small-scale:
- The paper takes at least two pages to present the experiments on synthetic data with only 2 or 4 dimensions, which seems too trivial. In these tasks, the data distribution has strong correlations among dimensions, so it is expected that introducing a latent variable would help compared with independent decoding. Therefore, this serves more as a sanity check rather than a convincing demonstration of the effectiveness of the proposed method in real-world scenarios.
- For the experiments on Sudoku, while it is a more complex task, the improvement over the vanilla masked diffusion model is relatively marginal, and surprisingly, the improvement at NFE = 5 ($+1.3\%$) is much smaller than that at NFE = 10 and 20 ($+2.3\%$ and $+2.5\%$), which is a little bit counter-intuitive to me since I expect the advantage of modeling dependencies should be more pronounced when the NFE is smaller.
- Finally, for language modeling, the dataset used is text8, which is quite small and outdated compared with the more recent, larger-scale datasets used in discrete diffusion work, which are typically paired with models of at least GPT-2 size. Thus, it remains unclear whether the proposed method can scale to more complex and larger-scale language modeling tasks. The improvement of VMD over BD3-LM is also relatively insignificant, which may likewise be due to the small scale of the dataset.
1. In Section 3, how does one go from (3) to (4)? It seems that a cleaner way to present this is to first define $p_\theta(x_s \mid x_t)$ as $\int \prod_{i=1}^d p_\theta(x_s^i \mid x_t, z)\, p(z)\,\mathrm{d}z$, which directly leads to (3) and (4) by partially integrating out dimensions in $x_s$ (a sketch of what I mean is given after this list).
2. Could the authors describe how the network architecture of a masked diffusion model is modified to receive the latent variable as input? A more interesting question: can a pretrained masked diffusion model be adapted to VMD by freezing (or training with a small learning rate) the original network weights and only training the added components? Experiments along this direction would be interesting.
3. For table 2, it is better to convert it to a plot with $p$ on the x-axis and KL on the y-axis, so that we can better visualize the trend.
4. I am thinking about the design of the latent variable $z$, which is trained to capture the dependencies among different masked dimensions in a partially observed sequence. Setting the prior of $z$ and the output of the encoder to Gaussian distributions is of course a standard choice, but I feel it may not be optimal here due to the lack of semantic meaning. Do the authors have any thoughts on this? For example, would it be better to use a more complex prior, such as a mixture of Gaussians, or even a discrete latent variable? Moreover, for sampling (Algorithm 2), I think it is also possible to resample the latent variable $z$ at each step instead of only once at the beginning (a sketch of this variant is given below). Have the authors considered this? Would it improve performance?
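For question 1, here is a minimal sketch of the definition and marginalization I have in mind (my own notation: $d$ is the sequence length and $S$ an arbitrary subset of positions; I am only guessing at how the paper's (3) and (4) are meant to be connected):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Suggested definition: marginalize the latent z over a per-dimension product.
\[
p_\theta(x_s \mid x_t) := \int p(z) \prod_{i=1}^{d} p_\theta\left(x_s^i \mid x_t, z\right) \mathrm{d}z .
\]
% Partially integrating out the dimensions in a subset S of x_s leaves an expression
% of the same form for the remaining dimensions, which is how I would expect (3) to
% lead directly to (4):
\[
p_\theta\left(\{x_s^i\}_{i \notin S} \mid x_t\right) = \int p(z) \prod_{i \notin S} p_\theta\left(x_s^i \mid x_t, z\right) \mathrm{d}z .
\]
\end{document}
```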
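For question 4, a rough sketch of the resampling variant I have in mind, written as standalone Python pseudocode; `denoiser`, `mask_id`, and the unmasking schedule are hypothetical stand-ins rather than the paper's exact Algorithm 2:

```python
import torch

def sample_with_resampled_latent(denoiser, x, mask_id, num_steps, latent_dim,
                                 resample_z=True):
    """Variant of the latent-conditioned sampler where the global latent z is
    redrawn from the N(0, I) prior at every denoising step instead of being
    fixed once at the beginning."""
    z = torch.randn(latent_dim)
    for step in range(num_steps):
        if resample_z and step > 0:
            z = torch.randn(latent_dim)              # fresh latent for this step
        logits = denoiser(x, z)                      # per-position logits, conditioned on z
        masked = (x == mask_id).nonzero(as_tuple=True)[0]
        if len(masked) == 0:
            break
        # unmask a fraction of the still-masked positions (schedule left unspecified)
        num_to_unmask = max(1, len(masked) // (num_steps - step))
        chosen = masked[torch.randperm(len(masked))[:num_to_unmask]]
        probs = torch.softmax(logits[chosen], dim=-1)
        x[chosen] = torch.distributions.Categorical(probs).sample()
    return x
```

My intuition is that resampling trades cross-step consistency for diversity, so it is not obvious a priori whether it would help; I would be curious to see it ablated.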
Fully human-written |
Variational Masked Diffusion Models
Soundness: 4: excellent
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Discrete diffusion models can flip multiple tokens at once by sampling from the product of their marginal distributions, which leads to poor modeling of inter-token dependencies. To address this issue, the authors introduce latent variables such that the model is conditionally independent given the latents. The formulation is conceptually similar to the encoder-decoder setting used in variational autoencoders (VAEs) and is trained with a suitable ELBO loss. The authors give a similar extension to the block diffusion framework.
The efficacy of the framework is demonstrated on synthetic experiments, Sudoku, and language modeling with the text8 dataset.
* The setting considered by this work is very important, especially for improving the performance of diffusion models. I believe the presented ideas are in the right direction.
* The idea is cleanly formulated, and the evaluations on synthetic datasets demonstrate its efficacy.
* The method shows some gains on Sudoku problems, especially at low NFE.
* The training procedure comes with added complexity: an additional encoder needs to be trained, and training and tuning it correctly can be a burden.
* The experiments on text data show only marginal gains -- I am not even sure these are statistically significant or worth the effort of training an additional latent model.
* The gains in the Sudoku experiments are marginal and the presentation is not very clear. It would be helpful if $c_{prob}$ and $c_{marg}$ were recalled and explained more rigorously in this section (also see questions).
* The authors use the pipeline from Kim et al. (2025), along with their method, as the baseline. However, I could not find Sudoku results for every NFE value in this reference, since it only considers NFE = 50. Please point me to the correct numbers in case I missed them. This is important since the authors claim to reproduce these values in Table 4.
Fully human-written |
Variational Masked Diffusion Models
Soundness: 3: good
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This work considers the problem of dependence between sampled tokens from masked diffusion models (MDMs). One-step sampling in MDMs does not consider dependence between tokens, while multi-step sampling is significantly slower. This work proposes Variational Masked Diffusion Models (VMDs) that predict an intermediate latent variable that, when conditioned on, renders individual tokens independent of each other.
The paper studies an interesting problem and presents it well with a nice exposition. The topic is also important as it might reduce inference time while maintaining quality outputs. I think this work, either by itself or through its follow-ups, can impact real-world large language models.
I did not find major weaknesses in this work. Some minor questions are given below. Additional minor questions/comments on writing are provided under "questions."
**W1. Number of tokens per block**: Since VMDs have key advantages on sequences with high inter-token dependency, why are the experiments limited to at most 2 tokens per block? Can VMDs work with multiple tokens with strong dependencies between tokens that are located far from each other? Maybe on a needle-in-a-haystack type of dataset [A1], even if it is a synthetic one that is quicker to train on. I know the Sudoku experiment has more than two tokens per block, but you cannot measure the KL divergence in that experiment.
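A minimal sketch of the kind of long-range-dependency data I have in mind (my own toy construction, not taken from the paper or [A1]): a value planted early in the sequence must be reproduced at a distant position, so whenever both positions are unmasked in the same step, independent per-token prediction cannot guarantee that they agree.

```python
import random

def make_needle_sequence(seq_len=64, vocab_size=50, needle_marker=1):
    """Toy needle-in-a-haystack sequence: the token right after the marker at an
    early position i must be copied at a distant position j near the end."""
    seq = [random.randrange(2, vocab_size) for _ in range(seq_len)]
    i = random.randrange(1, seq_len // 4)            # needle planted early
    j = random.randrange(3 * seq_len // 4, seq_len)  # query position far away
    value = random.randrange(2, vocab_size)
    seq[i - 1] = needle_marker                       # marker indicating the needle follows
    seq[i] = value
    seq[j] = value                                   # distant copy of the needle value
    return seq

# If positions i and j are masked simultaneously, each marginal is close to uniform,
# but the joint requires the two predictions to match -- exactly the kind of
# dependency a per-token factorized reverse step cannot capture.
```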
**W2. Obtaining inference-time latent $z$**: In Algorithm 2, the input is a fully or partially masked input sequence $x$. But $z$ is simply sampled from a Gaussian. When a partially masked input sequence is available, is it used to obtain $z$? If not, why?
**W3. Effect of resampling $z$**: Related to the previous question, does sampling different $z$ for the same starting sequence result in different outputs?
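To make W2 and W3 concrete, here is a rough Python sketch of the two options I am contrasting (all names, in particular `encoder` and `decode`, are hypothetical stand-ins, and I am guessing at the interface of Algorithm 2): drawing $z$ from the prior regardless of the input versus inferring it from the partially masked sequence, and the expectation that different draws of $z$ yield different completions.

```python
import torch

def obtain_latent(x_partial, encoder=None, latent_dim=64):
    """Option A (what Algorithm 2 appears to do): ignore the partially masked
    input and draw z from the N(0, I) prior.
    Option B (hypothetical, re W2): run an encoder on the partially masked
    sequence so that z reflects the tokens that are already visible."""
    if encoder is None:
        return torch.randn(latent_dim)                       # Option A: prior sample
    mu, log_var = encoder(x_partial)                         # Option B: amortized q(z | x_partial)
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

# Re W3: for a fixed x_partial, two independent latents generally give different outputs,
# since every per-token distribution is conditioned on z:
#   z1, z2 = obtain_latent(x_partial), obtain_latent(x_partial)
#   out1, out2 = decode(x_partial, z1), decode(x_partial, z2)   # typically out1 != out2
```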
These are some minor writing-related comments/questions:
**Q1.** "NFE" used in Table 3 is defined in the appendix, but not in the main text.
**References**
[A1] Elliot Nelson, Georgios Kollias, Payel Das, Subhajit Chaudhury, Soham Dan, "Needle in the Haystack for Memory Based Large Language Models", arXiv, 2024.
Fully human-written |
Variational Masked Diffusion Models
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper proposes Variational Masked Diffusion (VMD), a framework designed to address the limitation of independent concurrent token generation in standard masked diffusion models. VMD introduces a global latent variable to model joint dependencies among concurrently predicted tokens, and training is performed via variational inference over this latent variable. Inference first samples the latent variable and then iteratively unmasks tokens conditioned on it. The framework is further extended to block diffusion models, combining autoregressive dependencies across blocks with variational diffusion within blocks. The proposed approach is evaluated on synthetic datasets, Sudoku puzzles, and the text8 dataset.
- The paper studies a key problem of discrete diffusion models in capturing the joint dependencies of concurrently predicted tokens.
- The idea of using a latent variable to capture the joint dependencies is novel and has not been studied in the discrete diffusion literature (as far as I am aware).
- The authors have provided reasonably comprehensive experiments on synthetic data, though more extensive experiments on text would be welcome.
- The basic formulation of the latent variable in Section 3 carries an underlying assumption that $z$ is independent of $x_t$. More concretely, in eq. (3) and eq. (4) there should be $p(z \mid x_t)$ instead of $p(z)$ (I spell this out in a sketch after this list). I believe this approximation may limit the improvement when the distance between $p(z \mid x_t)$ and $p(z)$ is large. Evidence of this appears in the results on text data in Section 4.4, where the improvement of VMD over BD3-LM for block size 8 is much smaller than for block size 4. This may be because $x_t^b$ with block size 8 carries more information, and hence $p(z \mid x_t)$ can differ more from $p(z)$.
- While the experiments are quite comprehensive for synthetic data, they are somewhat lacking for text. It would be interesting to see the metrics like MAUVE and generative perplexity when using VMD with different numbers of inference steps. I wonder if, for a small number of inference steps, VMD provides a larger benefit compared to BD3-LM or MDM in general. This is possible because the model is required to more accurately capture the joint dependencies of concurrently predicted tokens.
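To spell out the first weakness in symbols (my notation; I am paraphrasing what I take eqs. (3) and (4) to say, not quoting them):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% As I read the paper, the latent is drawn from an unconditional prior:
\[
p_\theta(x_s \mid x_t) = \int p(z) \prod_{i} p_\theta\left(x_s^i \mid x_t, z\right) \mathrm{d}z ,
\]
% whereas an exact latent-variable treatment would condition the latent on x_t:
\[
p_\theta(x_s \mid x_t) = \int p(z \mid x_t) \prod_{i} p_\theta\left(x_s^i \mid x_t, z\right) \mathrm{d}z .
\]
% The approximation error plausibly grows with the discrepancy between p(z | x_t)
% and p(z), which may explain the smaller gain over BD3-LM at block size 8.
\end{document}
```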
The training of VMD uses $q_\phi(z \mid x_0, x_t)$, an approximate posterior parameterized by trainable parameters $\phi$, but inference does not. Is there a possibility of using it to partially address the issue mentioned in the first weakness? During inference, $x_0$ could be approximated with the denoised sequence before remasking (a rough sketch is given below).
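A rough sketch of the inference-time scheme suggested above (my own pseudocode; `denoiser`, `encoder`, and the unmasking schedule are hypothetical stand-ins for the paper's components): approximate $x_0$ with the current pre-remasking denoised estimate, feed it to $q_\phi$, and refresh $z$ as the estimate improves.

```python
import torch

def sample_with_amortized_latent(denoiser, encoder, x_t, mask_id, num_steps, latent_dim):
    """Hypothetical sampler reusing the trained approximate posterior q_phi(z | x_0, x_t)
    at inference, with x_0 replaced by the current denoised estimate."""
    z = torch.randn(latent_dim)                          # no estimate yet: fall back to the prior
    for step in range(num_steps):
        logits = denoiser(x_t, z)                        # per-position logits given the current latent
        x0_hat = logits.argmax(dim=-1)                   # crude stand-in for the clean sequence x_0
        mu, log_var = encoder(x0_hat, x_t)               # q_phi(z | x0_hat, x_t)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        masked = (x_t == mask_id).nonzero(as_tuple=True)[0]
        if len(masked) == 0:
            break
        num_to_unmask = max(1, len(masked) // (num_steps - step))
        chosen = masked[torch.randperm(len(masked))[:num_to_unmask]]
        probs = torch.softmax(logits[chosen], dim=-1)
        x_t[chosen] = torch.distributions.Categorical(probs).sample()
    return x_t
```

Whether this actually mitigates the mismatch between $p(z)$ and $p(z \mid x_t)$ is of course an empirical question.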
Fully human-written |
Variational Masked Diffusion Models
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The token independence problem in masked diffusion can lead to degraded generation quality, as observed in several previous works. This paper presents a latent variable method, called VMD, for modeling the dependencies among tokens in masked diffusion. The authors also extend VMD to a block diffusion and remasking scheme. The effectiveness of VMD is validated on synthetic datasets, and the experimental design itself can also be deemed a contribution.
- The presentation of this paper is clear, and it's easy for the readers to understand the whole methodology.
- The extension to the block diffusion and remasking scheme is straightforward but meaningful.
- The experimental design in Sections 4.1 and 4.2 clearly demonstrates VMD's ability to model dependencies, and can also be viewed as a contribution.
- The methodology of VMD is almost identical to the VADD model (https://arxiv.org/abs/2505.17384). As far as I know, VADD is the first work to consider using a latent variable model to define the transition probability $p_\theta(x_0|x_t)$, to train it within a VAE framework, and to discuss the corresponding sampling procedure. Specifically,
(a) Model definition: equation (3) in VMD is similar to equation (6) in VADD;
(b) Training objective: equation (5) in VMD is similar to equation (9) in VADD;
(c) Sampler: Algorithm 2 in VMD is similar to Algorithm 2 in VADD.
The authors should cite the VADD paper and clearly state their unique contribution.
- The sampling algorithm (Algorithm 2) of VMD assumes a fixed latent variable $z$ and argmax sampling. This is different from the latent variable model definition in equation (3) (see the sketch after this list). The authors should explain the rationale behind this design.
- In line 218, different blocks employ independent latent variable priors. This is not very intuitive for me, as different blocks should be correlated with each other. Could the authors please explain why they consider independent priors for blocks?
- The experiments are limited. The authors only test text generation quality on the text8 dataset, whereas baseline methods, including MDLM, consider at least the OpenWebText dataset with at least a GPT-2 model size. Experiments with larger backbones and datasets would better demonstrate VMD's superiority.
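To illustrate the point about the sampler (my notation, paraphrasing rather than quoting the paper's equations): the model is defined as a continuous mixture over the latent, while Algorithm 2 appears to commit to a single draw of $z$ and decode by argmax, which induces a different, essentially deterministic distribution.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Model as defined (eq. (3)-style): a mixture over the latent variable.
\[
p_\theta(x_0 \mid x_t) = \int p(z) \prod_{i} p_\theta\left(x_0^i \mid x_t, z\right) \mathrm{d}z .
\]
% What Algorithm 2 appears to do: draw a single z* from p(z), then decode by argmax,
\[
\hat{x}_0^i = \arg\max_{v} \, p_\theta\left(x_0^i = v \mid x_t, z^{*}\right),
\]
% which is deterministic given z* and need not match the mixture above.
\end{document}
```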
Please see the weaknesses.
Fully human-written |