ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated   | 0 (0%)   | N/A        | N/A            | N/A
Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A
Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited    | 0 (0%)   | N/A        | N/A            | N/A
Fully human-written  | 4 (100%) | 4.50       | 3.50           | 2516
Total                | 4 (100%) | 4.50       | 3.50           | 2516
Individual Reviews
Title: A Theoretical Analysis of Discrete Flow Matching Generative Models
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
The paper provides useful bounds for Conditional Discrete Flow Matching (CDFM), controlling the sampling error in total variation by a so-called risk directly linked to the training loss. The authors also provide quantitative universality bounds for Transformers in the context of CDFM.

Strengths:
- The main body of the paper is well written.
- The main TV bound is useful.
- The Roadmap on page 6 is a nice addition to the paper that helps the reader understand how the proof works.

Weaknesses:
Major: the math is particularly clumsy and not up to the expected level of a purely theoretical paper.
- The hypothesis that the velocity approximator is "factorized" seems unnecessary.
- The whole Appendix B needs rewriting. In fact, the whole section is a combination of statements and proofs of undergraduate level written in a rather inefficient way. For instance:
  - Lemmas B.5 and B.8 are standard undergraduate analysis statements (and the proof of B.8 is incorrect under the assumptions used; it is a common mistake, but still...). Similar remarks apply to part of Section D.
  - Lemmas B.9 and B.6 are so standard they could probably be stated without proof. Lemma B.10 would be greatly simplified by using the Lipschitz constant of (X, Y) -> X + Y and better use of B.5.
  - Theorem B.17 is not original, and I fail to understand why it is rewritten here or what the proof written here brings to the discussion.
- Appendix C has too many problems to be acceptable:
  - Lemma C.3 is FALSE! It is only true if the matrices $U_{t,\theta}$ commute.
  - Hölder's inequality seems to be unknown to the authors, as they repeatedly bound the L1 norm of a product by the product of the L1 norms.
  - That being said, I do not doubt the final result, pending replacement of Lemma C.3 by a slightly more sophisticated inequality that can be obtained using Grönwall's lemma.

Minor:
- The chosen embedding of the vocabulary is non-standard, quite restrictive, and an unnecessary hypothesis.
- The same holds for the timestep embedding.
- The statement of Theorem 4.6 is unnecessarily complicated, and the notation for the space of transformers is not defined.
- Lemma 4.4 can most likely be found in a functional analysis book, although in the context of an ML article, reproving it directly as in Appendix E is acceptable.
- Some typos.
- Theorem 4.6 is expected to hold in some form for any architecture for which one can prove a quantitative universality theorem. It is not specific to transformers; stating the result only for transformers is rather odd.

Questions:
See Weaknesses.

EditLens Prediction: Fully human-written
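To make the Grönwall suggestion above concrete, a minimal sketch follows. It assumes Lemma C.3 concerns the solution of the forward (Kolmogorov) equation driven by the learned generators; the symbols $p_t$, $\hat p_t$, $Q_t$, $\hat Q_t$ are illustrative and not taken from the paper, with $\hat Q_t$ standing in for $U_{t,\theta}$.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Sketch (assumed notation): p_t solves the forward equation d/dt p_t = p_t Q_t
% (true generator); \hat p_t solves d/dt \hat p_t = \hat p_t \hat Q_t (learned
% generator, playing the role of U_{t,\theta}); both are row vectors of probabilities.
\begin{align*}
  \frac{d}{dt}\bigl(p_t - \hat p_t\bigr)
    &= (p_t - \hat p_t)\,\hat Q_t + p_t\,(Q_t - \hat Q_t), \\
  \|p_t - \hat p_t\|_1
    &\le \int_0^t \|p_s (Q_s - \hat Q_s)\|_1 \, ds
       + \int_0^t \|p_s - \hat p_s\|_1 \, \|\hat Q_s\|_{1 \to 1} \, ds, \\
  \|p_t - \hat p_t\|_1
    &\le \Bigl( \int_0^t \|p_s (Q_s - \hat Q_s)\|_1 \, ds \Bigr)
         \exp\Bigl( \int_0^t \|\hat Q_s\|_{1 \to 1} \, ds \Bigr)
         \quad \text{(Gr\"onwall)},
\end{align*}
% where \|\cdot\|_{1 \to 1} is the operator norm induced by the l1 norm on row
% vectors. No commutativity of the generators is needed for this bound.
\end{document}
```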
Title: A Theoretical Analysis of Discrete Flow Matching Generative Models
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper provides a theoretical analysis of end-to-end training of DFM models. The authors bound the distance between the generated distribution and the true data distribution by the risk of the learned probability velocity. This risk is then analyzed through two sources: approximation and estimation errors. The approximation error is determined by the capacity of the transformer, and the estimation error is due to training on a finite dataset. The paper provides theoretical justification (for the first time in discrete flow matching) that model error can be controlled and that convergence is guaranteed under ideal conditions.

Strengths:
The paper is well written. To my knowledge, this is the first formal proof of convergence for DFM. It provides a valuable theoretical contribution to the growing area of discrete generative modeling and flow-matching methods. The authors decompose the error in learning the probability velocity into the approximation error (due to the model architecture) and the estimation error (due to the sample size), assuming that training is done ideally.

Weaknesses:
The analysis is built upon a slightly different version of the loss function from the original discrete flow matching (Gat et al., 2024). It would be better to elaborate a bit more on the connection between those loss functions and to justify the use of the squared $\ell_2$ norm as the Bregman divergence. There is a connection between discrete diffusion models and discrete flow matching: the original DFM training loss is identical to the masked diffusion training loss, and there exists earlier literature on the convergence analysis of discrete diffusion models. It would be better to include some discussion of the similarities, differences, and novelty compared to earlier works on discrete diffusion. In addition, since the paper only provides upper bounds on the TV distance and the probability velocity risk, I feel that it needs some empirical evidence to show how tight those bounds are.

Questions:
1. In the original discrete flow matching, the training loss uses cross-entropy terms (see Eq. (28) in Gat et al., 2024). Is it equivalent to the loss term in Eq. (2.10), which uses the $\ell_2$ norm in the Bregman divergence?
2. Does using a different form of Bregman divergence affect the results?
3. In Theorem 3.1, is $T = 1$? How big is $t_0$?
4. Is it possible to show some empirical evidence of the scaling behavior (e.g., in $M$, $n$, or $d_0$) on some toy data?

EditLens Prediction: Fully human-written
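As a side illustration of the loss-function question above, the sketch below shows how a Bregman divergence recovers both the squared $\ell_2$ loss and the KL/cross-entropy loss depending on the generator $\phi$; whether this matches Eq. (2.10) of the paper and Eq. (28) of Gat et al. (2024) exactly depends on parametrizations not reproduced here.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Bregman divergence with generator \phi:
\[
  D_\phi(x, y) \;=\; \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle .
\]
% Squared l2 case: \phi(x) = (1/2)||x||_2^2 gives the MSE-type loss
\[
  D_\phi(x, y) \;=\; \tfrac{1}{2}\,\|x - y\|_2^2 .
\]
% Negative-entropy case: \phi(x) = \sum_i x_i \log x_i, restricted to the
% probability simplex (so that \sum_i x_i = \sum_i y_i = 1), gives
\[
  D_\phi(x, y) \;=\; \sum_i x_i \log \frac{x_i}{y_i} \;=\; \mathrm{KL}(x \,\|\, y),
\]
% which, for a one-hot target x = e_{i^*}, reduces to the cross-entropy -\log y_{i^*}.
\end{document}
```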
Title: A Theoretical Analysis of Discrete Flow Matching Generative Models
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The authors provide a full error analysis for a discrete flow-matching model parametrized by a transformer, trained on factorized mixture path interpolants. On the technical level, a central result is their construction of a smooth velocity field defined on a continuous domain, extending the true velocity field, which takes discrete arguments. This in turn allows them to apply approximation results for transformers and reach error guarantees, alongside norm bounds for the model parameters.

Strengths:
The work extends the growing body of work on theoretical analyses of flow-matching models to the discrete case. This extension requires technical tools such as Lemma 4.4, which bridges the discrete and continuous settings. The contribution is, to the best of my awareness, novel and interesting, and will thus prove of interest to the ICLR community. The paper is overall clearly written, and the main technical ideas are sufficiently explained and motivated. In this light, I am overall in favor of acceptance; however, I have limited familiarity with the field and did not read the proofs in detail.

Weaknesses:
My main criticism is the lack of sufficient discussion on certain aspects, which I highlight in the questions section.

Questions:
- The main weakness of the bound is an error guarantee exhibiting a curse of dimensionality with the effective dimension $Md_0$. Could the authors provide more intuition or discussion on why this is the case? Section 6 discusses the prefactor but not the rate. Notably, it seems that this depends on the choice of the model through $d_0$ and is not a fundamental limitation. If this is the case, could the authors briefly discuss this point and how it could be refined in future work?
- I understand that the discrete-to-continuous mapping leverages the natural inclusion embedding of $\mathcal{V}$ in $\mathbb{R}$, with vocabulary items of higher index receiving larger values. Naively, it seems that this embedding, unlike e.g. one-hot encodings, does not treat all vocabulary items in the same fashion, and is not the most natural choice. Could the authors comment on this, and on how this choice impacts (or not) the bounds?
- Further motivation for the choice of transformer architectures would be helpful. In particular, many previous works on continuous generative models studied ResNet or auto-encoder architectures (e.g., Boffi et al., Shallow diffusion networks provably learn hidden low-dimensional structure, 2024). Is the transformer particularly adapted to the discrete case, or simply chosen to reflect most practical settings?
- (Minor) l. 266: "total variation distance bound scales exponentially with vocabulary size M". I am confused, as I do not see any exponential dependence in Theorem 3.1.
- I invite the authors to include further comparison with analyses in the continuous case to better highlight the main differences, which would be interesting. In particular, I am wondering whether the authors have an idea of how an analysis of the time-discretization error would proceed and differ in the discrete setting?

EditLens Prediction: Fully human-written
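As an illustration of the discrete-to-continuous bridge discussed above, the sketch below recalls the classical McShane–Whitney Lipschitz extension; the set $S$, function $f$, and constant $L$ are illustrative symbols, and the paper's Lemma 4.4 presumably uses a smoother (and possibly quite different) construction.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% McShane--Whitney extension: if f : S -> R satisfies |f(y) - f(y')| <= L ||y - y'||
% on a finite set S of embedded vocabulary points in R^d, then
\[
  \tilde f(x) \;=\; \min_{y \in S} \bigl[\, f(y) + L\,\|x - y\| \,\bigr]
\]
% is L-Lipschitz on all of R^d and agrees with f on S. A continuous-domain
% approximator (e.g., a transformer) can then target \tilde f while matching
% the true, discretely defined velocity field on the embedded vocabulary.
\end{document}
```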
Title: A Theoretical Analysis of Discrete Flow Matching Generative Models
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper provides the first end-to-end theoretical analysis of Discrete Flow Matching (DFM). It establishes: (1) an intrinsic error bound linking the velocity field risk to the distributional error; (2) an approximation theory showing transformers can approximate DFM velocity fields; (3) an estimation theory giving statistical convergence rates for training DFM transformers. In summary, the paper produces a complete approximation and estimation theory, analogous to classical diffusion and flow matching results, but for discrete flow matching.

Strengths:
(1) The paper proposes a rigorous end-to-end theory for discrete flow matching.
(2) The chain of results (intrinsic error, approximation error, estimation error) is elegant and well-motivated.
(3) The derivation of total variation bounds via Kolmogorov equations and Grönwall's inequality seems technically non-trivial and new.

Weaknesses:
(1) The paper currently lacks empirical experiments. In practice, it is hard to make the training objective very small, while the error bound relies heavily on terms like $\sqrt{M}\exp(M)$. The resulting guarantees could be loose for typical $M$, limiting practical interpretability.
(2) Some modeling assumptions appear stronger than usual, for example time-Hölder smoothness of CTMC paths, exponentially large UA constants for transformers, and a global velocity-field bound. Clarifying why these are needed and how sensitive the results are to them would improve transparency.

Questions:
The theoretical analysis is interesting; I would consider raising my score if the authors address the following questions:
(1) Could the authors analyze error bounds for the out-of-vocabulary (OOV) case, now or in future work? The current theory appears to assume a finite discrete state space; with OOV tokens, how should the problem be treated?
(2) Recent work (e.g., Dirichlet Flow, H. Stark 2024; Diffusion LM, Shen et al., 2025) models discrete diffusion/flow using cross-entropy objectives that naturally yield KL bounds. Why is MSE chosen here instead, and why analyze total-variation (TV) bounds rather than KL?

EditLens Prediction: Fully human-written
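As a note on the TV-versus-KL question above, the two metrics are linked by Pinsker's inequality, so a KL (cross-entropy) analysis readily implies a TV bound; a minimal reminder follows, using the convention $\mathrm{TV}(p,q) = \tfrac{1}{2}\|p-q\|_1$.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Pinsker's inequality: for probability distributions p, q on a finite alphabet,
\[
  \mathrm{TV}(p, q) \;=\; \tfrac{1}{2} \sum_x |p(x) - q(x)|
  \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p \,\|\, q)} ,
\]
% so any KL bound obtained from a cross-entropy objective immediately yields a
% total-variation bound. The converse does not hold in general, since KL can be
% infinite even when TV is small.
\end{document}
```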