ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 4 (100%) | 4.50 | 3.50 | 2997 |
| Total | 4 (100%) | 4.50 | 3.50 | 2997 |
Individual Reviews
Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces a technique that adds log-probability estimates to popular shortcut distillation techniques. The authors achieve this by considering the combined sample and log-density differential equation and writing down additional loss terms for the latter part in terms of the divergence of the instantaneous velocity. They specify these terms for two popular frameworks: shortcut models and MeanFlow. The authors then evaluate their method on CIFAR10 and ImageNet64, reporting NLL and FID for their method and for baselines based on both MeanFlow and shortcut models. The authors also present an experiment using self-guidance for improved FID.

Strengths:
- The paper, especially up to the experiments section, is very well written, with easy-to-follow math presented with consistent notation throughout.
- I think the topic is interesting, with potentially important untapped potential in current generative models.
- The method seems sound, and the derivations (where applicable) seem correct.

Weaknesses:
- My main concern with the paper is that the contribution seems somewhat limited. While it is, to the best of my knowledge, true that this method has not been applied before, it seems like a relatively straightforward extension of earlier shortcut methods.
- My other concern is that the experiments present some confusing results, and for a paper targeting log-likelihood, the provided tables focus heavily on FID and not much on log-likelihood. Other than the small experiment with self-guidance, there are also no experiments tailored specifically to investigate the quality of the log-likelihood approximation.
- Continuing on the experimental evaluation: it seems strange to me to compare the dataset NLL to the dataset NLL of an expert, since this compares averages (i.e., it is a summary statistic). In my opinion it would make more sense to compare likelihoods directly on samples and compute a mean of errors, such as the mean squared error or mean absolute error; see the sketch after this list. In the current format, individual samples could have arbitrarily large errors while the error of the mean could still be small.
- I would have hoped to see an experiment where the log-likelihood is more of a first-class member, to see the effect on downstream performance. The authors mention, for example, the PPO family of methods in their introduction. It could also be instructive to add a toy experiment in 1D or 2D where the ground-truth log-density function is available.
- The authors do not seem to address the question of whether the log-density still matches the log-density of the distribution. It seems to me that while the original distribution is slow to sample from and the log-density similarly slow to evaluate, at least there is the guarantee that the log-density is linked to the samples of the distribution. In this particular case, both parts (log-density and samples) are approximated, and it seems to me that there is no guarantee that these approximations match.
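Editor's note: a minimal sketch of the per-sample comparison suggested above. The arrays `ll_teacher` and `ll_student` are hypothetical per-sample log-likelihoods (they do not come from the paper); the point is only that a near-zero gap between dataset means can coexist with large per-sample errors.

```python
# Illustrative sketch (not from the paper): "difference of dataset averages"
# versus per-sample error metrics for log-likelihood estimates.
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for per-sample log-likelihoods (nats) from a many-step teacher
# and a few-step student on the same test set.
ll_teacher = rng.normal(loc=-3500.0, scale=50.0, size=10_000)
ll_student = ll_teacher + rng.normal(loc=0.0, scale=30.0, size=10_000)  # errors cancel on average

gap_of_means = abs(ll_student.mean() - ll_teacher.mean())  # what comparing dataset NLLs measures
mae = np.abs(ll_student - ll_teacher).mean()               # mean absolute per-sample error
mse = ((ll_student - ll_teacher) ** 2).mean()              # mean squared per-sample error

print(f"|gap of means| = {gap_of_means:.2f} nats")  # small: per-sample errors cancel
print(f"MAE            = {mae:.2f} nats")           # large: individual estimates are off
print(f"MSE            = {mse:.2f} nats^2")
```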
Questions:
- Can the authors comment on whether the approximate log-density that is calculated is still a relevant number relative to the approximate distribution? (The coupled ODE linking the two is recalled in the note after this review.)
- What makes NLL predictions "invalid" (Tables 1 and 2, gray fields)? Is this an arbitrary threshold that the authors chose, or is there something else that makes them invalid?
- For CIFAR10, why do the authors underperform MeanFlow in some cases and outperform it in others? Since the method operates on a separate head, the sampling performance should be fairly independent of it; can the authors elaborate?
- For CIFAR10, it seems that the MeanFlow models produce better FID as the number of sampling steps decreases. This is contrary to what I would expect, and also contrary to what is reported in the original paper. Can the authors elaborate on why this would be? Can the authors answer the same question for MeanFlow-F2D2?

EditLens Prediction: Fully human-written
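Editor's note: for readers less familiar with the setup this review refers to, the coupled system is the standard instantaneous change-of-variables ODE from continuous normalizing flows / flow matching; the notation below ($v_\theta$, $p_t$) is generic and not taken from the paper.

$$
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t),
\qquad
\frac{\mathrm{d}\,\log p_t(x_t)}{\mathrm{d}t} = -\nabla \cdot v_\theta(x_t, t),
\qquad
\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \nabla \cdot v_\theta(x_t, t)\,\mathrm{d}t .
$$

Both the few-step sampler (first equation) and the few-step likelihood (second and third) approximate the same trajectory, which is why the reviewer asks whether the two approximations remain consistent with each other.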
Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper targets the important problem of giving FM-based generative models fast likelihood estimation. The proposed method builds on flow-map distillation methods in the sampling space, such as the shortcut model and MeanFlow. The authors conduct experiments on CIFAR-10 and ImageNet-64 to demonstrate that their NLL can be effectively calculated in a single step.

Strengths:
- This paper focuses on an important problem and is well motivated; as the authors mention, fast likelihood computation could open up a wide area of future work, for example in the RL community.
- The writing is clear and easy to follow.
- The proposed method is simple and straightforward, without the need for adversarial training or other additional tricks.
- The results on CIFAR-10 demonstrate that the proposed model can achieve state-of-the-art sampling quality in one step while having good NLL performance.

Weaknesses:
While the proposed approach sounds promising, the experimental results are insufficient to fully support the paper's claims.
- The ImageNet 64x64 model's FID is far from state-of-the-art.
- MeanFlow has reportedly been unstable according to the community, and adding more loss terms could further affect its stability on larger datasets/models. The scalability of the proposed approach is not studied.
- One important baseline, TarFlow [1], is missing, given that it can also do fast NLL calculation and sampling.

[1] Zhai, S., et al. "Normalizing flows are capable generative models." 2025. https://arxiv.org/abs/2412.06329

Questions:
- Why did the FID drop in Table 1 for MeanFlow with F2D2 compared to without? Limited network capacity?
- Could the authors clarify why MeanFlow-F2D2 is initialized from a Shortcut-Distill model? This makes it harder to judge the effectiveness of MeanFlow for the likelihood distillation.

EditLens Prediction: Fully human-written
Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces Fast Flow Joint Distillation (F2D2), which jointly learns the sampling trajectory and the divergence using a joint self-distillation objective and a single model, enabling few-step sampling and fast log-likelihood evaluation in flow-based models. It shows that the method is compatible with Shortcut and MeanFlow models and can be further distilled. The proposed method is evaluated on image datasets and shows promising results.

Strengths:
1. It is interesting to learn the vector field and the divergence with a shared backbone and separate prediction heads, since the two quantities are related to each other.
2. The proposed method can be extended to the Shortcut model and the MeanFlow model.

Weaknesses:
1. The objective of Shortcut-F2D2 is composed of 4 components. It is not clear whether any one of the components dominates training and drives the improved performance; an ablation study is required here. Also, are the 4 components equally weighted? They appear equally weighted in the equation, but what happens if the weights are tuned?
2. Though the idea of learning the velocity field and the divergence jointly is interesting, I wonder whether it might lead to instability during training, since the velocity-field head learns a directional vector while the divergence head learns the expectation of the trace via the Hutchinson estimator, which involves a gradient of the vector field (a sketch of this estimator follows the review). What if the two do not share the same numerical range? Again, an ablation study would help here.
3. For the experiments, no computational cost or runtime is reported.

Questions (following the weaknesses):
1. In Table 1, I find it hard to interpret the results, as no method performs consistently better than the others across different sampling steps, whether the existing methods or the proposed ones. The shortcut-distillation method sometimes outperforms the shortcut model. I wonder whether there is actually no pattern and the good performance happens by chance.
2. For Shortcut-Distill-F2D2, only the vector-field training is described; how is the divergence trained there?
3. How are invalid NLLs computed in Table 1?
4. I am not sure why we need to learn the divergence term from the beginning. Presumably we can learn the instantaneous/average vector fields well, and this induces the marginal density $\rho_t(x)$ at each $t$; it then automatically satisfies the continuity equation, which means we know the divergence as well. If this is correct, learning the divergence is redundant. Please correct me if I am wrong.

EditLens Prediction: Fully human-written
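Editor's note: as context for weakness 2 above, here is a minimal, generic sketch of a Hutchinson-style divergence estimator as it is commonly implemented for CNF-style likelihoods. The network `velocity` and its signature are placeholders, not the paper's actual code; the toy linear field at the bottom only exists so the script can be run and checked.

```python
# Generic sketch of the Hutchinson trace estimator for the divergence of a
# velocity field, as commonly used for CNF-style log-likelihoods.
import torch

def hutchinson_divergence(velocity, x, t, n_probes=1):
    """Estimate div_x v(x, t) = tr(dv/dx) via E_eps[eps^T (dv/dx) eps]."""
    x = x.requires_grad_(True)
    v = velocity(x, t)
    div = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_probes):
        eps = torch.randn_like(x)  # Gaussian probes; Rademacher also works
        # vector-Jacobian product eps^T (dv/dx), then dot with eps
        vjp = torch.autograd.grad(v, x, grad_outputs=eps, create_graph=True)[0]
        div = div + (vjp * eps).flatten(1).sum(dim=1)
    return div / n_probes  # unbiased, but the variance grows with dimension

# Toy check with a linear "velocity" whose exact divergence is known (= 3 * dim):
if __name__ == "__main__":
    dim = 8
    v_toy = lambda x, t: 3.0 * x
    x0 = torch.randn(4, dim)
    est = hutchinson_divergence(v_toy, x0, t=0.0, n_probes=64)
    print(est, "vs exact", 3.0 * dim)
```

The estimator is unbiased, but its variance scales with the dimension and the number of probes, which is one way to read the reviewer's concern about mismatched numerical ranges between the two heads.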
Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes Fast Flow Joint Distillation (F2D2), a framework that jointly distills (i) a few-step sampling flow map and (ii) a few-step log-likelihood evaluator for flow-matching models by learning a shared backbone with two heads: an average-velocity head and a cumulative-divergence head. The key idea is that sampling and likelihood evaluation require solving a coupled ODE system driven by the same velocity field, which makes training a joint backbone network possible. The method is evaluated empirically with two families of few-step samplers, Shortcut-F2D2 and MeanFlow-F2D2, plus a practical Shortcut-Distill-F2D2 variant that warms up from a teacher flow. On CIFAR-10 and ImageNet-64, F2D2 produces valid NLLs in a few NFEs and has competitive FID. The paper also proposes a simple maximum-likelihood self-guidance procedure that improves sample quality at the cost of an additional forward and backward pass.

Strengths:
- Framing sampling and likelihood evaluation within the flow-map framework for distillation, using a shared backbone plus two heads to reduce NFE, is well motivated.
- F2D2 is shown with the semigroup Shortcut and Eulerian MeanFlow variants, demonstrating that it can plug into different flow-map methods.
- The writing is clear and easy to follow.
- The Hutchinson trace estimator for the divergence and the staged warm-start are sensible training choices.
- The maximum-likelihood self-guidance algorithm is simple and provides insight into inference-time scaling for image generation.

Weaknesses:
- The experiment section says F2D2 yields NLLs "close to the teacher's BPD," but several entries (e.g., MeanFlow-F2D2 at 2 and 8 steps) show BPD values materially below the teacher's on CIFAR-10. What's more, the FID for MeanFlow-F2D2 is the best among the settings. This raises the question of whether training for sampling and training for likelihood estimation are mutually beneficial objectives, and whether the shared-backbone approach is justified.
- Using a 1024-step teacher as the "reference" BPD does not certify correctness of the few-step NLLs; it only checks consistency with a particular numerical reference. An ablation on toy cases where the true likelihood is easily computed, comparing the F2D2 likelihood with the ground-truth likelihood, would better establish accuracy (a sketch of such a check follows this review).
- Results are on CIFAR-10 and ImageNet-64 with unconditional generation. Users often care more about higher resolutions and conditional tasks. The limitations section acknowledges compute constraints, but this still leaves open how the approach scales to large-scale, conditional generation.
- The paper argues for parameter sharing with dual heads. To validate the claim, it would be informative to see ablations on (i) shared vs. separate backbones and (ii) the dynamics of each loss term.

Questions:
1. Can you provide ablations on parameter sharing vs. separate backbones and report the teacher quality for Shortcut-Distill-F2D2? What is the variance of the Hutchinson-based divergence estimator, and how does it affect NLL stability and bias?
2. The experimental results show that MeanFlow-F2D2 has better FID than MeanFlow at low resolutions. MeanFlow conducted experiments on ImageNet 256x256, where the image quality is better. How does F2D2 perform at higher resolutions, and are there any qualitative failure modes when moving beyond 64×64? You mention early-stopping sensitivity for training in the Appendix, which raises some questions about scaling up training compute.
3. For Algorithm 2, can you provide experimental results on extending self-guidance beyond one iteration (e.g., x-axis the number of iterations, y-axis FID/NLL)? Is there any collapse after one iteration?
4. What is the total wall-clock time and compute used for training the divergence head compared to the sampler and the teacher?

EditLens Prediction: Fully human-written
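Editor's note: since both this review and the first one ask for a check against an analytically known density, here is a minimal sketch of what such a toy validation could look like. Everything here is hypothetical: `few_step_log_density` is a stand-in for whatever few-step likelihood evaluator is being tested, faked with noisy ground truth only so the script runs end to end.

```python
# Hypothetical toy check: compare a few-step log-density estimator against the
# exact log-density of a 2D Gaussian mixture, where ground truth is available.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
means = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 2.5]])
weights = np.array([0.3, 0.3, 0.4])

def true_log_density(x):
    # log p(x) = logsumexp_k [ log w_k + log N(x; mu_k, I) ]
    comp = np.stack([multivariate_normal(mean=m, cov=np.eye(2)).logpdf(x) for m in means], axis=-1)
    return logsumexp(comp + np.log(weights), axis=-1)

def few_step_log_density(x):
    # Stand-in for the model's few-step likelihood head (replace with the real one).
    return true_log_density(x) + rng.normal(scale=0.05, size=len(x))

# Draw test points from the mixture itself.
k = rng.choice(len(weights), p=weights, size=5000)
x = means[k] + rng.normal(size=(5000, 2))

lp_true, lp_est = true_log_density(x), few_step_log_density(x)
print("gap of means:  ", abs(lp_est.mean() - lp_true.mean()))
print("per-sample MAE:", np.abs(lp_est - lp_true).mean())
```

Reporting both numbers on such a toy case would directly address the accuracy concern, independent of the 1024-step teacher reference.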