ICLR 2026 - Reviews

Reviews

Summary Statistics

EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated | 0 (0%) | N/A | N/A | N/A
Heavily AI-edited | 1 (25%) | 6.00 | 3.00 | 5158
Moderately AI-edited | 1 (25%) | 0.00 | 4.00 | 5976
Lightly AI-edited | 0 (0%) | N/A | N/A | N/A
Fully human-written | 2 (50%) | 7.00 | 2.50 | 3848
Total | 4 (100%) | 5.00 | 3.00 | 4707
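The aggregate rows can be cross-checked against the individual reviews listed on this page. The sketch below is only a sanity check: it takes the rating and confidence numbers from each review's header and treats the review whose rating label is blank ("Rating: 0:") as a recorded rating of 0, which is an assumption consistent with the 0.00 average shown for that category.

```python
from statistics import mean

# (EditLens prediction, rating, confidence) for the four reviews on this page.
# The 0 rating is read from the "Rating: 0:" field, whose label text is blank.
reviews = [
    ("Fully human-written", 4, 2),
    ("Moderately AI-edited", 0, 4),
    ("Heavily AI-edited", 6, 3),
    ("Fully human-written", 10, 3),
]


def summarize(rows):
    """Return (count, mean rating, mean confidence) for a list of review rows."""
    return len(rows), mean(r[1] for r in rows), mean(r[2] for r in rows)


# Group the rows by their EditLens prediction label.
by_label = {}
for row in reviews:
    by_label.setdefault(row[0], []).append(row)

per_label = {label: summarize(rows) for label, rows in by_label.items()}
total = summarize(reviews)

for label, (count, rating, confidence) in per_label.items():
    print(f"{label}: n={count}, rating={rating:.2f}, confidence={confidence:.2f}")
print(f"Total: n={total[0]}, rating={total[1]:.2f}, confidence={total[2]:.2f}")
```

The two categories with a count of 0 have no rows to average, which matches the N/A entries in the table.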
Title Ratings Review Text EditLens Prediction
Test-time scaling of diffusions with flow maps
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

The authors introduce Flow Map Trajectory Tilting (FMTT), a method for sampling from reward-tilted probability distributions in the setting of flow-map-based generative models. The most widely known method for doing this in the case of diffusion models is classifier guidance, which consists of adding the gradients of a learned reward function to the score predictions of a diffusion model to guide samples towards regions of high reward. While this approach is intuitive, it needs to employ heuristics to compute rewards for noised latents, as the pre-trained reward function is trained only on noise-free samples. The authors aim to bridge this gap by considering generative models parameterized by flow maps (i.e. maps $X_{s, t}$ associated with an ODE such that, for any solution $x$ of this ODE, $X_{s, t}(x_s) = x_t$). In this setting, at noise level $s$, an estimate of the noise-free latent can be computed as $X_{s, 1}(x_s)$ (where $t=1$ corresponds to the data distribution). The reward can then be evaluated at this estimate, and its gradient can be added to the score (which is also obtainable from the flow map). However, directly applying this approach does not produce samples from the correct reward-tilted data distribution. Rather, the authors show the dynamics have an extra additive term that is not present in the dynamics of the true reward-tilted measure of interest. They propose correcting for this additional term through importance sampling. The importance sampling weights are obtained from the Jarzynski estimator, which takes on a simplified form in the specific case of interest. The authors then propose using Sequential Monte Carlo to sample trajectories with these importance weights, hence producing unbiased samples from the true reward-tilted measure. Empirically, FMTT shows (i) modest GenEval gains on human-preference rewards; (ii) strong improvements on geometric/structural rewards (masking, symmetry, rotation), where flow-map look-ahead outperforms denoiser/diffusion look-ahead; (iii) the first effective use of vision-language models as rewards for natural-language alignment/editing.

- Important problem: reward conditioning in diffusion models and related generative models is a fundamental question with far-reaching implications for various fields like image and video generation, robotics and scientific applications.
- Non-trivial, theoretically principled technique: the construction (and comprehension) of FMTT demands a fair amount of sophistication in stochastic analysis, PDEs and Monte Carlo methods. As such, most researchers in the field would not have been able to arrive at these ideas on their own, meaning this paper adds counterfactual value to the field, provided the techniques’ effectiveness can be conclusively demonstrated.

- Unclear motivation for why flow maps are more appropriate than diffusion models for reward guidance: in section 2.2, the authors claim that passing a noisy latent $x_s$ through a denoiser $D$ before computing the reward is inappropriate due to the denoiser providing little information early on in the dynamics. It is not a priori clear to me why using the flow map $X_{s,1}$ instead would not share the same problem for small $s$. It would be good if the authors could clarify this.
- The presentation makes comprehension difficult for those not already familiar with the technical tools used in this work. For example, the section on SMC and thermodynamic length is very difficult to understand if one is not already familiar with SMC and related concepts, even having a background in diffusion models and stochastic analysis. From reading Proposition 2.3 in the main paper, it is not clear what thermodynamic length is or what intuitions it corresponds to. Similarly, it is not intuitive how one arrives at the SDE for the importance weights in Equation 21. Rephrasing these technical sections with a bigger focus on intuition would likely help technical readers who are not experts in the specific technical tools used here to better understand the method. I believe improving this will be important to ensure the broader community can understand the method and consider adopting it or building on it. It would also be good to provide background on SMC in section 2.1.
- Scope of evaluations. Many results are qualitative for specialized rewards; quantitative studies (e.g. multiple datasets, objective quantification of output quality) would strengthen claims beyond case studies. There exists literature on using diffusion for solving constraint satisfaction problems; see e.g. https://arxiv.org/abs/2211.15657 . Such setups could be adapted here to demonstrate whether FMTT leads to practical improvements in constraint satisfaction in these environments where success is likely easier to measure, compared to image generation.
- Lack of a systematic account of how performance scales with NFE: given that the paper positions itself as a test-time scaling technique, it is important to have a sense of how performance varies with the amount of compute used at inference time. NFE numbers are reported in Table 1, but a more rigorous comparison would give a better sense of e.g. whether FMTT Pareto-dominates other baselines, or whether it only performs better if significant compute is expended.

See the Weaknesses section.

I consider this paper to be interesting from a theoretical and conceptual perspective, but it is, in its current form, held back (in my opinion) by unclear exposition and insufficient evaluations to establish the practical performance of FMTT relative to other approaches in a systematic, quantitative way. I believe it can be possible to address this in the rebuttal, and I am open to increasing my score if the authors extend their qualitative comparisons to quantitative ones, include performance-vs-compute scaling charts for FMTT and relevant baselines in some of the tasks considered in Table 1, and make the exposition clearer regarding motivation and SMC.

EditLens Prediction: Fully human-written
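To make the flow-map notation used in this review concrete, here is a minimal sketch on a toy linear ODE $dx/dt = a x$, whose flow map has the closed form $X_{s,t}(x) = x e^{a(t-s)}$. The ODE, the coefficient `A`, the quadratic `reward`, and all function names are assumptions chosen for illustration; none of this is the paper's model or code. The sketch checks the defining property $X_{s,t}(x_s) = x_t$ and differentiates a reward evaluated at the look-ahead $X_{s,1}(x_s)$ with the chain rule.

```python
import math

A = -1.5  # hypothetical drift coefficient of the toy ODE dx/dt = A * x


def flow_map(x, s, t, a=A):
    """Closed-form flow map X_{s,t}(x) = x * exp(a * (t - s)) of dx/dt = a * x."""
    return x * math.exp(a * (t - s))


def euler_solve(x, s, t, a=A, steps=10_000):
    """Integrate dx/dt = a * x from time s to time t with explicit Euler steps."""
    h = (t - s) / steps
    for _ in range(steps):
        x += h * a * x
    return x


def reward(x, target=0.2):
    """Hypothetical reward, meaningful only for 'clean' samples at t = 1."""
    return -(x - target) ** 2


def lookahead_reward_grad(x_s, s, target=0.2, a=A):
    """Gradient of reward(X_{s,1}(x_s)) with respect to x_s, via the chain rule."""
    x1 = flow_map(x_s, s, 1.0, a)           # look-ahead to the final sample
    dreward_dx1 = -2.0 * (x1 - target)       # derivative of the reward at x1
    dx1_dxs = math.exp(a * (1.0 - s))        # Jacobian of the flow map
    return dreward_dx1 * dx1_dxs


x0 = 2.0
x_mid = flow_map(x0, 0.0, 0.4)               # state at the intermediate time s = 0.4
x_end_direct = flow_map(x0, 0.0, 1.0)        # jump straight from t = 0 to t = 1
x_end_composed = flow_map(x_mid, 0.4, 1.0)   # look-ahead from the intermediate state
x_end_numeric = euler_solve(x0, 0.0, 1.0)    # reference solution of the ODE
grad = lookahead_reward_grad(x_mid, 0.4)     # reward gradient seen at s = 0.4
```

In this toy case the look-ahead from any intermediate state lands on the same endpoint as solving the ODE, which is the property the review describes; for a learned flow map the agreement would only be approximate.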
Test-time scaling of diffusions with flow maps
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 0:
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

The paper proposes Flow Map Trajectory Tilting (FMTT), an inference-time (test-time) guidance procedure for diffusion / flow-based generative models. The core claim is that using an estimated flow map X_{t, 1} to "look ahead" to the terminal sample allows you to (1) inject reward gradients more meaningfully throughout the entire trajectory (even at early timesteps), and (2) construct importance weights whose form is allegedly simpler and theoretically grounded by a Jarzynski-style argument. The paper further claims that this enables both unbiased sampling from a tilted distribution and practical search for high-reward generations, including rewards defined by VLMs.

I think the idea is straightforward to follow, and the proofs are correct and easy to follow (I checked all the derivations in Section 2 apart from Propositions 2.1 and 2.2).

1. The method is mostly an application of known SMC / importance weighting ideas to reward-guided diffusion, swapping in a learned flow map to get a better guess of the final clean sample. The paper over-markets this as fundamentally new.
2. The theoretical section restates standard Jarzynski/SMC logic but does not analyze estimator variance, practical degeneracy, or approximation error in the learned flow map, which are the actual hard problems.
3. The paper leans on buzzwords (“test-time scaling,” “thermodynamic length”) and selective qualitative figures instead of giving a sober, controlled, reproducible evaluation.
4. Using a flow map is not a fresh idea at all. We have seen a number of works using a similar idea, such as mean flow and consistency models.
5. The paper lacks discussion of the motivation and intuition behind the methodology. Given that the "test-time adaptation of diffusions" it discusses has already been proposed in other work, the actual contributions of this work are just the flow map look-ahead and the bias correction via importance weighting w.r.t. diffusion paths.
6. Empirical evaluation is not convincing at all. The experiments are positioned as if they demonstrate strong practical wins. In reality, they’re narrow, sometimes undercontrolled, and in places close to cherry-picked.
a. MNIST “tilt to zeros”. The MNIST experiment is extremely weak as evidence. They “tilt” an unconditional model to make it generate images classified as the digit “0”, then report perfect classification accuracy and nicer thermodynamic length for their method. But MNIST digit steering with classifier gradients is a toy problem that nearly any conditional guidance trick will solve; it does not stress high-dimensional or semantic alignment. It is not credible evidence for claims about modern text-to-image alignment or test-time scaling in large vision-language reward scenarios.
b. GenEval and human-preference rewards. The gains they claim for FMTT over strong selection baselines (best-of-N, multi-best-of-N, beam search) are marginal and sometimes inconsistent across sub-metrics. In several columns, multi-best-of-N or beam search are already competitive. The paper itself admits that FLUX.1-dev is already post-trained on human preference data, so there's limited headroom. This basically undercuts their own main quantitative table: if your headline benchmark is already saturated, then it's not a good benchmark to demonstrate superiority.
c. There is no cost-quality frontier analysis. The method is pitched as “test-time scaling,” i.e. spend more inference compute to get better reward. But the paper does not plot reward vs. NFEs (function evals) for all baselines. Instead, it drops an NFE column in the table, but does not argue Pareto optimality. For instance, best-of-N with enough samples obviously improves, but they don't show how many samples FMTT effectively needs to match that reward level. Without compute-normalized curves, it's impossible to judge whether the method is actually more compute-efficient than naive sampling + selection.
d. The ablations (“FMTT - 1-step denoiser lookahead”, “FMTT - 4-step diffusion lookahead”) are potentially interesting, but they’re only reported as single numbers, with no statistical significance, no variance bars, and no visibility into failure cases. We don’t see if flow-map lookahead is robust across prompts or just happened to help on a subset of lucky prompts.

In a nutshell, empirically, the paper leans heavily on selective visuals and small, convenience-scale tasks. It does not provide rigorous, large-scale, statistically grounded evidence that FMTT is (a) reliably better, (b) more compute-efficient, or (c) more robust than strong existing inference-time steering / best-of-N search baselines.

Another line of work that I am more familiar with is "diffusion finetuning" (https://arxiv.org/html/2510.02692v1). I think this work's concept, "test-time scaling diffusion", is very similar to "diffusion finetuning". Can the authors explain the difference between these two lines of work? If there is no difference, can the authors explain why they adopt this title instead of "diffusion finetuning"? I suspect the authors are overclaiming just to force a connection to “test-time scaling,” and I really dislike this kind of behavior. I think the bias that the reward terms introduce into the sampling process is not a fresh conclusion at all from the perspective of "diffusion finetuning".

Besides, regarding the visualizations, I tried the prompts provided by the authors myself. I found that, at least for GPT5, it can work well for the prompts shown in Figure 3. The clock prompt in Figure 1 is indeed problematic for GPT5. I think the authors should also conduct a detailed ablation study of the precision of the guidance as well as the generation quality.

There are three typos I found:
1. line 191 "so that so that".
2. line 312, "proposition 3.2"
3. line 312, "proposition" lacks a hyperlink.

EditLens Prediction: Moderately AI-edited
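Point 2 of this review mentions estimator variance and weight degeneracy. As a rough illustration of what those terms refer to, the sketch below applies self-normalized importance weighting to a reward-tilted target in a toy 1D setting and reports the effective sample size (ESS). Everything here is an assumption made for illustration: the standard normal base distribution, the linear reward with strength `lam`, and the helper names. It is not the paper's Jarzynski-based trajectory weighting, only the generic reweight-and-resample step that SMC methods build on.

```python
import math
import random


def tilted_weights(samples, reward):
    """Self-normalized weights w_i proportional to exp(reward(x_i))."""
    log_w = [reward(x) for x in samples]
    m = max(log_w)  # subtract the max log-weight for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def effective_sample_size(weights):
    """ESS = 1 / sum_i w_i^2 for normalized weights; equals N when uniform."""
    return 1.0 / sum(w * w for w in weights)


def multinomial_resample(samples, weights, rng):
    """Draw len(samples) particles with replacement, proportionally to weights."""
    return rng.choices(samples, weights=weights, k=len(samples))


rng = random.Random(0)
n = 20_000
base = [rng.gauss(0.0, 1.0) for _ in range(n)]  # samples from the base N(0, 1)

lam = 1.0  # hypothetical reward strength; reward(x) = lam * x
weights = tilted_weights(base, lambda x: lam * x)

# In this toy case the tilted target p(x) * exp(lam * x) is N(lam, 1),
# so the weighted mean should be close to lam.
tilted_mean = sum(w * x for w, x in zip(weights, base))
ess = effective_sample_size(weights)
resampled = multinomial_resample(base, weights, rng)
resampled_mean = sum(resampled) / n
```

Increasing `lam` concentrates the weight on fewer samples and lowers the ESS, which is the degeneracy the reviewer asks the authors to analyze.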
Test-time scaling of diffusions with flow maps
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper introduces Flow Map Trajectory Tilting (FMTT), a novel algorithm for guiding diffusion models at test-time to generate samples that align with user-specified rewards. The core problem addressed is that rewards are often only well-defined for clean, fully generated data, making it difficult to apply reward gradients during the noisy intermediate steps of the diffusion process. Traditional methods approximate the final output using a denoiser, which is often inaccurate, especially in the early stages of generation. The key contributions of this work are:
1. It proposes using a flow map for a precise "lookahead" capability, allowing the model to accurately predict the final output from any intermediate noisy state. This solves the ill-posed nature of applying rewards mid-generation.
2. It develops FMTT, which integrates the gradient of the reward (evaluated on the accurately predicted future state) into the generative SDE, provably performing better ascent on the reward than standard methods.
3. The method provides a principled way to perform either exact sampling of the reward-tilted distribution via importance weighting or effective search for high-reward samples.
4. It demonstrates state-of-the-art performance on complex image generation tasks that require precise control and is the first work to successfully use pretrained VLMs as complex, language-defined reward functions for test-time guidance.

This paper introduces a novel method called Flow Map Trajectory Tilting (FMTT) to address the core challenge of guiding diffusion models at test-time to generate samples that better align with a user-specified reward. Its primary innovation is replacing the heuristic "look-ahead" of traditional methods, which rely on one-step denoisers, with an "exact look-ahead" provided by a flow map. This allows the model to predict the final generated output at any point during the trajectory, providing a much stronger and more accurate signal for the reward function, especially in the early, noisy stages of generation. The approach is theoretically principled, provably performing better reward ascent than standard techniques while also enabling unbiased, exact sampling of the target distribution through a simplified importance weighting scheme. The method is validated with high-quality and extensive empirical results. Experiments demonstrate that FMTT successfully generates images satisfying complex geometric constraints, such as symmetry and rotation invariance, where strong baseline methods fail. On the standard GenEval benchmark, FMTT achieves state-of-the-art performance, outperforming previous gradient-free and gradient-based search methods. Furthermore, the paper introduces "thermodynamic length" as a metric for sampling efficiency, showing that on a targeted MNIST generation task, FMTT not only achieves perfect classification accuracy but also has the lowest thermodynamic length, empirically confirming the method's efficiency. This paper presents a highly original, significant, and well-executed contribution to the field of controllable generative models.

The paper is very strong, but there are a few limitations or areas that could be further explored:
Dependency on Flow Maps: The primary requirement of the proposed method is the availability of a trained flow map. While the paper successfully uses a model distilled from FLUX.1-dev, this dependency means the method is not universally applicable to any off-the-shelf diffusion model that is only trained to predict noise or velocity. The practicality and cost of training or distilling a high-quality flow map for other architectures could be a potential barrier to adoption.
Computational Overhead of Search: The experiments demonstrate superior performance but could benefit from a more direct analysis of the trade-off between computational budget and reward improvement.

1. Could you elaborate on the sensitivity of FMTT to the quality of the flow map? How does the performance degrade if the flow map is less accurate, for instance, if it is distilled into a 1- or 2-step sampler instead of a 4-step one, or if it is trained with less data?
2. The paper demonstrates FMTT's effectiveness as a standalone guidance method. However, in many applications, other forms of "information injection" are used to control generation, such as ControlNet for spatial layout or IP-Adapters for style. Have you considered or tested the compatibility of FMTT with these methods? Could reward-based guidance from FMTT be combined with these other control mechanisms, or do you foresee potential conflicts in how they steer the generative process?
3. The experiments are based on a flow map distilled from the FLUX.1-dev model. Since its release, improved architectures within the same family, such as HiDream, have been developed. How do you expect FMTT would perform when applied to these more advanced base models? Is the effectiveness of the flow-map lookahead tied to specific properties of the original FLUX.1-dev, or would the benefits of FMTT be broadly applicable and potentially even enhanced when paired with a stronger foundation model?

EditLens Prediction: Heavily AI-edited
Test-time scaling of diffusions with flow maps
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 10: strong accept, should be highlighted at the conference
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper proposes a method to improve test-time sampling of diffusion models according to arbitrary rewards. The method is based on flow maps, which have two time arguments and can be used to "jump" to approximations of the solution rather than requiring full denoising. This is leveraged to apply the reward model to denoised samples, rather than noisy ones (which would be out-of-distribution). Several qualitative and quantitative results are presented.

This paper seems to present a well-reasoned proof that flow maps can be used to apply rewards to diffusion models with look-ahead, so that the rewards are more accurate than if they were applied to the noisy samples near t=0. While I did not check all the technical details, I did not find any errors in the math, and it makes sense. The experimental results are excellent, and show multiple creative uses of the rewards, especially in ways that are out of distribution for previous benchmarks. Barring any mistakes (which would be surprising given the good experimental results), this seems like a very solid contribution to the field.

I did not find any major weaknesses. I only have a few suggestions for improving the clarity slightly:
- Fig. 2 needs labelling of the time axis.
- Nabla is overloaded, both as an operator and stand-alone symbol (composed with dot product); although it is common notation, due to the overloading it would be safer to define it in all cases.
- Fig. 5: It would be more convincing to present a scatter plot of thermodynamic lengths vs. reward across samples, instead of just average bar plots, which are much weaker evidence for the conclusion.

I have no further questions for the authors.

EditLens Prediction: Fully human-written