ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 8.00 | 2.00 | 2342 |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (75%) | 5.33 | 3.00 | 5098 |
| Total | 4 (100%) | 6.00 | 2.75 | 4409 |
DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models

Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes DriftLite, a lightweight alternative to reward-driven particle dynamics derived from the Feynman–Kac formulation of the Fokker–Planck equation. While the principled dynamics (Prop. 2.1) yield unbiased sampling by reweighting trajectories according to a reward function, they suffer from severe weight degeneracy and are impractical for high-dimensional problems. DriftLite replaces explicit weighting with a Variational Coefficient Generator (VCG) that learns a low-rank, three-basis decomposition of the drift field, achieving a balance between theoretical faithfulness and numerical stability. The method demonstrates stable performance on challenging high-dimensional tasks, including protein–ligand systems.

Strengths:

## Strong theoretical grounding.
Proposition 2.1 provides a clean derivation from the Feynman–Kac representation, clarifying why naive guidance or reward-based drift correction leads to unnormalized density propagation. In addition, Proposition 3.1 theoretically characterizes the design space within the Fokker–Planck equation.

## Elegant practical solution.
The transition from weighted to unweighted dynamics via the VCG formulation is conceptually neat and computationally efficient. Representing the drift with three physically motivated basis components (potential, diffusive, and reward) yields a low-rank parameterization. Directly minimizing the variance of the potential term under this simple parameterization reduces the problem to a linear system, making the overall procedure remarkably simple (a minimal sketch of this construction appears after this review).

## Empirical validation on high-dimensional systems.
Although the approach projects the control drift $b$ onto a three-dimensional subspace, it scales to complex molecular and protein–ligand environments without losing stability, indicating that this design effectively captures the dominant drift modes.

Weaknesses:

## Partial theoretical exposition.
While Props. 2.1 and 3.1 are elegant, the paper omits intermediate derivations linking the reward term to the weighted dynamics; some readers may struggle to follow the jump from theory to implementation.

Questions:
Could the authors clarify whether the choice of three basis functions in the VCG has a theoretical grounding? In particular, does this low-rank representation guarantee that the dominant drift modes of the reward-driven dynamics are captured, or is it mainly an empirical observation?

EditLens Prediction: Moderately AI-edited
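For reference, the "elegant practical solution" described above hinges on the fact that restricting the control drift to a three-function basis turns a continuous optimization into a tiny per-particle linear solve. The sketch below shows only the generic mechanics of such a projection; the quadratic objective, the `target` field, and all variable names are assumptions for illustration and do not reproduce DriftLite's actual variance functional.

```python
# Illustrative sketch only: projecting a desired correction drift onto a
# three-function basis {score, reward gradient, base drift} and solving the
# resulting 3x3 normal equations per particle. `target` is a placeholder for
# whatever field the paper's objective asks the combination to match.
import numpy as np

def solve_basis_coefficients(basis, target, reg=1e-6):
    """basis: (3, d) array of basis vector fields at one particle; target: (d,) array.
    Returns coefficients c such that sum_i c[i] * basis[i] best matches target."""
    gram = basis @ basis.T                                 # 3x3 Gram matrix <phi_i, phi_j>
    rhs = basis @ target                                   # right-hand side <phi_i, target>
    return np.linalg.solve(gram + reg * np.eye(3), rhs)    # tiny ridge term for stability

# Toy usage with random stand-ins for grad log p_t, grad r_t, and the base drift u_t.
rng = np.random.default_rng(0)
d = 8
score, grad_r, u = rng.normal(size=(3, d))
basis = np.stack([score, grad_r, u])
coeffs = solve_basis_coefficients(basis, target=rng.normal(size=d))
controlled_drift = u + coeffs @ basis                      # base drift plus the solved correction
```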
DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models

Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces DriftLite, a training-free method for steering diffusion models at inference time using an improved sequential Monte Carlo (SMC) scheme. To mitigate the path degeneracy problem common in SMC, the authors derive an optimal, variance-minimizing drift for the governing Feynman–Kac PDE and propose two practical approximations: Variance-Controlling Guidance (VCG) and Energy-Controlling Guidance (ECG). These approximations can be obtained by solving a linear system at each sampling step, adding only a small computational overhead. The methods are analyzed on Gaussian mixture models, where they are empirically shown to reduce variance and improve sample quality. They are also tested on practical benchmarks, molecular systems and protein–ligand co-folding, demonstrating improved inference-time steering capability over the standard SMC baseline.

Strengths:
1. The paper is clearly written, and the core mathematical arguments are generally well explained.
2. The work tackles a significant but often-neglected challenge: weight degeneracy (or path degeneracy) in SMC-based diffusion steering (a generic sketch of the underlying bookkeeping appears after this review).
3. The paper translates a mathematical insight into novel and practical (training-free) algorithms.
4. Their experiments are well designed to support their arguments. The proposed methods clearly outperform the SMC baseline on real-world benchmarks like molecular systems and protein–ligand co-folding.

Weaknesses:
1. The method's practical limitations could be discussed in more detail. It introduces a significant computational overhead, with experiments showing up to a 6x increase in runtime over the SMC baseline. Furthermore, the reward-tilting framework fundamentally relies on access to the gradient of the reward, which is inaccessible in many black-box applications, limiting its practical scope.
2. The paper would be strengthened by adding an empirical analysis of the approximation error from VCG and ECG.
3. No source code is provided.

Questions:
1. I have little expertise in functional analysis and found the formal proof of Proposition A.5 difficult to follow. Could you provide a more intuitive explanation for why solving the Poisson equation (Eq. 3.2) yields the optimal control?
2. Line 269, "... where reweighting can be unstable": Aren't VCG and ECG without resampling still weighted with path-level weights? Those weights should still have high variance. I don't understand the rationale behind this (if you want to mitigate the path degeneracy from resampling, then you can consider using tempered weights for resampling, e.g., [1]).
3. Why were different $\gamma$ values used for Figures 1, 2, and 3?
4. What is the main computational bottleneck of the algorithm in practice? (Hutchinson's estimator, or solving the linear system?)
5. If $r$ is given by a large neural network, can this method still be used (in terms of memory and time complexity)?
6. Why didn't you consider the ALDP experiment (which is more multi-modal compared to LJ-13, to my knowledge)?
7. (Suggestion) Line 417, "from $T=1.0$ to $0.4$": it might be better to say "with $\gamma = 2.5$" to avoid any confusion.

References
[1] Choi, Sanghyeok, et al. "Reinforced sequential Monte Carlo for amortised sampling." arXiv preprint arXiv:2510.11711 (2025).

LLM usage disclosure: I used an LLM to check the grammar and make each sentence clearer.

EditLens Prediction: Fully human-written
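As generic background for the weight and path degeneracy discussed in this review, the sketch below shows standard SMC bookkeeping: log-weights accumulate along each path, and resampling is triggered when the effective sample size (ESS) collapses. `propagate` and `log_incremental_weight` are hypothetical placeholders, not functions from the paper or its (unreleased) code.

```python
# Generic SMC reweight-and-resample step with an ESS-based trigger.
import numpy as np

def ess(log_w):
    w = np.exp(log_w - log_w.max())                  # shift for numerical stability
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)                      # ESS = 1 / sum of squared normalized weights

def smc_step(particles, log_w, propagate, log_incremental_weight,
             rng=np.random.default_rng(), ess_threshold=0.5):
    particles = propagate(particles)                  # move particles with the proposal drift
    log_w = log_w + log_incremental_weight(particles) # Feynman-Kac-style weight update
    if ess(log_w) < ess_threshold * len(particles):   # resample when weights degenerate
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles, log_w = particles[idx], np.zeros(len(particles))
    return particles, log_w
```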
DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The authors propose a method for learning to sample from a tempered and/or tilted distribution, where the exponential of a reward function serves as the tilting factor. Importantly, they note that the Feynman–Kac PDE can be manipulated by adding an additional drift term that is compensated for in the reweighting factor, reducing the problem of sampling from the new target density to finding an optimal drift. This is an important direction of research, since reweighting typically relies on self-normalized importance sampling, which can lead to importance weights blowing up. Thus, finding a good additional drift function can reduce this problem by essentially reducing the variance of the importance weights. The authors further demonstrate the improved performance of their framework, which relies on efficiently finding the optimal control by solving a linear system of equations.

Strengths:
- The authors provide a rigorous and correct assessment of the setting, where existing methods rely on self-normalized importance sampling and therefore suffer from large variance of the importance weights and sensitivity to the number of particles used for integration.
- The experiments indicate that the proposed method leads to improved performance across a wide array of settings, substantiating its generalizability.

Weaknesses:
- A lot of the theory proposed in this work is actually already well known and not new, and the authors should have introduced it as such. In particular, Prop. 3.1 is the basis of [1-3], which highlight that the Feynman–Kac PDE can be manipulated with additional drifts as long as they are similarly compensated for in the reweighting.
- Is there a reason why the authors did not consider the more difficult setting of LJ-55 to evaluate their framework?
- I strongly urge the authors to provide comparisons to [2-3], which follow the same idea of learning a neural network model as an additional drift to reduce the variance of the importance weights. A clear comparative analysis between modeling the drift as a linear combination of basis functions and modeling it as a neural network would make this manuscript a lot better (an illustrative sketch of such an amortized drift appears after this review).
- In continuation of the previous point, it is further important to highlight the training and inference costs of the proposed method against [2,3]. While I agree that the proposed method only requires solving a system of linear equations, it requires this at all time points every time during inference. On the other hand, learning a time-conditioned neural network requires an upfront training cost but then only requires an additional forward pass at each inference step. Is solving this system of linear equations at every step cheaper than a forward pass at each step? Such an analysis is missing from this work.

[1] Skreta, Marta, et al. "Feynman-Kac correctors in diffusion: Annealing, guidance, and product of experts." arXiv preprint arXiv:2503.02819 (2025).
[2] Albergo, Michael S., and Eric Vanden-Eijnden. "NETS: A non-equilibrium transport sampler." arXiv preprint arXiv:2410.02711 (2024).
[3] Vargas, Francisco, et al. "Transport meets variational inference: Controlled Monte Carlo diffusions." arXiv preprint arXiv:2307.01050 (2023).

Questions:
- How do the authors evaluate their method on LJ-13 and the other benchmarks? How are the ground-truth samples obtained to compute metrics such as MMD?
- It would be good to provide some standard metrics like the 2-Wasserstein distance.
- What is the reward-tilting used in LJ-13, and what is the motivation behind the specific choice of tilting applied?
- Why do the authors not compare against reward-based diffusion fine-tuning tasks and methods, e.g., [4-6]?

[4] Venkatraman, Siddarth, et al. "Amortizing intractable inference in diffusion models for vision, language, and control." Advances in Neural Information Processing Systems 37 (2024): 76080-76114.
[5] Domingo-Enrich, Carles, et al. "Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control." arXiv preprint arXiv:2409.08861 (2024).
[6] Fan, Ying, et al. "DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models." Advances in Neural Information Processing Systems 36 (2023): 79858-79885.

EditLens Prediction: Fully human-written
DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models

Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper extends previous work on adapting diffusion models at inference time to new tasks using Sequential Monte Carlo (SMC) methods, focusing on reward-tilting and sampling from an annealed version of the base model distribution. In particular, they target the problem of sampling from the sequence of distributions $q_t(x) = p_t(x)^\gamma \exp(r_t(x))$, where $p_t(x)$ is the original diffusion model distribution and $r_t(x)$ is some time-varying reward function, such that at $t=0$ we approach a predefined clean data distribution $p_0(x)^\gamma \exp(r(x))$. Following previous work, $N$ particles are sampled concurrently for each $t$, and the diffusion model is used to move these particles towards lower $t$ in the $q_t(x)$ chain. The resulting proposal particles are reweighted using Feynman–Kac-type equations and periodically resampled, enabling the particle ensemble to approximate the sequence of target distributions. The new idea in the paper is to analyze the Feynman–Kac-type equations and notice that there is additional freedom in choosing the proposal distributions: an additional term can be added to the drift in the Feynman–Kac equations if the time evolution of the weights is changed correspondingly. The authors then show that there exists an optimal added drift, which is the gradient of a scalar potential that satisfies a particular PDE. With this added design space, they propose two choices for the added drift: 1) minimizing the variance in the change of the weights; 2) minimizing a Lagrangian of the aforementioned PDE such that, at the minimum, the optimal drift is obtained. In practice, they parameterize the added drift as a linear combination of $\nabla r_t(x)$, $\nabla \log p_t(x)$, and the original diffusion drift $u_t(x)$ for the variance-minimizing version, and a similar choice for the PDE-optimizing version. This results in a simple $3 \times 3$ system of equations per particle per step, yielding a relatively lightweight correction, although evaluating the required matrices requires backpropagation through the diffusion denoiser. The authors evaluate the method on a 30-dimensional Gaussian Mixture Model, two toy particle systems, and a protein-ligand co-folding task. They observe that the method consistently improves effective sample size and reduces weight variance, outperforming the naive SMC baseline. On the protein-ligand task, they demonstrate that, using physical energies as the rewards, the method can improve the physical validity of the generated samples.

Strengths:
- The core idea of defining a correction to the proposal distribution that directly optimizes the variance or the consistency with the target distributions is principled and seems to be novel, at least in the diffusion context. It seems to be quite simple to implement, does not cause a huge overhead, and seems to legitimately improve the performance of the SMC method considerably on the tasks chosen.
- The experiments on the GMM, DW-4, and LJ-13 are quite thorough and well presented, and the protein-ligand task is an example of a realistic problem that the method could help with; the method clearly works better than more naive SMC approaches.
- Overall, the paper is well written and not too difficult to follow, with the caveat that the mathematical exposition may be difficult to follow for readers who are not already familiar with this way of describing SMC methods.

Weaknesses:
Disclaimer: I am not an expert on SMC or SMC-based methods for diffusion models, and as such it is possible that I am missing some details, wider context, or some previous literature. As such, I am open to being corrected.
- The method incurs a significant wall-clock-time cost compared to the baselines, 2.3x-6x in the experiments for which the wall-clock time was reported. This is not mentioned in the limitations or in the main paper, however, and the conclusion instead states that the method causes minimal computational overhead, potentially misleading the reader or a reviewer who does not have the time to read through the Appendix. Backpropagation through the denoiser at each step requires much more memory than the forward pass, and my assumption is that it may limit the batch size in practice.
- As such, it might be more fair to compare to the baselines by adjusting the inference hyperparameters such that the wall-clock time is equal. E.g., I would expect that the G-SMC method improves with more particles, and potentially with more diffusion steps.
- Following that, I would be interested in seeing the ESS and wall-clock time comparisons for the protein-ligand task as well, especially since this seems to be an example of a higher-dimensional case. As an alternative, it would also be interesting to see how the method scales, e.g., to the LJ-55 task or some other task that is more than about 30-dimensional.
- Do the authors agree that the method could be described as a "twisted" proposal for SMC? Further, even the initial choice of using the modified diffusion reverse process would simply be another choice of a twisted proposal. My understanding is that this is a standard idea in the Sequential Monte Carlo literature, although the basis-function scheme and the variance minimization are novel, at least in the diffusion context, to my knowledge. The problem is that this seems to make the first contribution of 'identifying a fundamental degree of freedom in the Feynman-Kac type FP equation' less novel than it sounds, but I am open to being corrected on this. In any case, I think that positioning the method in its proper wider SMC context would be helpful for correctly contextualizing the paper.
- In the protein-ligand task, would another simple baseline be to take the base model samples and optimize them directly using the gradient of the energy function? Does the method outperform this?
- The paper [1] seems to be a relevant prior work on annealed distributions with diffusion models with SMC (and is the first paper to do this?). It may be that the proposed method is better than this, but I think it should be compared to, at least in some context, to clearly show this.
- Further, the paper [2] does reward-tilted sampling with SMC by using a different twisted proposal based on the $p(y|x_0(x_t))$ gradient. It would be good to compare to that as well in the reward-guided experiments.

References
[1] Thornton, James, et al. "Composition and Control with Distilled Energy Diffusion Models and Sequential Monte Carlo." The 28th International Conference on Artificial Intelligence and Statistics (2025).
[2] Wu, Luhuan, et al. "Practical and Asymptotically Exact Conditional Sampling in Diffusion Models." 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Questions:
- Although not considered in the previous literature on SMC, it seems that another simple baseline would be to not use SMC and instead interleave Langevin dynamics steps after each generative step, similarly to [1], such that the wall-clock time is matched with the proposed method. I would be curious to see how well this could work on some task, although this is not a priority (a sketch of this baseline appears after this review).

Overall, I will start out with a marginal reject due to the concerns raised in the weaknesses, but am open to changing the score during the rebuttal. I think the core idea in the paper is interesting, mostly seems to work well, and could potentially be extended further. As such, I am not initially strongly against accepting the paper, assuming the weaknesses are cleared up or I am incorrect about them.

References
[1] Du, Yilun, et al. "Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC." International Conference on Machine Learning (2023).

EditLens Prediction: Fully human-written