ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 10.00 | 4.00 | 2134 |
| Lightly AI-edited | 2 (50%) | 7.00 | 4.00 | 1634 |
| Fully human-written | 1 (25%) | 6.00 | 4.00 | 1947 |
| Total | 4 (100%) | 7.50 | 4.00 | 1838 |
Title: TEMPFLOW-GRPO: WHEN TIMING MATTERS FOR GRPO IN FLOW MODELS
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 10: strong accept, should be highlighted at the conference
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes TempFlow-GRPO, a framework that makes the optimization process temporally aware in order to address the key limitation of temporal uniformity in previous RLHF works. The paper introduces a mixture of ODE and SDE sampling, along with a noise-aware policy weighting scheme, to balance exploration and reward exploitation. Experiments demonstrate that TempFlow-GRPO achieves state-of-the-art performance, yielding higher rewards than standard GRPO approaches.

Strengths:
- The paper pinpoints temporal uniformity as the primary limitation of existing flow-based GRPO methods and proposes TempFlow-GRPO to solve it with precise credit assignment and noise-aware optimization. The authors demonstrate this non-uniformity well with empirical evidence from rewards, supporting the need for temporal information.
- The paper introduces the core mechanisms of trajectory branching and noise-aware reweighting to create temporally structured policies that respect the dynamics of the generative process. The authors also provide a theoretical justification from the policy gradient perspective, further supporting the use of noise-aware reweighting.
- The proposed TempFlow-GRPO achieves state-of-the-art performance compared to the existing vanilla GRPO approach, demonstrating the effectiveness of the method. The authors also include comprehensive ablation studies to better understand the dynamics of the model.

Weaknesses:
- The computational cost, as thoroughly analyzed in Appendix A.6, will be higher than that of vanilla GRPO models due to the branching process. Nonetheless, this is more of a trade-off between quality and time, given the superior quality metrics.

Questions:
- How is the performance affected by the number of branches (K) at each step, the specific timesteps chosen for branching, or the exact function used for noise-aware weighting? The ablation study (Fig. 8) shows that the 4x6 (seed x branch) configuration was chosen, but it is unclear how much tuning is required to find the optimal setup for a new model or dataset. A discussion of how to choose these hyperparameters would be useful for general applications of the proposed framework.

EditLens Prediction: Moderately AI-edited
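[Editor's note] To make the noise-aware policy weighting discussed in this review concrete, below is a minimal PyTorch sketch of a weighted GRPO surrogate loss. The function name, the sigma-proportional normalization, and the tensor layout are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def noise_aware_grpo_loss(log_ratio, advantage, sigma, eps=1e-8):
    """Minimal sketch of a noise-aware weighted GRPO surrogate loss.

    log_ratio : (B, T) tensor, log pi_theta / pi_old at each denoising step
    advantage : (B,)   tensor, group-normalized terminal reward per sample
    sigma     : (T,)   tensor, per-step noise scale of the SDE sampler

    Steps with a larger noise scale (early, exploration-heavy steps) receive
    a larger weight and dominate the policy update; low-noise refinement
    steps are down-weighted. The exact weighting function used by
    TempFlow-GRPO may differ from this simple sigma-proportional choice.
    """
    weights = sigma / (sigma.sum() + eps)        # (T,) noise-proportional weights
    ratio = log_ratio.exp()                      # per-step importance ratios
    per_step = ratio * advantage.unsqueeze(1)    # (B, T) surrogate terms
    return -(per_step * weights.unsqueeze(0)).sum(dim=1).mean()
```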
Title: TEMPFLOW-GRPO: WHEN TIMING MATTERS FOR GRPO IN FLOW MODELS
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper addresses the sparse terminal reward and uniform credit assignment problem in GRPO training of flow models. The authors propose TempFlow-GRPO, which includes: (1) Trajectory Branching, where a single SDE step is taken at timestep k; (2) Noise-Aware Policy Weighting, which reweights updates according to the noise level; and (3) a seed group strategy. The method achieves state-of-the-art performance on human preference alignment and text-to-image benchmarks.

Strengths:
1. The authors astutely identify that the Flow-GRPO algorithm treats all timesteps equally, and tackle this issue via single-timestep SDE optimization.
2. The noise reweighting method is shown to be effective through both solid theoretical analysis and experimental results.
3. The paper is generally well written with a clear logical structure.

Weaknesses:
1. The contribution of the seed group strategy is relatively small compared to the other parts of the work, and the paper should provide additional details of this strategy.
2. Similarly, MixGRPO [1] proposes a training window of SDE timesteps that also tackles the issue of treating all timesteps equally, yet the discussion comparing against MixGRPO is limited.
3. The paper does not discuss the phenomenon of reward hacking, which is an inevitable problem for GRPO methods.

[1] MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Questions:
1. The trajectory branching mechanism appears similar to MixGRPO restricted to a single-timestep window. How do their efficiency and effectiveness compare?
2. The paper claims that Flow-GRPO (Prompt) is an improved baseline with group standard deviation stabilization, but does not provide much detail. Could the authors elaborate on this improved method?
3. Why are the PickScore curves on the left of Figure 3 inconsistent between training steps and GPU hours?
4. Compared to Flow-GRPO, the visual text rendering experiment is not included. How well does TempFlow-GRPO perform on that particular task?

EditLens Prediction: Fully human-written
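[Editor's note] The single-timestep SDE branching that this review and the previous one describe can be sketched as follows. This is an illustrative outline only: `ode_step`, `sde_step`, and `reward_fn` are placeholder callables standing in for the flow model's deterministic update, its stochastic update, and the reward model, and the real branching schedule in the paper may differ.

```python
import torch

@torch.no_grad()
def rollout_with_branching(x, timesteps, k, num_branches, ode_step, sde_step, reward_fn):
    """Minimal sketch of single-timestep SDE trajectory branching.

    The trajectory is denoised deterministically (ODE) up to index k, a single
    stochastic SDE step spawns `num_branches` exploratory branches, and each
    branch is then continued deterministically to the terminal image and
    scored. Because step k is the only source of randomness, differences in
    the branch rewards can be attributed to that intermediate decision.
    """
    # Shared deterministic prefix, identical for every branch.
    for t in timesteps[:k]:
        x = ode_step(x, t)

    branches, rewards = [], []
    for _ in range(num_branches):
        x_b = sde_step(x, timesteps[k])      # the single exploratory SDE step
        for t in timesteps[k + 1:]:
            x_b = ode_step(x_b, t)           # deterministic continuation
        branches.append(x_b)
        rewards.append(float(reward_fn(x_b)))
    return branches, torch.tensor(rewards)
```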
Title: TEMPFLOW-GRPO: WHEN TIMING MATTERS FOR GRPO IN FLOW MODELS
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
TempFlow-GRPO is a temporally aware reinforcement learning framework for flow matching models that improves human preference alignment by introducing trajectory branching, noise-aware weighting, and seed grouping to achieve precise credit assignment and efficient optimization across timesteps.

Strengths:
- For reinforcement learning tasks, dense rewards are crucial for effective credit assignment. The proposed Trajectory Branching mechanism provides an elegant and effective way to obtain dense rewards along the denoising trajectory.
- The introduced reweighting mechanism offers a valuable analysis of how gradients evolve across steps in baseline algorithms and presents a solution to mitigate the identified issues.

Weaknesses:
- The proposed method involves numerous ODE denoising steps, which substantially increase computational overhead. However, the paper lacks a comparison against the baseline method using training time as the horizontal axis to illustrate efficiency trade-offs.
- The authors should evaluate the performance of the reweighting mechanism under different $\sigma_t$ schedulers rather than relying solely on the one used in Flow-GRPO, to examine how the choice of scheduler influences its effectiveness. It remains unclear whether simply setting the coefficients of the early steps to 1 would yield good results under different schedulers.

Questions:
- The comparison between batch std and global std is only evaluated on PickScore. How does this observation generalize to other tasks?
- Can the proposed reweighting mechanism be applied to hybrid variants (FlowGRPO-Fast/MixGRPO) where only a subset of steps follows an SDE formulation?

EditLens Prediction: Lightly AI-edited
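[Editor's note] The batch-std vs. global-std choice raised in this review concerns how branch rewards are normalized into group-relative advantages. A minimal, assumed sketch of that normalization is below; the function name, the optional `global_std` argument, and the defaults are illustrative, not taken from the paper.

```python
import torch

def group_relative_advantage(rewards, global_std=None, eps=1e-6):
    """Minimal sketch of GRPO-style advantage normalization for a branch group.

    rewards    : (K,) rewards of the K branches spawned from one state.
    global_std : optional scalar; if given, rewards are scaled by a globally
                 tracked std instead of the per-group (batch) std, mirroring
                 the batch-std vs. global-std choice discussed above.
    """
    centered = rewards - rewards.mean()
    std = rewards.std() if global_std is None else global_std
    return centered / (std + eps)
```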
Title: TEMPFLOW-GRPO: WHEN TIMING MATTERS FOR GRPO IN FLOW MODELS
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper presents TempFlow-GRPO, a new reinforcement learning framework that addresses the limitation of uniform credit assignment across timesteps. The method introduces trajectory branching, which switches from ODE to SDE sampling at selected timesteps to generate exploratory branches and assigns their rewards to intermediate states. The paper further proposes noise-aware policy weighting, prioritizing optimization at high-noise early stages over low-noise refinement phases. Experiments show that TempFlow-GRPO achieves substantially improved efficiency and final performance compared to the baselines.

Strengths:
- The paper is overall well written and easy to follow.
- The motivation and the proposed method are clear and straightforward: the work addresses the temporal inhomogeneity and credit assignment problems through intermediate resampling for intermediate value estimation and noise-aware reweighting.
- The proposed method shows strong empirical performance in both efficiency and end-level performance, with comparisons that include GPU time.

Weaknesses:
- Theorem 1 is intuitively reasonable, but labeling it as a theorem feels overstated since the underlying assumptions and proof sketch are insufficiently formalized. The analytical depth is also somewhat limited.
- The explanation around line 847 (regarding why the average number of branches is 4.5x when K = 10) is unclear. It is not obvious how this factor arises or how the branching schedule operates, and the paper does not explicitly describe it.
- Adding more algorithmic details or pseudocode would improve readability and make the proposed procedure easier to follow.

Questions:
See weaknesses.

EditLens Prediction: Lightly AI-edited