Reflective Flow Sampling Enhancement
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
**Summary:**

The authors propose RF-Sampling, a training-free sampling method that enhances sample quality in flow models. The method extends Z-Sampling, which alternates between denoising and inversion steps with controlled CFG; instead of manipulating the CFG scale, RF-Sampling interpolates between text embeddings and repeatedly applies a reflection loop. This design allows it to support CFG-distilled models such as FLUX-dev and enables test-time scaling. Experimental results demonstrate that RF-Sampling outperforms various existing sampling baselines across multiple image generation benchmarks. Additionally, an ablation study on hyperparameters and further applications to video generation and image editing models validate the effectiveness of the proposed method.
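For concreteness, the reflection loop as I understand it can be sketched as follows (a minimal sketch assuming a plain Euler ODE solver; `c_high`, `c_low`, and `c_std` stand for the amplified, attenuated, and standard text embeddings, and every name here is illustrative rather than the authors' actual API):

```python
def euler_step(model, z, t, dt, cond):
    """One Euler step of the flow ODE, where model(z, t, cond) predicts velocity."""
    return z + dt * model(z, t, cond)

def reflective_step(model, z, t, dt, c_high, c_low, c_std):
    # 1) Denoise one step with the semantically amplified embedding.
    z_dn = euler_step(model, z, t, dt, c_high)
    # 2) Invert that step with the attenuated embedding, pulling the noisy
    #    latent toward the prompt's semantic prior.
    z_inv = euler_step(model, z_dn, t + dt, -dt, c_low)
    # 3) Take the standard denoising step from the refined latent.
    return euler_step(model, z_inv, t, dt, c_std)
```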
**Strengths:**

**S1.** It is interesting that the proposed sampling method also supports some of the CFG-distilled models that have become commonly used recently.
**S2.** The paper demonstrates the effectiveness of the method across various tasks and models, with experiments conducted under multiple settings, e.g., in combination with acceleration methods.
**Weaknesses:**

**W1.** Although the paper proposes an improved training-free sampling method, it lacks theoretical justification or analysis explaining why it works. A theoretical explanation is necessary to clarify the meaning and role of the temporal embedding interpolation, the hyperparameters $\alpha, \gamma$, and the overall mechanism underlying the method.
**W2.** The improvement in image generation quality appears to be only marginal, and the effect of inference-time scaling shown in Figure 2 does not seem significant. A comparison with the inference-time scaling paper [1] would be necessary to clearly demonstrate the effectiveness of the proposed method.
[1] Ma et al., Scaling Inference Time Compute for Diffusion Models, CVPR 2025.
**W3.** The method involves several hyperparameters, which appear to have been selected manually. As shown in the ablation study, the method seems quite sensitive to these choices.
**W4.** Line 232 states, “We then take one step of the ODE solver,” but the corresponding Equation 3 indicates multiple ODE steps with $\sigma$, which could cause confusion.
If my concerns are addressed, I would be happy to reconsider the score.
**Questions:**

**Q1.** In Tables 1 and 2, how is the inference process configured for RF-Sampling? Is it run with the same sampling time as the standard setting? For a fair comparison, it would be necessary to report inference time.
**Q2.** Although FLUX is a CFG-distilled model, it still includes a CFG-scale input condition. Would it be possible to apply this sampling method by adjusting the CFG scale through that input?

Fully human-written
Reflective Flow Sampling Enhancement
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
**Summary:**

The paper introduces RF-Sampling, a new inference-time sampling technique that enables trading off computational cost and output quality. The authors take inspiration from Z-Sampling, which introduces a process where a denoising step with a high guidance scale is followed by an inversion step with a low guidance scale to better align the noise with the desired semantics before proceeding with standard denoising. They adapt this idea for CFG-distilled flow matching models and propose a similar approach that applies CFG-like guidance directly on the text embeddings instead. The method is comprehensively evaluated across several text-to-image, text-to-video, and image-editing models, showing improved performance compared to prior techniques.
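For reference, "CFG-like guidance directly on the text embeddings" plausibly takes a form like the following (my assumption for exposition, not the paper's verified equation):

```python
def guided_embedding(c_text, c_neutral, s):
    """CFG-style extrapolation in embedding space: s > 1 amplifies the
    prompt semantics, 0 < s < 1 attenuates them. Illustrative only; the
    paper's c_mix / c_w construction may differ in detail."""
    return c_neutral + s * (c_text - c_neutral)
```

A Z-Sampling-style loop would then denoise with a large `s` and invert with a small one, mirroring the high and low guidance scales of the original method.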
**Strengths:**

- The paper is very well written and easy to follow.
- The proposed algorithm is simple to understand and easy to implement.
- The experiment suite is quite extensive with an impressive number of models, benchmarks, and ablations.
**Weaknesses:**

- Since the proposed algorithm can be viewed as an inference-time scaling method, including a simple Best-of-$N$ baseline in the experiments would help better position the paper within the literature (a minimal sketch of such a baseline follows the reference below). For instance, according to Figure 17, performing RF-Sampling at all steps appears to yield the best results. Under the $\alpha = 1$ setting used in the experiments, RF-Sampling requires three times as many forward passes as generating a single sample from the base model. Therefore, comparing this approach against a Best-of-3 baseline from the base model would be a useful addition.
- The paper argues that Z-Sampling cannot be directly applied to CFG-distilled models such as FLUX. However, it is unclear why this would be the case, since these models typically distill a range of guidance values during training. As a result, one could, in principle, perform a variant of Z-Sampling by simply changing the guidance scale values during the denoising and inversion steps.
- Following the previous point, Section 3.2 (Lines 203–205) states: “Flow models are typically trained only under conditional settings (Labs, 2024; Daniel Verdú, 2024). As a result, directly using CFG techniques or adopting an empty-text embedding as guidance for flow models is inappropriate.” According to the cited literature [1], this claim is inaccurate. To obtain a guidance-distilled model such as FLUX-dev, a standard flow matching model is first trained with text embedding dropout to enable guidance. This model is then CFG-distilled into a student that takes the guidance scale as input and outputs the guided velocity in a single forward pass.
[1] Meng, Chenlin, et al. "On Distillation of Guided Diffusion Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
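A minimal sketch of the suggested baseline (`sample_fn` and `score_fn` are hypothetical stand-ins for the base model's sampler and an automatic scorer such as a preference model):

```python
def best_of_n(sample_fn, score_fn, prompt, n=3):
    """Draw n independent samples and keep the highest-scoring one; n = 3
    matches RF-Sampling's forward-pass budget at alpha = 1 as estimated above."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=score_fn)
```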
**Questions:**

1. Could the authors clarify why Z-Sampling cannot be applied to FLUX-dev?

Fully human-written
Reflective Flow Sampling Enhancement
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
**Summary:**

This paper introduces Reflective Flow Sampling (RF-Sampling), a training-free inference-time enhancement approach for text-to-image (T2I) generation using flow-based generative models, particularly those that are CFG-distilled (such as FLUX). The method leverages an interpolation of text embeddings and a three-stage inference loop (composed of high-weight denoising, low-weight inversion, and standard denoising) to guide the generative process toward better semantic alignment with text prompts. RF-Sampling is demonstrated, through extensive experiments on multiple T2I and related tasks, to yield significant improvements in both generation quality and prompt fidelity compared to standard sampling and several baseline enhancement strategies, particularly where conventional diffusion-based techniques are inapplicable to flow models.
**Strengths:**

1. The work clearly identifies a real limitation in the applicability of inference-time enhancement techniques to flow-matching-based text-to-image models, a rising class of efficient generative models that are not well served by prior methods.
2. The approach is mathematically formalized, with explicit equations describing the reflective sampling mechanism (see Eqs. for staged denoising/inversion, Section 3.3), and its integration with flow-based ODE solvers is well-articulated. The staged loop and embedding interpolation are presented in sufficient detail, including the merge and amplification parameters.
3. Experiments are thorough and span a large suite of benchmarks, including HPDv2, Pick-a-Pic, DrawBench, GenEval, T2I-Compbench, and evaluations on video and image editing tasks.
**Weaknesses:**

1. While the reflective mechanism is motivated by semantic accumulation in latent space and interpolation of embeddings, the core reason why the three-stage loop (particularly the low-guidance inversion step) should regularize generation toward prompt-faithful images remains largely empirical. The mathematical foundations of convergence or guarantees (e.g., what class of distributions is targeted, what properties are preserved or enhanced during the reflective step) are not rigorously analyzed in Section 3.3 or elsewhere. This limits reproducibility and makes the method feel heuristic.
2. In Section 3.2, the equations for embedding mixing and semantic amplification ($c_{\text{mix}}$ and $c_w$) are presented, but their concrete integration into each ODE step is scattered and not fully formalized. For example, the equations for the inversion step leave it ambiguous whether $c_{\text{mix}}$ is recomputed at every time step and how the standard scale $w$ interacts with $s$ and $\beta$; a sketch of one possible reading follows this list. This may impede direct implementation from the text.
3. While Figures 7 and 8 and the corresponding ablation analyses add value, the scope is restricted to interpolation and amplification parameter sweeps. There are no ablations studying the impact of each stage of the loop independently (e.g., what happens if the low-guidance inversion/reflection is omitted entirely, or replaced with a linear or simpler heuristic?), nor are qualitative failure cases or negative results provided. The efficiency and scaling comparisons, though favorable, would be strengthened by a more detailed breakdown versus parameter count, steps, or compute time.
4. While Table 2 shows superior scores for RF-Sampling, in some settings the improvements over standard sampling are marginal (see FLUX-Dev AES on DrawBench: 6.1866 vs. 6.1459), raising questions about practical significance in certain operational regimes.
5. The UMAP analysis in Figure 6 purports to show that RF-Sampling trajectories align better with the true data distribution. However, there is little discussion on how this alignment concretely translates to improved perceptual or semantic outcomes, or whether it is artifactually driven by the chosen projection or data statistics.
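To make the ambiguity in point 2 concrete, here is one possible reading; the mixing and amplification formulas are placeholders, not reconstructions of the paper's definitions:

```python
def sample_with_mixing(model, z, ts, c_text, c_neg, gamma, s):
    """Euler sampling where the mixed embedding is rebuilt at every step.
    gamma(t) stands for a possibly time-dependent merge ratio; whether
    c_mix is recomputed per step (as here) or fixed once before the loop
    is precisely what Section 3.2 leaves unclear."""
    for t, t_next in zip(ts[:-1], ts[1:]):
        c_mix = gamma(t) * c_text + (1 - gamma(t)) * c_neg  # placeholder mixing rule
        c_w = c_mix + s * (c_text - c_mix)                  # placeholder amplification
        z = z + (t_next - t) * model(z, t, c_w)
    return z
```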
**Questions:**

1. Can the authors provide rigorous analysis (not only empirical evidence) for why the three-stage reflective loop in RF-Sampling leads to better semantic alignment or image quality than direct high-weight denoising? For example, can theoretical guarantees or explanations be offered for convergence, robustness, or generalization?
2. The mathematical integration of the $\beta$, $s$, and merge ratio $\gamma$ parameters into the flow equation steps could be made more explicit, ideally via algorithmic pseudocode in the main text. Would the authors include this in a revision?
3. Can the authors elaborate on what prevents adapting state-of-the-art diffusion-based inference enhancement methods (e.g., Z-Sampling, W2SD) to flow models such as FLUX? Are failure cases due to model architecture, incompatible objective/loss, or something else?
4. Is there a detailed efficiency breakdown (step counts, FLOPs, wall time) for RF-Sampling vs. standard sampling (and possible alternatives) beyond the high-level graphs (Fig. 2)?
5. Is the improvement in Table 2 statistically significant across multiple seeds/runs, or is it within experimental noise in lower-difference settings?

Fully AI-generated
Reflective Flow Sampling Enhancement
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
**Summary:**

This paper proposes RF-Sampling, a method whose core idea, for flow models like Flux.Dev, is to denoise during inference under text embeddings with higher semantic intensity and then perform inversion under text embeddings with lower semantic intensity. This process yields a noise latent that better aligns with the prior of the text prompt, thereby improving image generation quality. The authors conduct experiments combining different flow models with various sampling enhancement techniques, demonstrating the effectiveness of RF-Sampling.
**Strengths:**

1. The paper contains a rich set of experiments, with extensive tests on flow models for different tasks (including flux.dev and flux.lite for text-to-image generation and Wan2.1 for video generation), verifying the versatility of RF-Sampling;
2. Exploring inference-time enhancement strategies for CFG-distilled flow models like flux.1-dev is promising;
3. The paper is well-written and easy to read.
**Weaknesses:**

1. CFG-distilled flow models, such as flux.dev, typically still allow specifying the CFG scale at inference time to produce outputs at varying guidance strengths (often by modulating the latent via AdaLN). Given this, Z-Sampling should, in principle, be applicable to flux.dev; a sketch of such a variant is given after this list. Why does the paper not compare the method against a Z-Sampling variant adapted for this model? Theoretically, the output of a perfectly distilled model at cfg_scale=1.0 should be identical to the output of its non-distilled counterpart conditioned on an empty-text prompt. This suggests that the paper's construction of $c_{mix}$ and $c_{w}$ might be unnecessary, as Z-Sampling could likely be migrated to flux.dev with appropriate modifications.
2. The image quality metrics used in the paper, such as ImageReward and HPSv2, primarily reflect human preferences. Such metrics tend to emphasize aesthetics and prompt fidelity but may not adequately assess the diversity or realism of the generated images. I am concerned that the proposed RF-Sampling, by weighting text embeddings and latents during the sampling process, might shift the model's inputs away from their original prior distribution. This could potentially lead to a decrease in image diversity or the introduction of visual artifacts. Using a metric such as FID, which directly assesses realism and diversity, would perhaps be more appropriate.
3. For a standard 28-step sampling process, each step of RF-Sampling appears to require three model forward computations (forward, inversion, and re-forward). This effectively triples the computational cost at a fixed number of steps. Consequently, the comparisons in Tables 1 and 2 may be unfair. The results for RF-Sampling should be compared against a baseline standard sampler using three times the number of inference steps (e.g., $28 \times 3 = 84$ steps). While Figure 2 suggests RF-Sampling also performs better under an equivalent inference time, the specific experimental settings (e.g., the number of steps for the baseline) are not detailed, raising concerns about the fairness of this comparison.
4. The different starting points for the standard sampling and RF-Sampling curves in Figure 2 indicate that RF-Sampling incurs a significant initial overhead or increase in inference time. A more relevant comparison would be: using the *total* time taken by RF-Sampling, how does it compare to a standard sampling baseline that generates multiple candidates ('best-of-N') and selects the one with the highest metric score?
5. ICLR policy requires the appendix to be included in the same PDF as the main paper. The paper has incorrectly placed the appendix in the separate supplementary materials.
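A sketch of the variant envisioned in point 1, assuming (hypothetically) a distilled model that exposes the guidance scale as an input; the `guidance` keyword and step conventions are illustrative, not FLUX-dev's actual interface:

```python
def z_sampling_reflection(model, z, t, dt, c_text, w_high=7.0, w_low=1.0):
    """One Z-Sampling-style reflection using only the distilled model's
    guidance-scale input, with no embedding mixing at all."""
    # Denoise one step at a high guidance scale.
    z_dn = z + dt * model(z, t, c_text, guidance=w_high)
    # Invert that step at a low guidance scale to refine the noise latent.
    z_inv = z_dn - dt * model(z_dn, t + dt, c_text, guidance=w_low)
    return z_inv
```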
I am willing to revise my score if any of these points are based on a misunderstanding of the proposed method.
**Questions:**

See Weaknesses.
Fully human-written |