ICLR 2026 - Reviews

Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 4.00 | 3.00 | 4719 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 4.00 | 4.00 | 1735 |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 2 (50%) | 6.00 | 3.00 | 2659 |
| Total | 4 (100%) | 5.00 | 3.25 | 2943 |
Individual Reviews
**Finetuning-free Alignment of Diffusion Model for Text-to-Image Generation**

**Ratings:** Soundness: 2: fair. Presentation: 2: fair. Contribution: 2: fair. Rating: 4: marginally below the acceptance threshold. Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

**Summary:** This paper introduces a finetuning-free, plug-and-play alignment strategy for text-to-image diffusion models by casting alignment as sampling from a reward-weighted distribution. The authors analyze the challenges of existing guidance-based alignment schemes, particularly the emergence of adversarial artifacts, and propose a novel regularization that stabilizes the guidance signal. The method is evaluated on established text-to-image benchmarks, achieves strong alignment with human preferences using a lightweight guidance network, and demonstrates substantial computational savings over finetuning-based approaches.

**Strengths:**
- **Formulation innovation:** The paper reframes text-to-image alignment as direct sampling from a reward-weighted distribution, moving away from common parameter fine-tuning approaches and offering a generic plug-and-play control mechanism.
- **Practical estimator for the guidance:** The paper adopts a simple regression trick (Eq. 13) to approximate the conditional expectation and then converts it into a guidance gradient (Eq. 14), avoiding expensive backpropagation through the sampler.
- **Stabilization against adversarial guidance:** The instability of naïve guidance at increasing strength is documented and addressed via the consistency regularizer incorporated in Eq. 16.
- **Lightweight and fast:** The guidance network is only ~72 MB and reuses the reference model's VAE, tokenizer, and text encoder. Combined with SDXL-Turbo, this enables effective **one-step** generation.
- **Reward-agnostic:** The method supports both differentiable and non-differentiable reward settings, with Table 4 in the appendix demonstrating applicability on GenEval with binary rewards.

**Weaknesses:**
1. **Depth of analysis of the regularization:** While the regularization is empirically justified and its effect visualized (see Figure 2), its theoretical underpinnings and limits are not fully elucidated. Which modes of artifact are suppressed, and does the regularization always guarantee avoidance of adversarial guidance? The practical selection of the regularization hyperparameter $\eta$ (Eqs. 13 and 15) also remains ad hoc.
2. **Reward dependence and generality:** Although the proposed scheme is reward-agnostic in form, its empirical evaluation, especially in Table 2, is predominantly based on PickScore and similar human-preference proxies. It is unclear how robust the approach is to poorly calibrated, biased, or low-signal rewards. The only demonstration on non-differentiable rewards is Table 4 (GenEval), which is limited in scope and size.
3. **Narrow scope of generalization:** Experiments are concentrated on SDXL-Turbo; the paper asserts model-agnosticism and one-step benefits but offers limited cross-backbone verification or stress tests under distribution shift.
4. **Hyperparameter sensitivity:** The proposed method claims to "fix" the problem of carefully tuning the guidance strength, yet practical recipes or robustness studies for the guidance strength, the regularization weight, or the hyperparameter $\beta$ are lacking.
5. **Light comparison to very recent alignment methods:** Table 2 includes Tweedie/Backprop and two finetuning methods (Diffusion-DPO, SPO), but a broader slate of strong alignment-related methods (with best-practice configurations) would better establish the relative advantage.
6. **Not strictly finetuning-free:** Please refer to the precise definition of finetuning-free; the scenario described in this paper can at best be considered "no base-model fine-tuning".

**Questions:**
1. **Regularization mechanics:** Can the authors provide more intuition on how the proposed regularization term shapes the guidance network's landscape? Are there scenarios or reward functions where this regularization might fail or even worsen adversarial behavior?
2. **Sensitivity analysis:** How does performance vary with $\eta$ and $\beta$? Please provide curves (PickScore/HPSv2/ImageReward/Aesthetic vs. $\eta$, $\beta$) and report variance across seeds.
3. **Extension to other backbones:** Have you tested non-Turbo SDXL or SD 2.1 latent backbones, or text-conditional DiT variants (such as FLUX)? Are there empirical results or qualitative observations on data-distribution shifts not covered by the current benchmarks?
4. **Robustness to reward misspecification:** Beyond GenEval's binary reward, how does the method fare under noisy, sparse, or biased rewards? Can the guidance network overfit to reward artifacts, and how would the regularizer respond?
5. **Comparison with other alignment-related work:** What are the practical and theoretical boundaries relative to other plug-and-play or inference-time guidance alignment methods? This would help substantiate the method's effectiveness and superiority.

**EditLens Prediction:** Fully AI-generated
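The "practical estimator for the guidance" strength above describes taking the gradient of a small learned guidance network rather than backpropagating a reward model through the sampler. A minimal sketch of what such an inference-time step could look like is given below; `denoiser`, `guidance_net`, `guided_step`, and `guidance_scale` are hypothetical names for illustration, not the paper's interfaces, and the sign/scaling convention depends on the sampler's parameterization.

```python
import torch

# Minimal sketch of an inference-time guided sampling step that uses a small
# learned guidance network instead of backpropagating a reward model through
# the sampler. `denoiser`, `guidance_net`, and `guidance_scale` are
# hypothetical stand-ins for illustration, not the paper's actual code.
@torch.no_grad()
def guided_step(denoiser, guidance_net, x_t, t, cond, guidance_scale=1.0):
    # Noise/score prediction from the frozen base diffusion model.
    eps = denoiser(x_t, t, cond)

    # Gradient of the lightweight guidance network w.r.t. the noisy latent.
    # The network is assumed to have been trained by regression to predict a
    # reward-related target, so its input gradient serves as the guidance term.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        reward_estimate = guidance_net(x_in, t, cond).sum()
        grad = torch.autograd.grad(reward_estimate, x_in)[0]

    # Shift the noise prediction along the guidance direction; the sign and
    # scaling convention depend on the sampler's parameterization.
    return eps - guidance_scale * grad
```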
**Finetuning-free Alignment of Diffusion Model for Text-to-Image Generation**

**Ratings:** Soundness: 3: good. Presentation: 3: good. Contribution: 3: good. Rating: 8: accept, good paper. Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

**Summary:** This paper proposes a finetuning-free alignment framework for text-to-image diffusion models that avoids the computational cost and limited generalization of existing RLHF- or DPO-based fine-tuning approaches. Instead of modifying model weights, the authors reinterpret preference alignment as sampling from a reward-weighted distribution, showing that the aligned score function can be decomposed into the original diffusion model score and an additional guidance term derived from a learned reward model. Experiments on text-to-image benchmarks demonstrate that the method achieves comparable or superior alignment quality to state-of-the-art fine-tuning methods.

**Strengths:**
- S1. The paper presents a principled decomposition of the aligned score function into the diffusion model score combined with a reward-based guidance term. The proposed method provides a conceptually elegant connection between preference learning and inference-time guidance, clarifying the relationship between RLHF/DPO-style methods and diffusion sampling.
- S2. The proposed method modifies neither the diffusion model parameters nor the text encoder, making it model-agnostic and straightforward to integrate with existing text-to-image pipelines.
- S3. The method achieves strong alignment performance while avoiding the heavy training overhead required by RLHF or DPO approaches. Combined with Stable Diffusion XL-Turbo, it supports one-step T2I generation, making the overall pipeline suitable for practical use and user-interactive generation scenarios.
- S4. The method consistently improves PickScore, HPS-v2, ImageReward, and Aesthetic score. Qualitative examples also show visually appealing improvements over the baselines.

**Weaknesses:**
- W1. The proposed method assumes that the forward diffusion process remains unchanged when aligning the model to human preferences, i.e., the aligned distribution $q(x_t | x_0)$ is assumed to match the original pretrained model's noising process $p(x_t | x_0)$. This assumption effectively preserves the base diffusion model's denoising trajectory, which determines the global layout and object composition of the generated image. As a result, while the proposed method is well suited to adjusting aesthetic properties or making small semantic refinements, it may struggle to produce plausible outputs for prompts requiring strong semantic correction, multi-object reasoning, or compositional control (e.g., enforcing spatial relations or specific attribute assignments). I was wondering whether the paper's method can also handle such generation tasks.
- W2. The guidance network's outputs cannot be differentiated to indicate where, or which components of, the image fail to match the textual specification. Consequently, the approach may struggle on prompts involving multiple objects or spatial relations (*e.g.*, "to the left of," "behind").
- W3. The authors use Stable Diffusion XL-Turbo for their experiments, whereas recent works use Transformer-based diffusion models beyond the U-Net-based Stable Diffusion XL. Is it possible to apply the proposed method to DiT-based models such as Stable Diffusion v3 or even FLUX?
- W4. Because the diffusion backbone remains frozen, generated outputs closely reflect the inductive biases of the reward function. Are there any additional methods or strategies to alleviate the inductive biases of a given reward function, such as PickScore or Aesthetic score?

**Questions:** Please check the weakness section.

**EditLens Prediction:** Fully human-written
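To make the decomposition highlighted in S1 concrete, the standard reward-tilted form it alludes to is sketched below. This is a generic illustration rather than the paper's exact notation: $\beta$ is a temperature, $r(x_0, c)$ a reward on the clean image given the prompt $c$, and the identity for the noisy marginal relies on exactly the shared-forward-process assumption that W1 questions.

```latex
% Generic reward-weighted target and its score decomposition (illustrative notation).
p_{\mathrm{aligned}}(x_0 \mid c) \;\propto\; p(x_0 \mid c)\, e^{\beta\, r(x_0, c)}
\quad\Longrightarrow\quad
\nabla_{x_t} \log p_{\mathrm{aligned}}(x_t \mid c)
= \underbrace{\nabla_{x_t} \log p(x_t \mid c)}_{\text{base diffusion score}}
\;+\;
\underbrace{\nabla_{x_t} \log \mathbb{E}_{x_0 \sim p(x_0 \mid x_t,\, c)}\!\left[ e^{\beta\, r(x_0, c)} \right]}_{\text{reward guidance term}}
```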
**Finetuning-free Alignment of Diffusion Model for Text-to-Image Generation**

**Ratings:** Soundness: 2: fair. Presentation: 2: fair. Contribution: 2: fair. Rating: 4: marginally below the acceptance threshold. Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

**Summary:** This paper proposes a finetuning-free method that improves the alignment of text-to-image diffusion models. It frames alignment as sampling from a reward-weighted distribution. Specifically, the paper decomposes the score function into the base score plus a guidance term and proposes a regularization technique to train the guidance model. Experimental results on the Pick-a-Pic dataset show improvements of the proposed method over baseline studies.

**Strengths:** The paper proposes a finetuning-free method that is efficient compared with finetuning-based methods. The proposed regularization strategy stabilizes the guidance signal and improves text-to-image diffusion models.

**Weaknesses:**
1. I have concerns regarding the evaluation of the proposed method. According to line 418, the evaluation is conducted *using 500 validation prompts from the validation unique split of Pick-a-Pic*. How are these prompts selected? Moreover, the baseline method SPO is evaluated on 4K prompts from Pick-a-Pic, eight times more than this method.
2. This paper evaluates its method on SDXL-Turbo, which was released in 2023. Considering the rapid emergence of new models, SDXL-Turbo is somewhat dated and cannot fully support the claimed effectiveness and generalization of the proposed method. How does the proposed method perform when generalized to recent models?
3. Figure 1 shows some visualization results, while the prompts are provided in the appendix. It is difficult for me to see the improvement of the proposed method over the baselines; the baseline method already appears to produce good enough results.
4. Is the related work intended to appear in Section 1.1 instead of Section 2?
5. The citations of the paper could be improved, e.g., line 107:
   > In (Liang et al., 2024), Liang et al. propose...

**Questions:** Please refer to the weaknesses.

**EditLens Prediction:** Fully human-written
**Finetuning-free Alignment of Diffusion Model for Text-to-Image Generation**

**Ratings:** Soundness: 2: fair. Presentation: 2: fair. Contribution: 2: fair. Rating: 4: marginally below the acceptance threshold. Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary:** This paper presents an approach to aligning text-to-image diffusion models with human preferences. Instead of relying on computationally expensive fine-tuning of the base model (such as DPO), the authors propose a lightweight, plug-and-play guidance mechanism. The key contributions are a diagnosis of why naive guidance methods often fail, attributed to the adversarial nature of the guidance signal, and a solution that trains a small, regularized guidance network to provide a stable, artifact-free signal for diffusion models.

**Strengths:**
- The finding that the adversarial nature of the guidance can lead to undesirable artifacts in the generated images is interesting.
- The ablation studies effectively demonstrate the effectiveness of each proposed component.

**Weaknesses:** Despite the theoretical discussion in this work, solid experiments to validate the proposed approach are still lacking.
- The method's effectiveness has not been validated across different diffusion model architectures, leaving its generalizability to other frameworks unclear.
- The method's performance has not been demonstrated on other datasets, which limits claims of general applicability.
- The paper lacks a sensitivity analysis for its newly introduced hyperparameters, making the method's robustness to parameter variations unclear.

Besides, there are formatting issues in Lines 72–74: the manuscript appears to contain white text, e.g., "ted in one or very few steps, the two samples would only exhibit small differences in details. SPM allows us to capture such detail differences and guide the diffusion model." This raises concerns about potential prompt injection targeting AI-assisted reviewers or, alternatively, author oversight in document preparation.

**EditLens Prediction:** Moderately AI-edited
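For context on the "naive guidance" failure mode this summary attributes to the adversarial nature of the guidance signal (the Tweedie/Backprop baseline mentioned in the first review), below is a minimal sketch of conventional backprop-through-reward guidance; all names (`denoiser`, `reward_model`, `alphas_cumprod`, `naive_reward_guidance`) are illustrative assumptions rather than the paper's code.

```python
import torch

# Sketch of the conventional "backprop through the reward" guidance baseline.
# The reward gradient is taken with respect to the Tweedie estimate of the
# clean image, so large guidance scales tend to push x_t toward reward-hacking,
# artifact-prone directions -- the adversarial behavior the review describes.
def naive_reward_guidance(denoiser, reward_model, x_t, t, cond,
                          alphas_cumprod, scale=1.0):
    x_in = x_t.detach().requires_grad_(True)
    eps = denoiser(x_in, t, cond)

    # Tweedie / DDPM posterior-mean estimate of x_0 from x_t:
    # x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps.
    a_t = alphas_cumprod[t]
    x0_hat = (x_in - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()

    # Differentiable reward evaluated on the predicted clean image.
    r = reward_model(x0_hat, cond).sum()
    grad = torch.autograd.grad(r, x_in)[0]

    # Guidance update on the noise prediction; the larger `scale` is, the
    # closer this step gets to an adversarial perturbation of the reward model.
    return eps - scale * grad
```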