ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 0 (0%) N/A N/A N/A
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 0 (0%) N/A N/A N/A
Lightly AI-edited 1 (25%) 4.00 3.00 1732
Fully human-written 3 (75%) 6.00 3.67 1811
Total 4 (100%) 5.50 3.50 1791
Review 1
Title: APEX: One-Step High-Resolution Image Synthesis
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: The paper proposes an improved strategy for distilling T2I diffusion models for few-step sampling. The core idea is to update the training loss so that the velocity field of the local predictions during sampling is smooth (enabling larger sampling steps), and to introduce a discriminator-free adversarial objective that ensures high quality of the final samples. The evaluation on several datasets, using 1 and 2 NFEs, shows improved performance across various baselines and underlying models.

Strengths: The paper is well motivated and well presented. The introduced updates to the distillation training seem novel, and the ablation studies do show their effectiveness. The evaluation is exhaustive, covering several baselines, models, and evaluation metrics. The results seem to be very strong, on par with many of the SOTA baselines while using fewer or the same number of NFEs. Not having to rely on additional external models for, e.g., adversarial losses is a big plus.

Questions: What is the cost of the distillation compared to other distillation approaches? While there does not seem to be a need for additional models besides the teacher (EMA?) model, it would be interesting to see a comparison of the NFE/memory requirements of one training step of this approach against other approaches such as adversarial distillation or consistency models. What are the limitations of this method? Are there any specific limitations during training or inference, or in other aspects of the approach? How stable is the training? Are any specific tricks necessary, or is training generally stable? How does the model perform when the number of NFEs is increased, e.g., to 8 or more steps? Does performance increase further?

EditLens prediction: Fully human-written
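As a rough illustration of the "smooth velocity field" idea mentioned in the summary above (this is only a sketch of one plausible formulation, not necessarily the paper's actual objective), a curvature penalty on the probability-flow path can be written as

$$
\mathcal{L}_{\text{curv}} = \mathbb{E}_{x_t,\, t}\left\| \tfrac{\mathrm{d}}{\mathrm{d}t}\, v_\theta(x_t, t) \right\|_2^2
= \mathbb{E}_{x_t,\, t}\left\| \partial_t v_\theta(x_t, t) + \big(\nabla_{x_t} v_\theta(x_t, t)\big)\, v_\theta(x_t, t) \right\|_2^2 ,
$$

where $v_\theta$ is the learned velocity field and $x_t$ evolves along $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t)$. If this total derivative is small, the field changes little along the trajectory, so large Euler steps (and hence 1-2 NFEs) incur little discretization error.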
Review 2
Title: APEX: One-Step High-Resolution Image Synthesis
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: APEX is a new text-to-image synthesis method designed to solve the trilemma of high fidelity, inference efficiency, and training efficiency, which current adversarial (slow, unstable) and distillation (low quality) methods fail to achieve simultaneously. The core innovation is a self-condition-shifting adversarial mechanism that eliminates the external discriminator. This design ensures exceptional training stability and efficiency, making it ideal for LoRA tuning. Experimentally, APEX achieves SOTA high-fidelity synthesis with NFE=1, offering a 15.33x speedup over Qwen-Image 20B. It also improves GenEval scores on the 20B model in just 6 hours of LoRA tuning.

Strengths:
1. The paper successfully identifies and frames a critical and highly relevant challenge (the "trilemma" of fidelity, inference speed, and training stability in one-step generation), making a significant motivational contribution to the research community.
2. The work is presented with exceptional technical clarity; the novel mechanisms are thoroughly explained, and the overall structure and accompanying figures are highly effective, making the complex concepts very accessible to the reader.

Weaknesses: The training data utilized (ShareGPT-4o and BLIP-3o) appears to be highly specialized for the GenEval benchmark, which is also the primary benchmark on which the paper achieves SOTA results. Given that performance on the more general DPGBench is suboptimal, there is a concern that the observed performance gain might be due to overfitting on the highly specialized training data rather than a generalizable technical improvement. Further experiments on a broader range of established benchmarks are necessary to confirm the robustness of the APEX method.

Questions: No

EditLens prediction: Lightly AI-edited
Review 3
Title: APEX: One-Step High-Resolution Image Synthesis
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper introduces APEX, a method for efficient one-step text-to-image synthesis that addresses the fundamental trade-off between path integrability and endpoint fidelity in generative models. The core innovation lies in two complementary mechanisms: (1) a higher-order path self-consistency constraint that regularizes path curvature for numerical stability under large discretization steps, and (2) a discriminator-free self-condition-shifting adversarial mechanism that ensures high perceptual quality without the training instabilities of traditional GANs. APEX achieves state-of-the-art performance with NFE=1, demonstrating a 15.33x speedup over Qwen-Image 20B while maintaining comparable quality (GenEval score of 0.89 vs. 0.87).

Strengths:
- It is interesting to categorize many existing works into path integrability and endpoint fidelity.
- APEX demonstrates impressive performance across multiple scales.
- Detailed ablations examining training steps, loss-component weights, and hyperparameters provide a good understanding of what drives performance.

Weaknesses:
- Unconvincing evaluation: the paper states that the model is mainly evaluated by the GenEval score. However, GenEval does not measure image fidelity, so the evaluation is insufficient. The other metrics used, FID and CLIP score, are also less convincing for evaluating modern text-to-image models. More importantly, training is performed on BLIP-3o, which contains text-image pairs that were generated specifically from GenEval prompts. As far as I know, training on BLIP-3o can lead to a high GenEval score (Table 4 in the paper also shows this phenomenon), making the GenEval evaluation unfair.
- CTM [a], MeanFlow [b], and IMM [c] all belong to the class of methods that simultaneously consider path integrability and endpoint fidelity. However, the paper largely lacks discussion of and comparison with these works.
- The quality of the visualized samples looks mediocre and is not as impressive as the evaluation scores.
- No visual comparison with recent strong baselines.
- The mechanism of the so-called self-adversarial objective is unclear, and I do not see where the "adversarial" component lies.

[a] Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion
[b] Mean Flows for One-step Generative Modeling
[c] Inductive Moment Matching

Questions: See Weaknesses.

EditLens prediction: Fully human-written
Review 4
Title: APEX: One-Step High-Resolution Image Synthesis
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: Current one-step generation methods either introduce training instability, high GPU memory costs, and slow convergence, or struggle to generate high-quality images. This paper proposes APEX, which applies the standard diffusion/flow loss and a DMD-like loss to a single model. This design leads to training effectiveness, efficiency, and stability. Experiments on large-scale models such as Qwen-Image 20B validate the effectiveness of the proposed method.

Strengths:
1. The paper is well written, and the method is simple and easy to follow.
2. Compared to methods based on f-divergence, this method does not need to train multiple models; compared to consistency models, it achieves high-quality generation.
3. Experiments verify the ability to scale to large models like Qwen-Image 20B while achieving good performance.

Weaknesses: Overall, this method improves upon f-divergence-based methods such as DMD, VSD, and SiD by training with only one model. The adversarial mechanism proposed in this paper is an existing approach in those methods, so the paper's contribution is the single-model design.

Questions:
1. In the endpoint objective, why is $x_{\text{fake}}$ drawn from a buffer of recent generations rather than generated online with a stop gradient?
2. Why do you use $c_{\text{pert}} = Ac + b$ as the condition rather than other choices such as an additional flag?

EditLens prediction: Fully human-written
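The two questions above refer to mechanisms that can be illustrated with a short, hedged sketch, assuming a PyTorch-style setup: an affine perturbation of the condition embedding, $c_{\text{pert}} = Ac + b$, and a fixed-size buffer of recent generations from which $x_{\text{fake}}$ is sampled. The class names CondPerturb and FakeBuffer are hypothetical and not taken from the paper.

# Hedged sketch, assuming a PyTorch setup; names are illustrative, not the paper's.
import random
import torch
import torch.nn as nn

class CondPerturb(nn.Module):
    """Learned affine shift of the text-condition embedding: c_pert = A c + b."""
    def __init__(self, dim: int):
        super().__init__()
        self.affine = nn.Linear(dim, dim, bias=True)  # weight plays the role of A, bias of b

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        return self.affine(c)

class FakeBuffer:
    """Fixed-size buffer of recent generator outputs, from which x_fake is drawn."""
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.items: list[torch.Tensor] = []

    def push(self, x: torch.Tensor) -> None:
        # Detach so buffered samples carry no gradient back to the generator.
        self.items.extend(x.detach().cpu().unbind(0))
        self.items = self.items[-self.capacity:]

    def sample(self, n: int) -> torch.Tensor:
        batch = random.sample(self.items, min(n, len(self.items)))
        return torch.stack(batch, dim=0)

Relative to generating $x_{\text{fake}}$ online with a stop gradient, a buffer of this kind decouples the "fake" distribution from the current generator parameters at a small memory cost, which is presumably the trade-off the first question is probing.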