ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 3 (100%) | 5.33 | 4.00 | 5324 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 0 (0%) | N/A | N/A | N/A |
| Total | 3 (100%) | 5.33 | 4.00 | 5324 |
Individual Reviews
Review 1

Title: Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes Story-Iter, a training-free iterative paradigm for long story visualization that addresses the limitations of existing auto-regressive and reference-image paradigms by using full-length frames from the previous external iteration as references and integrating a Global Reference Cross-Attention (GRCA) module to model global semantic consistency.

Strengths:
1. **Novel and Targeted Iterative Paradigm**: The proposed external iterative framework directly addresses the core limitations of existing AR and RI paradigms in long story visualization. By using full-length frames from the previous iteration as references (instead of fixed or limited frames), it effectively mitigates error accumulation and global consistency loss, a long-standing challenge in the field. (A rough sketch of this loop is given after this review.)
2. **Efficient and Lightweight GRCA Module**: The Global Reference Cross-Attention (GRCA) module, based on CLIP global embeddings, enables global semantic modeling while significantly reducing computational costs.
3. **Rigorous Experimental Validation**: Experiments on both regular-length (StorySalon) and long-story (100-frame) benchmarks show consistent SOTA performance: it outperforms StoryGen and surpasses StoryDiffusion for long stories. Human evaluations further confirm its superiority in character interaction and content consistency.
4. **Practical Training-Free Design**: Story-Iter requires no retraining, leveraging pre-trained Stable Diffusion and CLIP weights. Its plug-and-play GRCA and linear weighting strategy allow easy integration into existing pipelines, with minimal hyperparameter tuning.
5. **Strong Generalization to Multi-Style Generation**: Beyond the standard realistic style, it successfully generates long stories in comic, film, and monochrome styles, with CLIP-T scores remaining above 0.30. This demonstrates robust adaptability to diverse visual requirements.

Weaknesses:
1. **Computational Efficiency for Extremely Long Stories**: While more efficient than baselines, generating a 100-frame 1024×1024 story still incurs 4.30 PFLOPs per iteration. Including experiments based on a distilled single-step diffusion model would more convincingly demonstrate the generality and efficiency of the method.
2. **Trade-off Between Consistency and Text Alignment**: Longer iteration schedules (≥10) slightly weaken text-image alignment (CLIP-T drops from 0.330 to 0.297), and the current linear weighting strategy lacks content-aware adaptability for stories with complex plot shifts.
3. **Limited Diversity in Local Frames**: Despite global consistency, certain consecutive frames exhibit limited diversity (e.g., similar character poses). The framework lacks explicit controls (e.g., pose/layout constraints) to enhance frame-wise variation.

Questions:
Please refer to the detailed points raised in the "Weaknesses" section and respond to each numbered item in your rebuttal with clarifications.

EditLens Prediction: Fully AI-generated
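For concreteness, the external iterative paradigm summarized in this review might look roughly like the following sketch. All names here (`generate_frame`, `clip_global_embed`, the weighting bounds) are hypothetical placeholders for illustration, not the authors' implementation or API.

```python
import numpy as np

def clip_global_embed(frame):
    # Placeholder for a CLIP image encoder's pooled global embedding;
    # a toy per-channel mean stands in for the real compact vector.
    return np.asarray(frame, dtype=np.float32).mean(axis=(0, 1))

def generate_frame(prompt, ref_embeds=None, lam=0.0):
    # Placeholder for one full diffusion sampling pass. In the described
    # paradigm, a GRCA-style cross-attention would inject ref_embeds
    # (global embeddings of all previous-iteration frames) with weight lam.
    return np.zeros((1024, 1024, 3), dtype=np.float32)

def story_iter(prompts, num_iters=10, lam_start=0.2, lam_end=0.8):
    # Iteration 0: text-only generation of every frame.
    frames = [generate_frame(p) for p in prompts]
    for i in range(num_iters):
        # Reference context = compact global embeddings of ALL frames
        # from the previous iteration (one vector per frame).
        refs = np.stack([clip_global_embed(f) for f in frames])
        # Linearly increasing weight: text-driven early, consistency-driven late.
        lam = lam_start + (lam_end - lam_start) * i / max(num_iters - 1, 1)
        frames = [generate_frame(p, ref_embeds=refs, lam=lam) for p in prompts]
    return frames
```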
Review 2

Title: Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces Story-Iter, a training-free iterative paradigm for long story visualization. Unlike existing autoregressive or fixed-reference methods, Story-Iter progressively refines each frame across external iterations by referencing all frames from the previous round. The key contribution is a Global Reference Cross-Attention (GRCA) module that uses compact global embeddings instead of high-dimensional latent features, enabling scalable processing of sequences up to 100 frames. A linear weighting strategy balances visual consistency and text alignment across iterations. Experiments show state-of-the-art results over StoryGen and StoryDiffusion, with strong fine-grained interaction generation and detail preservation.

Strengths:
1) Novel iterative paradigm: External iterations that refine all frames by referencing the complete previous sequence, effectively addressing error accumulation in autoregressive methods and the limitations of fixed references.
2) Scalable GRCA module: Uses compact global embeddings rather than high-dimensional latent features, enabling 100+ frame stories with manageable memory (19GB vs. 40GB for StoryDiffusion).
3) Strong empirical validation: Consistent improvements across multiple metrics and benchmarks, with comprehensive ablations, human evaluation, and comparisons to diverse baselines.
4) Training-free and practical: Plug-and-play approach reusing IP-Adapter weights; no dataset-specific training required.
5) Thorough experimental design: Addresses both regular-length (StorySalon) and long stories (up to 100 frames), with qualitative evidence of improved fine-grained interactions.

Weaknesses:
1) Outdated base model (IP-Adapter). Even though we are in the era of Flux/SD3.5, the use of an IP-Adapter-based model on SD 1.5 (I am not sure whether it is 1.5; I checked the code and only a StoryIter-XL version is provided) or an SDXL-based implementation feels a bit outdated. I understand it was chosen to build on IP-Adapter, but reference-image-based image editing models such as Flux.1.Kontext or Nano-banana can also achieve good results in such scenarios, so the current paper feels dated.
> I suggest adding the SOTA editing models (Flux.1.Kontext / Nano-Banana) to the comparison. Since these models show good consistency, they can serve as powerful baselines. If your model performs comparably to these baselines, that would demonstrate the strength of your method.
2) The issue in (1) leads to a questionable motivation. If we can already achieve good, consistent results with Flux.1.Kontext or Nano-banana, why should we use this model to achieve consistency? I see that the point of the Story-Iter framework is to achieve consistency together with text alignment in an iterative generation setting, but if the SOTA models already show good quality, I cannot understand this paper's motivation.
> I would like the motivation of the paper to be further strengthened.
3) The issues in (1) and (2) lead to a question about the work's high computational cost: 10 external iterations for 100 frames require 4.30 PFLOPs / 25 minutes per iteration, so the total generation time is substantial despite claims of reducibility. Compared to those SOTA models, is it actually more efficient?
> How does performance scale with 3-5 iterations versus 10? What is the minimum number of iterations needed for an acceptable quality-efficiency trade-off? I also think a comparison of inference FLOPs against those SOTA editing models is necessary.
4) Text-alignment degradation: Table 8 shows CLIP-T drops from 0.330 to 0.297 as iterations increase (from 1 to 15), indicating the method overfits to visual consistency at the expense of text fidelity.
> What causes the text-alignment degradation (CLIP-T drop) with more iterations? Can this be addressed with alternative weighting strategies or architectures?
5) Limited diversity: The authors acknowledge that "local frames still exhibit diversity limitations"; the iterative refinement may overly homogenize the visual sequence.
> How does the method handle significant content changes across frames (scene transitions, new character introductions, location changes) in diverse scenarios? Looking at Story-Iter's results, the frames look very alike in colors and backgrounds. We can call that consistent, but it also means less diverse, so I would like more explanation of how the method behaves in complex, dynamically changing scenarios.
6) Incremental technical contribution: GRCA is essentially IP-Adapter's cross-attention adapted to global embeddings, and the iterative paradigm, while effective, is a relatively straightforward extension. Moreover, running a fixed number of iterations (10 in the paper) is somewhat naive.
> Even though iterating over the result is interesting, there should be some way to stop at an appropriate iteration once the desired quality is achieved (a possible stopping rule is sketched after this review).
7) Limited long-story evaluation: The custom benchmark has only 20 cases (10×50, 10×100 frames) generated by GPT-4o, which is insufficient for robust evaluation and may introduce generation biases.
> It would be good to evaluate on a larger constructed dataset, since only 20 cases is a bit small.

Questions:
Written in the weaknesses. I know that (6) and (7) are difficult to answer during the rebuttal period; please focus on answering questions (1) to (5). I will adjust my score after checking the rebuttal.

EditLens Prediction: Fully AI-generated
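As an illustration of the stopping rule suggested in weakness (6), one simple option would be to halt the external loop once consecutive iterations stop changing the frames appreciably. This is only a sketch of the reviewer's suggestion, not something the paper implements; the similarity threshold and the use of a global image embedding here are assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def iterate_with_early_stop(prompts, refine_once, embed, max_iters=10, tol=0.995):
    """Run external iterations until the frames stabilize.

    refine_once(prompts, frames) -> new list of frames (one external pass);
    embed(frame) -> global image embedding (e.g., a CLIP image feature);
    tol: if the mean cosine similarity between an iteration's frames and the
         previous iteration's frames exceeds tol, further passes are unlikely
         to change the story, so we stop early.
    """
    frames = refine_once(prompts, None)          # initial text-only generation
    prev_emb = [embed(f) for f in frames]
    for _ in range(1, max_iters + 1):
        frames = refine_once(prompts, frames)    # one more refinement pass
        cur_emb = [embed(f) for f in frames]
        mean_sim = np.mean([cosine(p, c) for p, c in zip(prev_emb, cur_emb)])
        if mean_sim >= tol:                      # frames have effectively converged
            break
        prev_emb = cur_emb
    return frames
```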
Review 3

Title: Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces Story-Iter, a novel, training-free iterative paradigm for long story visualization. The key problem addressed is maintaining semantic consistency and accurately rendering interactions in long image sequences generated from text prompts. Existing methods, primarily Auto-Regressive (AR) and Reference-Image (RI) paradigms, suffer from error accumulation or an inability to adapt to the full story context. Story-Iter proposes an external iterative loop that refines the entire story sequence in each pass. In iteration `i`, the generation of each frame is conditioned on the text prompt *and* all generated frames from the previous iteration `i-1`. This is enabled by a plug-and-play Global Reference Cross-Attention (GRCA) module, which uses the global CLIP embeddings of all previous frames as context. The influence of this visual context is controlled by a linearly increasing weight $\lambda_i$ across iterations, balancing text alignment and visual consistency. The authors demonstrate through extensive experiments on both regular-length (StorySalon) and a new long-story benchmark that Story-Iter achieves state-of-the-art performance in generating coherent and high-quality story visualizations, particularly for sequences up to 100 frames.

Strengths:
1. **Novel and Effective Paradigm:** The core strength of the paper is the proposal of an iterative refinement paradigm for story visualization. This is a genuinely new approach in this domain that effectively addresses the key challenge of long-range consistency by allowing the model to gain a global view of the entire story and refine it over multiple passes. It elegantly sidesteps the error accumulation of AR models and the rigidity of RI models.
2. **State-of-the-Art Performance:** The method achieves impressive empirical results, outperforming strong recent baselines like StoryDiffusion and StoryGen on benchmarks for both regular and long-story generation. The qualitative results, especially for long stories (up to 100 frames), are particularly compelling and clearly demonstrate superior character consistency and interaction modeling.
3. **Excellent Presentation and Clarity:** The paper is extremely well written, with clear explanations and high-quality figures that make the concepts easy to understand. The comprehensive appendix and extensive visual examples further bolster the quality of the submission.
4. **Practicality and Accessibility:** By designing Story-Iter as a training-free, plug-and-play module that leverages pre-trained IP-Adapter weights, the authors have made their method easy to implement and use. This significantly lowers the barrier to entry for other researchers to verify, use, and extend this work.

Weaknesses:
1. **Prohibitive Computational Cost:** The most significant weakness is the method's computational expense. An $L$-iteration process results in an $L$-fold increase in generation time compared to single-pass methods. The default of 10 iterations makes the method an order of magnitude slower than its competitors. This is a major practical limitation that is understated in the main paper and largely relegated to the appendix. While the authors suggest using acceleration techniques, no experiments are presented to show this is feasible without sacrificing quality. This trade-off between quality and compute needs to be a more central part of the discussion.
2. **Limited Technical Novelty of GRCA:** The GRCA module is a direct reuse of the cross-attention mechanism and weights from IP-Adapter. While its application is novel, the module itself does not introduce a new architectural or algorithmic concept. The paper's contribution lies almost entirely in the iterative paradigm, not in the attention mechanism itself. This should be stated more explicitly to manage expectations about the technical depth of the proposed module.
3. **Insufficient Justification of the Iterative Process:** The paper provides an intuitive but superficial explanation for why the iterative process works. The argument for convergence is based on a t-SNE visualization (Fig. 4), which is insufficient. A more formal discussion on the dynamics of this process would be beneficial. For example, what prevents the process from "over-fitting" to the initial generation and losing diversity, or collapsing to a mode where all images are nearly identical? The linear weighting $\lambda_i$ is a heuristic to prevent this, but its behavior is not deeply analyzed.
4. **Incomplete Ablation Studies:** The ablation study, while good, could be more comprehensive. The `w/o GRCA` ablation replaces global attention with a per-frame self-refinement, which is a very different task. A more informative ablation would be to compare GRCA against simpler aggregation strategies within the same iterative paradigm (a sketch of such variants follows this review). For instance:
   * What if only a sliding window of $k$ previous frames is used as reference?
   * What if only a random subset of frames is used?
   * What if the global embeddings `c_1...B` are simply mean-pooled into a single context vector?
   This would help to truly justify the need for attending to *all* frames in the sequence.

Questions:
1. **Computational Cost:** Could you please add a table to the main paper that explicitly compares the end-to-end inference time (or total GFLOPs) and peak VRAM usage of Story-Iter (10 iterations) against StoryDiffusion and StoryGen for generating a 100-frame story? This would provide a clearer picture of the quality-compute trade-off.
2. **Dynamics of Iteration:** The linearly increasing weight $\lambda_i$ is crucial for balancing text alignment and consistency. Is there a risk that in later iterations the strong visual conditioning from `GRCA` overpowers the text prompt $T_k$, leading to a failure to generate new objects or actions described in the text? The slight drop in CLIP-T in Table 8 suggests this might be the case. How does the model handle the introduction of entirely new characters or settings late in the story across iterations?
3. **Metric Definition:** For reproducibility and clarity, could you please add formal definitions of `aCCS` and `aFID` to the appendix? Specifically, how are character bounding boxes obtained for the `aCCS` calculation, and is it robust to detection failures? For `aFID`, is the distance computed between all pairs of images, consecutive images, or against a reference set?
4. **Sensitivity to Hyperparameters:** The linear schedule for $\lambda_i$ is defined by $\lambda_1$ and $\lambda_L$. How sensitive is the final result to these start/end points? Does the optimal schedule depend on the story length $B$ or the diversity of the content within the story? For example, would a story with many scene changes require a different schedule than one with a fixed background?
5. **Necessity of Full Context:** The GRCA module attends to all $B$ frames from the previous iteration. Have you investigated whether this is necessary? Could a similar level of performance be achieved with a more efficient context, such as a fixed-size window of $k$ surrounding frames, or a stochastically sampled subset of frames? This could be a path to mitigating the high computational cost.
6. **Potential for Training:** The training-free aspect is a great feature for accessibility. However, have you considered the possibility that fine-tuning the IP-Adapter weights (or a dedicated GRCA module) on the task of iterative story refinement could lead to even better performance or, more importantly, reduce the number of required iterations, thus improving efficiency?

EditLens Prediction: Fully AI-generated
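To make the ablation suggested in weakness 4 (and question 5) concrete, the context-aggregation variants could be prototyped roughly as below. This is only an illustrative sketch under assumed shapes (per-frame global embeddings of dimension d stacked into a B×d array); none of these functions come from the paper or its code.

```python
import numpy as np

def full_context(embeds, idx=None, k=None, rng=None):
    # Story-Iter-style reference set: global embeddings of ALL B frames.
    return embeds                                    # shape (B, d)

def sliding_window(embeds, idx, k=4, rng=None):
    # Only the k frames on either side of the frame being refined (index idx).
    lo, hi = max(0, idx - k), min(len(embeds), idx + k + 1)
    return embeds[lo:hi]

def random_subset(embeds, idx=None, k=8, rng=None):
    # A stochastically sampled subset of frames as reference context.
    rng = rng or np.random.default_rng(0)
    pick = rng.choice(len(embeds), size=min(k, len(embeds)), replace=False)
    return embeds[np.sort(pick)]

def mean_pooled(embeds, idx=None, k=None, rng=None):
    # Collapse all frame embeddings into a single global context vector.
    return embeds.mean(axis=0, keepdims=True)        # shape (1, d)
```

Each variant yields a reference set that a GRCA-style cross-attention layer could attend to when refining frame `idx`; comparing them within the same iterative loop would directly test whether attending to all $B$ frames is necessary.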