ICLR 2026 - Reviews

Summary Statistics

| EditLens Prediction   | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|-----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated    | 2 (50%)  | 5.00       | 3.50           | 3548               |
| Heavily AI-edited     | 1 (25%)  | 4.00       | 4.00           | 3524               |
| Moderately AI-edited  | 0 (0%)   | N/A        | N/A            | N/A                |
| Lightly AI-edited     | 1 (25%)  | 2.00       | 3.00           | 2115               |
| Fully human-written   | 0 (0%)   | N/A        | N/A            | N/A                |
| Total                 | 4 (100%) | 4.00       | 3.50           | 3184               |
Reviews
Review 1
Title: IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation
Soundness: 2 (fair) | Presentation: 1 (poor) | Contribution: 1 (poor)
Rating: 2 (reject)
Confidence: 3 (fairly confident; some parts of the submission or related work may not have been fully understood, and math/other details were not carefully checked)

Summary:
The paper proposes IUT-Plug, a method that sits between a VLM and a T2I model when generating images based on VLM outputs. The method adds additional context from the image to the VLM text output to improve T2I image generation. The approach is benchmarked on a custom dataset that includes expert annotations.

Strengths:
- The method is relatively straightforward to employ given model access.

Weaknesses:
- W1: The contextual information extracted by the tool should arguably be easily recognized by a well-performing text-to-image model given a good descriptive text output from the VLM. The paper does not demonstrate that current T2I models fail to capture this information without the proposed method.
- W2: The tool merely extracts information from the image and appends it to the T2I input prompt (a minimal sketch of this step follows the review). That such additional context does not degrade model performance and can improve results in cases where original image information is lost appears trivial, as providing more relevant information naturally supports image generation.
- W3: The practical relevance of the setting is questionable. It remains unclear in what scenarios VLMs and T2I models are deployed sequentially as separate components. Current understanding within the AI community suggests that VLMs with image generation capabilities (such as GPT-4o) employ unified architectures combining autoregressive and diffusion generation, rather than two discrete models.
- W4: The manuscript is difficult to read and follow. The narrative structure and motivation are often unclear. Abbreviations are introduced multiple times (e.g., T2I appears at least three times), and numerous spelling errors are present (e.g., a stray "(" at line 314).
- W5: The claim that the plug-in constitutes a "world model" (line 066) is not justified. The method lacks the inherent properties of world models, as it merely deconstructs the image into text.
- W6: The exact method by which concepts are extracted from the image is not explained.
- W7: It is also not described how the benchmark is constructed or what the "expert annotations" really are.

Questions:
See weaknesses W1 to W7.

EditLens Prediction: Lightly AI-edited
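The prompt-augmentation step that W2 describes can be illustrated with a minimal sketch. All names below (`extract_visual_context`, `augment_t2i_prompt`) and the example scene are hypothetical stand-ins, not the paper's actual API; the point is only to show image-derived context being appended to the T2I prompt.

```python
# Minimal sketch of the pipeline W2 describes: parse the source image into
# structured context, then append a serialized form of it to the T2I prompt.
# All names and the canned scene content are illustrative assumptions.

def extract_visual_context(image_description: str) -> dict:
    """Stand-in for the plug-in's image-parsing step (hypothetical)."""
    return {
        "entities": ["a red kite", "a child"],
        "attributes": {"a red kite": ["diamond-shaped"], "a child": ["wearing a blue coat"]},
        "relations": [("a child", "holds", "a red kite")],
    }

def augment_t2i_prompt(vlm_text: str, context: dict) -> str:
    """Append the serialized context to the VLM's text output before T2I generation."""
    entity_str = "; ".join(
        f"{e} ({', '.join(context['attributes'].get(e, []))})" for e in context["entities"]
    )
    relation_str = "; ".join(f"{s} {p} {o}" for s, p, o in context["relations"])
    return f"{vlm_text}\nScene entities: {entity_str}\nRelations: {relation_str}"

if __name__ == "__main__":
    ctx = extract_visual_context("a child flying a kite on a beach")
    print(augment_t2i_prompt("Draw the same scene at sunset.", ctx))
```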
Review 2
Title: IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation
Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good)
Rating: 6 (marginally above the acceptance threshold)
Confidence: 4 (confident, but not absolutely certain; it is unlikely, though not impossible, that some parts of the submission or related work were not fully understood)

Summary:
IUT-Plug is a novel plug-in module designed to tackle multimodal context drift in interleaved image-text generation, where models often fail to maintain logical consistency, object identities, and stylistic coherence across combined visual and textual outputs. The approach introduces an Image Understanding Tree (a hierarchical symbolic representation of the visual scene), which is integrated into existing vision-language model pipelines to provide explicit structured reasoning and enforce consistency constraints on the generation process.

Strengths:
1. The paper introduces a lightweight, modular plug-in that can be attached to existing VLM+T2I pipelines without retraining or architectural changes.
2. The authors propose a dynamic, semantics-focused evaluation framework as a key contribution. Instead of relying on coarse metrics like FID or CLIP score, they generate custom consistency questions for each test case and use a fine-tuned VLM to score yes/no answers, achieving much higher agreement with human evaluators (87.6% vs. ~55% for static baselines) in judging style, logic, and entity consistency (a minimal sketch of such a scoring loop follows the review).
3. The paper provides strong experimental evidence that IUT-Plug yields tangible improvements on challenging interleaved generation tasks, with results on both the new 3,000-pair benchmark and public datasets.

Weaknesses:
1. The proposed solution introduces a complex pipeline with multiple large-scale components (a scene parser, an LLM prompt generator, a text-to-image model, and a custom evaluator), which could hinder reproducibility and practical deployment.
2. IUT-Plug's performance is contingent on the quality of its visual scene understanding: an error in the IUT extraction (e.g., a missed or misidentified object or relationship) could propagate incorrect constraints to the generation stage.
3. The experimental results, though promising on consistency, focus mainly on the proposed criteria (style/logic/entity consistency) and QA accuracy. There is less discussion of other aspects of output quality (e.g., image realism or linguistic richness) and no user study to confirm human preference.

Questions:
1. How well would IUT-Plug generalize to other models or domains beyond those tested? For example, could the plug-in be readily used with a different VLM (such as GPT-4's vision capabilities or upcoming multimodal models) and on tasks such as interactive storytelling or dialog, and if so, would any modifications be needed to maintain its effectiveness?
2. What is the runtime overhead of inserting IUT-Plug into the pipeline?
3. The use of GPT-5 for criterion generation is ambitious. Did the authors consider using GPT-4 or an open-source model for this, and what was the impact on criterion quality?

EditLens Prediction: Fully AI-generated
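The dynamic evaluation described in Strength 2 can be summarized with a small sketch. The callables `generate_criteria` and `judge_yes_no` are assumed stand-ins for the LLM criterion generator and the fine-tuned yes/no judge; nothing here reflects the paper's actual implementation.

```python
# Sketch of a per-sample consistency score (fraction of generated yes/no
# questions the judge answers "yes") and a simple agreement rate against
# human votes. Both helper callables are hypothetical placeholders.

from typing import Callable, List

def consistency_score(sample: dict,
                      generate_criteria: Callable[[dict], List[str]],
                      judge_yes_no: Callable[[dict, str], bool]) -> float:
    """Fraction of per-sample consistency questions the judge answers 'yes'."""
    criteria = generate_criteria(sample)  # e.g. "Is the dog's collar still red?"
    if not criteria:
        return 0.0
    passed = sum(judge_yes_no(sample, q) for q in criteria)
    return passed / len(criteria)

def human_agreement(model_votes: List[bool], human_votes: List[bool]) -> float:
    """Agreement rate between automatic and human yes/no judgments."""
    matches = sum(m == h for m, h in zip(model_votes, human_votes))
    return matches / len(human_votes)

if __name__ == "__main__":
    # Toy usage with stub callables.
    score = consistency_score(
        {"id": 0},
        generate_criteria=lambda s: ["Same style?", "Same entities?"],
        judge_yes_no=lambda s, q: True,
    )
    print(score, human_agreement([True, False, True], [True, True, True]))
```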
Review 3
Title: IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation
Soundness: 2 (fair) | Presentation: 3 (good) | Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold)
Confidence: 4 (confident, but not absolutely certain; it is unlikely, though not impossible, that some parts of the submission or related work were not fully understood)

Summary:
This paper addresses the critical issue of multimodal context drift in interleaved image-text generation, where vision-language models (VLMs) fail to maintain consistency in logic, entity identity, and style across multiple turns. The authors propose IUT-Plug, a training-free plug-in module that extracts a symbolic representation of the visual scene, called an Image Understanding Tree (IUT), to explicitly guide the generation process. To measure the method's effectiveness, the paper also introduces a novel dynamic evaluation framework that uses large models to generate and score fine-grained consistency criteria. Experiments show that IUT-Plug improves consistency scores across several VLM and text-to-image model combinations.

Strengths:
The paper makes a valuable contribution by tackling the significant and well-defined problem of context drift in generative VLMs. The proposed IUT-Plug, a neuro-symbolic and training-free module, is a novel and practical approach. Furthermore, the introduction of a new evaluation framework that moves beyond standard metrics to assess semantic consistency is a commendable effort that can benefit the broader research community.

Weaknesses:
1. Benchmark validity and accessibility: The evaluation relies entirely on a new, in-house benchmark of 3,000 samples. This raises concerns about potential dataset bias, where the collected data might favor structured, symbolic reasoning and thus unfairly advantage the proposed method. For the work to be verifiable and impactful, several key questions must be addressed:
   - Will the benchmark be made public?
   - What is the estimated API cost and procedure for running one full evaluation, given its reliance on proprietary models like GPT-5?
2. Insufficient comparison to simpler baselines: The primary comparison is between using IUT-Plug and not using it. However, the paper fails to compare against simpler, training-free alternatives that could also enhance consistency. For example, a strong baseline would be advanced prompt engineering, such as instructing the VLM to first generate a structured textual description of the scene (entities, attributes, relationships) and then use that description to form the final prompt for the text-to-image model (a minimal sketch of such a baseline follows the review). Without comparing against such methods, the added complexity of the IUT framework is not fully justified over more straightforward prompting strategies.
3. Unclear robustness of the IUT representation: The paper's examples feature scenes with clear, discrete objects. The scalability and robustness of the IUT structure for more complex or ambiguous scenarios are not discussed. It is unclear how the method would handle:
   - Abstract concepts (e.g., generating an image conveying "a sense of loneliness").
   - Highly cluttered scenes with many interacting objects.
   - Visually ambiguous elements that defy simple entity-attribute-relation decomposition.
   This leaves the generalizability of the approach in question.
4. Limited qualitative analysis: The visual results provided are limited and appear to be success cases. A more comprehensive qualitative analysis should include:
   - A wider variety of generated examples to showcase performance across different domains.
   - Crucially, a discussion of failure cases to provide a balanced understanding of the method's limitations.
   - Illustrative examples from the benchmark itself to help the reader understand the nature and difficulty of the evaluation tasks.

Questions:
Please address the concerns listed in the weaknesses.

EditLens Prediction: Heavily AI-edited
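The prompt-engineering baseline suggested in Weakness 2 could look roughly like the following. `call_vlm` and `STRUCTURE_PROMPT` are assumed placeholders for whatever VLM client and instruction would actually be used; this is a sketch of the reviewer's suggestion, not code from the paper.

```python
# Sketch of a training-free two-step baseline: ask the VLM for a structured
# scene description, then splice it into the final T2I prompt. `call_vlm`
# is a hypothetical stand-in that returns a canned response here.

import json

STRUCTURE_PROMPT = (
    "Describe the scene as JSON with keys 'entities', 'attributes', and "
    "'relationships', using short noun phrases only."
)

def call_vlm(image, prompt: str) -> str:
    """Placeholder for a real VLM call; returns a canned structured description."""
    return '{"entities": ["cat"], "attributes": {"cat": ["orange"]}, "relationships": []}'

def structured_prompt_baseline(image, user_request: str) -> str:
    """Two-step prompting: structured description first, then the final T2I prompt."""
    scene = json.loads(call_vlm(image, STRUCTURE_PROMPT))
    return (
        f"{user_request}. Keep these elements consistent with the source image: "
        f"entities={scene['entities']}, attributes={scene['attributes']}, "
        f"relationships={scene['relationships']}"
    )

if __name__ == "__main__":
    print(structured_prompt_baseline(None, "Render the same cat in a snowy street"))
```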
Review 4
Title: IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation
Soundness: 2 (fair) | Presentation: 2 (fair) | Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold)
Confidence: 3 (fairly confident; some parts of the submission or related work may not have been fully understood, and math/other details were not carefully checked)

Summary:
This paper addresses a critical and well-recognized problem in modern vision-language models (VLMs): multimodal context drift during interleaved image-text generation. The proposed solution, IUT-Plug, aims to enhance consistency in logic, entity identity, and visual style by introducing an Image Understanding Tree (IUT), a hierarchical symbolic structure representing visual scene elements and their relationships (an illustrative sketch of one possible node schema follows the review). This IUT is dynamically updated and used to guide both textual responses and text-to-image (T2I) generation. The authors also introduce a dynamic evaluation framework employing LLMs to generate task-specific criteria, which are then scored by a fine-tuned VLM.

Strengths:
1. The paper correctly identifies and targets multimodal context drift, a significant limitation of current state-of-the-art VLMs in multi-turn interactions. The use of an explicit, structured symbolic representation (IUT) to maintain consistency across modalities and turns is a conceptually appealing approach, aligning with principles from neuro-symbolic AI.
2. The proposed lightweight and model-agnostic plug-in architecture has the potential to offer a practical way to enhance existing VLM-T2I pipelines without requiring extensive retraining of large foundation models.
3. The dynamic evaluation protocol, using LLMs to generate task-specific criteria and a fine-tuned VLM for scoring, is an interesting methodological contribution toward more nuanced and human-aligned assessment of multimodal consistency. The reported 87.6% agreement with human judgment is notable, provided the underlying LLM for criteria generation is verifiable.

Weaknesses:
1. The repeated claim of using "GPT-5" for dynamic criterion generation undermines the reproducibility and scientific credibility of the entire evaluation framework.
2. The paper provides no specific technical details on how the Image Understanding Tree (IUT) is constructed from an input image. This is a black box at the core of the proposed method, making it impossible to understand, reproduce, or critically evaluate the technical contribution.
3. While IUT-Plug shows relative improvements (e.g., 7.2 to 10.5 percentage points), the absolute consistency scores remain quite low (often in the 30-40% range even with IUT-Plug). This suggests that the models still frequently fail to maintain consistency, and the "alleviation" of context drift is partial at best. This should be discussed more transparently.
4. The paper does not adequately discuss the inherent expressiveness or limitations of the IUT representation for complex logical reasoning, abstract concepts, or handling ambiguity in visual scenes.
5. The claim that IUT-Plug is "lightweight" is not supported by any quantitative data (e.g., computational overhead, inference time, memory footprint). Adding an additional processing pipeline will inevitably introduce some overhead.
6. While scene graphs are mentioned, the paper does not sufficiently differentiate IUTs from existing scene graph generation and manipulation techniques, especially regarding dynamic updates. The assertion that existing scene graph methods "do not support updates across interactions" needs stronger evidence, as dynamic scene graphs are an active area of research.

Questions:
1. Could the authors provide a detailed technical description of the "dynamic IUT-Plug extraction module"? What specific computer vision models or techniques are used to parse visual scenes into objects, attributes, and relationships? What is the pipeline for this extraction?
2. Please define "Situational Analysis" and "Project-based Learning" as used in Table 1. What do these benchmarks entail, and how are their scores calculated?
3. Given that even with IUT-Plug, consistency scores often remain below 50%, could the authors elaborate on the practical implications of these results? What level of consistency is considered "acceptable" for real-world interleaved generation tasks, and what are the next steps to further improve these scores?
4. Can the authors provide quantitative metrics (e.g., average inference-time increase, memory usage) for the IUT-Plug module when integrated into a VLM-T2I pipeline? This would support the claim of being "lightweight."
5. How does the IUT handle complex logical inferences, abstract concepts, or scenarios with significant ambiguity? What are the known failure modes of the IUT extraction or representation itself?

EditLens Prediction: Fully AI-generated
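Since the paper apparently does not specify the IUT schema (Weakness 2), the following is only an illustrative guess at what a hierarchical, updatable scene structure of the kind the summary describes might look like. Every field and the naive `update` rule are assumptions of this sketch, not details from the submission.

```python
# Illustrative guess at an Image Understanding Tree node: a named scene element
# with attributes, child nodes, and relations, plus a naive turn-to-turn merge.
# The actual IUT format is not documented in the paper; this is hypothetical.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class IUTNode:
    name: str                                   # e.g. "scene", "dog", "collar"
    attributes: List[str] = field(default_factory=list)
    children: List["IUTNode"] = field(default_factory=list)
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (predicate, target name)

    def update(self, other: "IUTNode") -> None:
        """Naive turn-to-turn update: merge attributes and children by name."""
        self.attributes = list(dict.fromkeys(self.attributes + other.attributes))
        existing = {c.name: c for c in self.children}
        for child in other.children:
            if child.name in existing:
                existing[child.name].update(child)
            else:
                self.children.append(child)

if __name__ == "__main__":
    scene = IUTNode("scene", children=[IUTNode("dog", ["brown"], relations=[("wears", "collar")])])
    scene.update(IUTNode("scene", children=[IUTNode("dog", ["sitting"])]))
    print(scene)
```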