ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 4.00 | 3.00 | 4554 |
| Heavily AI-edited | 1 (25%) | 4.00 | 4.00 | 5255 |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 2 (50%) | 5.00 | 3.50 | 4210 |
| Total | 4 (100%) | 4.50 | 3.50 | 4557 |
UEval: A Real-World Benchmark for Unified Multimodal Generation

Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces a new benchmark for evaluating multimodal generation. Per-sample, curated, rubric-based LLM-as-judge evaluations are proposed as an alternative to traditional single-prompt LLM-as-judge techniques. Proprietary models reach 66% accuracy, while open-source models perform worse, highlighting the benchmark's difficulty.

Strengths:
1. The proposed idea of generating a per-sample rubric for evaluating multimodal generation is novel and addresses the instability issues observed with standard single-prompt multimodal LLM-as-judge evaluations.
2. Both open-source and proprietary models score modestly on the benchmark, showing that it is challenging for current state-of-the-art models in multimodal generation.
3. The methodology of generating rubrics with a strong multimodal LLM followed by human verification is sound.
4. The qualitative analysis and discussion of model behavior patterns are well presented. Alignment of the proposed metric with human judgements, along with alignment of rubric scoring with reference images, is reported, making the work well rounded.

Weaknesses:
1. The paper states that pairwise win-rate judging is unstable and that single-prompt LLM-as-judge evaluation overlooks sample-specific differences. Examples or studies demonstrating these deficiencies of the current standard methods would strengthen the motivation.
2. Gemini-2.5-Pro, a proprietary model, is used as the judge. This hurts reproducibility, since API-based models can change over time. It would be interesting to see results using leading open-weights multimodal models as the judge.
3. For open-ended tasks like "art", where there can be multiple correct answers, a discussion of whether the rubrics and the chosen judge lead to fair evaluations is missing. A deeper analysis of the low human-judgement alignment scores for such tasks would also be useful.

Questions:
1. Can leading open-weights multimodal models replace Gemini-2.5-Pro as the judge? Do they show similarly high correlation with human judgement? If not, what are the shortcomings of using them?
2. Could the authors provide a deeper analysis of how the judgements for "art" vary between LLMs and humans, including the fairness and biases of judge models in such open-ended cases?

EditLens Prediction: Fully human-written
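To make the per-sample rubric protocol discussed in this review concrete, here is a minimal sketch of criterion-by-criterion judging. The `Criterion` structure, prompt wording, and weighting are illustrative assumptions rather than the paper's actual implementation, and the `judge` callable stands in for whatever multimodal LLM is used behind an API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    """One rubric item, e.g. 'The image shows the statue's internal iron frame.'"""
    description: str
    weight: float = 1.0

def score_sample(response_text: str,
                 response_image_caption: str,
                 rubric: List[Criterion],
                 judge: Callable[[str], bool]) -> float:
    """Score one model output against its per-sample rubric.

    `judge` is any yes/no oracle (e.g. a multimodal LLM judge); it receives a
    single criterion-check prompt and returns True if the criterion is met.
    Returns a 0-100 score.
    """
    total = sum(c.weight for c in rubric)
    earned = 0.0
    for c in rubric:
        prompt = (
            "Given the model's answer below, does it satisfy this criterion?\n"
            f"Criterion: {c.description}\n"
            f"Answer text: {response_text}\n"
            f"Image (described): {response_image_caption}\n"
            "Reply yes or no."
        )
        if judge(prompt):
            earned += c.weight
    return 100.0 * earned / total if total else 0.0
```

The contrast with a single-prompt judge is that the final score aggregates many narrow yes/no checks tailored to the sample, rather than one holistic rating from a generic prompt.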
UEval: A Real-World Benchmark for Unified Multimodal Generation

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces UEVAL, a new benchmark for evaluating unified multimodal models (i.e., models that can generate both images and text simultaneously). The benchmark consists of 1,000 expert-curated prompts from 8 real-world domains. Its core contribution is an innovative, data-dependent, rubric-based evaluation framework that replaces traditional methods. Through an evaluation of 9 leading models, the paper reveals the significant challenges this task poses to current models and highlights the critical role of "reasoning" in complex multimodal generation.

Strengths:
1. The paper accurately identifies and addresses a significant gap in the current multimodal evaluation landscape: the lack of effective metrics for unified image-text generation. The proposed rubric-based evaluation method, which customizes scoring criteria for each prompt, represents a novel and important directional attempt to move beyond the simple "LLM-as-a-judge" paradigm.
2. The paper conducts a broad evaluation of current mainstream models, covering 9 representative models from both open-source and proprietary domains. Its results clearly reveal the performance bottlenecks of existing unified multimodal models (especially open-source ones) on this task and experimentally quantify the effectiveness of "reasoning traces". These findings are of significant reference value in themselves.

Weaknesses:
1. Questionable Task Formulation: The formulation of some prompts is mismatched with the visual medium required for the answer. For "Why" questions that require explaining internal structures or causality (e.g., the Statue of Liberty example), a single static image is inherently insufficient to provide a complete explanation. The paper's own analysis using a "reasoning trace" (Figure 5) inadvertently demonstrates that a truly sufficient answer requires a more complex format, such as a multi-step visual narrative (i.e., multiple images) accompanied by detailed explanations, rather than a single static image. This exposes an internal contradiction in some of the task designs.
2. Lack of Methodological Transparency: The process for rubric creation is opaque. The paper states that rubrics are refined by "human experts" but provides no manual or guidelines that directed them, nor does it specify the experts' qualifications. This compromises the credibility and reproducibility of the scoring criteria. Information on the human validation is also insufficient: the validation was performed on only a 10% random sample of the data, which may be too small to draw robust conclusions for a 1,000-item benchmark. Furthermore, the professional backgrounds of the "human annotators" are not disclosed. The scatter plot in Figure 8, despite a moderate correlation ($\rho = 0.76$), shows significant variance in individual cases, which, combined with the small sample size, weakens the claim of the automated framework's reliability.
3. Limitations of the Evaluation Framework:
   - One-Size-Fits-All Evaluation: The paper applies a uniform LLM-judge framework across 8 vastly different domains, from "Art" tutorials to "Paper" diagrams. For highly structured and logical domains like "Diagram", this generic approach may fail to capture specific evaluation dimensions and does not incorporate domain-specific, rule-based metrics.
   - Evaluation Rigidity: For "guide" tasks that lack reference answers, the rigid requirement of a fixed number of steps may unfairly penalize creative or more effective answers that use a different structure.
   - Missing Analysis Dimensions: The results analysis focuses heavily on image-generation errors but lacks a concrete analysis of text-based errors and does not adequately discuss the evaluation results for the core metric of image-text consistency.

Questions:
1. Regarding Task Design: How did you ensure that the benchmark's prompts, particularly "Why" questions, are fundamentally answerable with a single image-text pair? Was there a screening criterion to determine the suitability of this format for tasks requiring process or internal structural explanations?
2. Regarding Methodological Transparency: Could you provide or elaborate on the manual/guidelines given to the "human experts" for refining the rubrics? Furthermore, what professional qualifications did these experts and the "human annotators" for the 10% validation possess, especially for specialized domains like "Paper" and "Diagram"?
3. Regarding the Evaluation Framework: Considering the significant differences between domains like "Art" and "Diagram", do you believe a single LLM-based framework is sufficient for fair and accurate evaluation across all tasks? Have you considered incorporating rule-based, domain-specific metrics for tasks with objective correctness criteria? Additionally, for the "guide" tasks, how does the scoring system account for high-quality responses that deviate from the prompted number of steps?
4. Regarding the Depth of Analysis: The analysis focuses on image-based failures. Could you provide some typical examples of text-based errors made by the models? Furthermore, how did the models perform specifically on the critical metric of image-text consistency, and can you share any quantitative analysis or illustrative case studies on this?

EditLens Prediction: Heavily AI-edited
UEval: A Real-World Benchmark for Unified Multimodal Generation

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces UEVAL, a challenging benchmark for evaluating unified multimodal generation models that produce both images and text in response to complex queries. Comprising 1,000 expert-curated prompts from 8 real-world domains (e.g., space, textbook, diagram, art), UEVAL requires models to generate interleaved multimodal outputs, addressing gaps in existing evaluations focused on VQA or text-to-image tasks. The authors propose a rubric-based scoring system in which LLMs generate initial rubrics from reference answers, refined by humans for reliability, resulting in 8.1K criteria. Experiments on 9 models reveal high difficulty, with GPT-5-Thinking scoring 66.6/100 and the best open-source model at 22.4/100, and highlight the benefits of reasoning traces for improving generation quality. Contributions include the benchmark dataset, the rubric framework, and insights into reasoning's role in multimodal tasks.

Strengths:
The paper demonstrates strong originality by formulating a novel benchmark for unified multimodal generation, creatively combining real-world scenarios with data-dependent rubrics to evaluate interleaved image-text outputs; this extends beyond traditional VQA or T2I paradigms and addresses a clear gap in assessing complex multimodal reasoning. In terms of quality, the benchmark's construction is rigorous, with 1,000 diverse prompts drawn from expert curation and validated references, supported by comprehensive experiments on both proprietary and open-source models that provide actionable insights, such as the impact of reasoning traces on non-reasoning models. Clarity is a highlight: the paper is well structured, with intuitive figures (e.g., Figure 1 illustrating task distributions) and detailed appendices that make the rubric-generation and evaluation pipelines easy to follow. Finally, the significance is evident in its potential to drive advancements in unified models, offering a scalable, reproducible evaluation framework that emphasizes the importance of reasoning, with the public release of data and code enhancing community impact.

Weaknesses:
One key weakness is the benchmark's limited scale and diversity: with only 1,000 prompts across 8 domains, it may not fully capture the breadth of real-world multimodal tasks, such as those involving dynamic video inputs or non-English languages. To improve, the authors could expand the dataset by incorporating user-generated prompts from broader sources such as social-media datasets and validate cross-cultural applicability. Another issue is the heavy reliance on proprietary LLMs (e.g., Gemini-2.5-Pro) for rubric generation and judging, which introduces potential biases and reproducibility challenges; a constructive step would be to include ablation studies using open-source alternatives and to quantify inter-model agreement with metrics beyond the reported Pearson correlation. Additionally, the analysis of reasoning benefits is preliminary and lacks deeper mechanistic insights, such as why open-source models like BAGEL fail to improve with transferred traces; future work could address this by conducting fine-grained error analyses or probing model internals to identify specific failure modes in multimodal token prediction.

Questions:
1. The paper emphasizes the benefits of reasoning traces for proprietary models but not open-source ones. Could you elaborate on potential architectural differences (e.g., tokenization strategies or multimodal fusion mechanisms in BAGEL vs. GPT-5) that might explain this disparity, and provide any additional experiments or hypotheses that could clarify whether this is due to training-data quality or model capacity? A detailed response here could strengthen the claim about reasoning's universality in unified models.
2. Your rubric-based evaluation shows strong human-LLM alignment (Pearson ρ = 0.76), but how robust is this to variations in the judge model? For instance, if you replaced Gemini-2.5-Pro with an open-source alternative such as LLaVA, would the scores change significantly, and what metrics (e.g., Cohen's kappa) did you compute to assess this? Addressing this could alleviate concerns about evaluation bias and enhance the framework's generalizability.
3. UEVAL excludes input images in prompts, focusing solely on text-based queries. How would incorporating visual inputs (e.g., for tasks like diagram completion or editing) affect model performance, and do you have preliminary data or plans to extend the benchmark in this direction? This could address a noted limitation and potentially reveal new insights into multimodal reasoning chains.

EditLens Prediction: Fully AI-generated
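As a concrete reference for the agreement metrics this review asks about (Pearson correlation and Cohen's kappa between the automated judge and human raters), here is a minimal sketch. It assumes per-sample 0-100 scores from both sides are already available; the example numbers and the quartile binning used for kappa are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-sample scores (0-100) from the automated judge and from humans.
judge_scores = np.array([72, 55, 90, 40, 63, 81, 35, 58])
human_scores = np.array([70, 60, 85, 45, 55, 78, 40, 62])

rho, _ = pearsonr(judge_scores, human_scores)
rank_rho, _ = spearmanr(judge_scores, human_scores)

# Cohen's kappa needs categorical labels, so bin the continuous scores first
# (quartile bins here; the binning scheme is an assumption, not from the paper).
bins = [0, 25, 50, 75, 100]
kappa = cohen_kappa_score(np.digitize(judge_scores, bins),
                          np.digitize(human_scores, bins),
                          weights="quadratic")

print(f"Pearson r={rho:.2f}, Spearman rho={rank_rho:.2f}, weighted kappa={kappa:.2f}")
```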
UEval: A Real-World Benchmark for Unified Multimodal Generation

Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

**Paper Summary**

The paper introduces a new multimodal benchmark that specifically focuses on tasks requiring the generation of both images and text for a query. The authors position their contribution as filling an existing gap in the realm of multimodal benchmarks, which typically take one of two forms: generating text conditioned on an image, or generating an image given a text description. The authors bootstrap strong closed-source multimodal models to generate rubrics based on reference image/text pairs for a given query. A set of human annotators then refines these rubrics further, and finally an LLM is again used to score the outputs against the reworked rubrics. The authors state that current SOTA models, like GPT-5-Thinking, show only average performance, indicating the difficulty of the benchmark. Further, the authors suggest that reasoning-based models are particularly good at such problems relative to non-reasoning ones, and demonstrate the value of adding reasoning traces to the prompt to boost performance.

**Strengths**

The paper attempts to cover multiple diverse domains of interest, broadly representative of the spectrum of questions that a multimodal model should be evaluated on. The problem definition and the identified gap in existing multimodal benchmarks are well motivated and relevant, and the approach appears useful to the broader goal of improving multimodal benchmarking. Figure 4 clearly explains the benchmark setup and illustrates the overall pipeline effectively. The work identifies an important gap in current multimodal model evaluation and proposes a benchmark that meaningfully targets that blind spot. The benchmark is also shown to be challenging for current frontier models (e.g., GPT-5-Thinking), which supports its validity as a difficult and discriminative test.

**Weaknesses**

• The authors fail to clearly explain where these datasets are obtained from. For example, the questions from "Space" are said to come from "real-world questions from online Q&A platforms". More rigorous question banks could be obtained from existing peer-reviewed datasets such as Astro-QA (Nature 2025).
• In the textbook section, do the authors generate the additional questions with GPT-5 for the same existing images in the TQA dataset, or are these other diagrams from additional sources?
• The authors mention that the guide task has no reference text/image, which makes the rubric-construction procedure more ambiguous, given that the multimodal model has no anchor point for generating the rubric. It is understandable that some questions may not have reference answers; however, the authors should explain in more detail how this case is handled separately from the other categories.
• There is little to no detail on how the human annotators were chosen, nor on the design of the annotation study: the number of human annotators, variance within annotations, selection criteria for annotators, etc.
• There are very minimal details and ablations on the exact models used to generate the rubric; only Gemini 2.5 Pro is mentioned as the rubric judge model. This raises concerns about the effect of model choices in both stages of the pipeline, as well as about better understanding the effects of these choices.
• A fair amount of space is used to demonstrate outputs of multiple different models and complete descriptions of the datasets, while little space is spent on Section 2.2, which is arguably the most significant part of the paper. It would be good to significantly expand on the exact setup. Are reference image/text and query pairs sourced together from online datasets, or are the queries generated conditioned on sampled reference images? (In certain cases, as in line 245, it appears that GPT-5 generates the query given a single reference image, whereas the TQA dataset has both the query and the reference answers as part of the dataset.)
• In line 297, the statement "rubrics are generated solely from the question prompt" is vague and requires more explanation, as the guide tasks form a major part of the benchmark. For instance, a few example rubrics for some of the guide tasks would help the reviewers better understand exactly what sort of rubrics are generated by the LLMs used in this study.
• In lines 300-301, it is also crucial to know what sort of human-annotator modifications are made in practice, and why.
• There is a lack of clarity on the role and setup of human annotations, which makes it difficult to assess what degree of human validation went into developing this benchmark.
• Different datasets have different forms of rubric generation: some have no reference answer at all (which makes it difficult to assess what a reasonable rubric should be), while others have either a given query and reference answer pair or a query generated from a sampled reference image.

**Questions**

• Are the answers to the questions sourced from social media generated by an LLM, or are they obtained directly from online answers, implying direct scraping of (query, image_text, image_answer) tuples? If so, how is the quality of these datasets or tuples assessed?
• Do the authors generate additional questions with GPT-5 for the same images in the TQA dataset, or are the diagrams drawn from other sources? If so, does each image in the TQA dataset have multiple corresponding questions, including synthetic ones?
• What does a typical rubric look like in the guide task? How are rubrics constructed when there is no reference text or image available?
• Were rubric modifications made by a single annotator or aggregated from multiple annotators?
• Are there ablations comparing different models for rubric generation and judging? How much variance arises when different models are used for rubric generation and scoring? This is necessary to assess how stable and fair such a benchmark setup would be in practice, and how sensitive it is to model choices.
• What exactly is meant by "rubrics are generated solely from the question prompt" [line 297]?
• What kinds of modifications are typically made by human annotators, and for what reasons?

EditLens Prediction: Fully human-written
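For the judge-sensitivity question raised above (how much variance arises when different models score the same outputs), a minimal sketch of the per-sample spread one could report is shown below; the judge names and scores are purely hypothetical.

```python
import numpy as np

# Hypothetical rubric scores (0-100) for the same 6 samples under three
# different judge models; the numbers are illustrative only.
scores_by_judge = {
    "judge_a": np.array([72, 55, 90, 40, 63, 81]),
    "judge_b": np.array([68, 60, 84, 48, 59, 77]),
    "judge_c": np.array([75, 50, 92, 35, 66, 85]),
}

matrix = np.stack(list(scores_by_judge.values()))        # shape: (judges, samples)
per_sample_spread = matrix.max(axis=0) - matrix.min(axis=0)
per_sample_std = matrix.std(axis=0)

print("max-min spread per sample:", per_sample_spread)
print("std per sample:", per_sample_std)
print("mean leaderboard score per judge:",
      {name: float(scores.mean()) for name, scores in scores_by_judge.items()})
```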