ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 4.00 | 5.00 | 4501 |
| Lightly AI-edited | 2 (50%) | 3.00 | 4.50 | 3616 |
| Fully human-written | 1 (25%) | 2.00 | 5.00 | 3600 |
| Total | 4 (100%) | 3.00 | 4.75 | 3833 |
Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding

Soundness: 1: poor
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
The paper introduces CHIMERA, a benchmark designed to test whether vision-language models actually understand diagrams or just rely on shortcuts. It includes thousands of Wikipedia diagrams annotated, using the Gemini model, with questions that probe different levels of understanding. The authors use the same model to filter out samples that are not accurate and visually grounded. By evaluating 15 major models, the authors conclude that strong performance often comes from exploiting memorized visuals, stored knowledge, or linguistic patterns rather than true comprehension.

Strengths:
- This paper presents a benchmark with an important goal: diagnosing whether models perform true visual understanding and reasoning or rely on other cues, which the authors call shortcuts. Such a benchmark is relevant and needed for performance evaluation; however, the process of building the dataset has fundamental flaws that undermine its value. I elaborate in the weaknesses.
- The presentation of the paper and the results is good and clear.
- The authors' observation of decreased performance on more complex tasks (Lines 370-377) is intriguing and supported by previous observations [1], where the same phenomenon is called the "easier-worse anomaly".

[1] Visual Graph Arena, ICML 2025

Weaknesses:
1) The most important weakness of the current work is the use of LLMs for generating the annotations and answers. Although automation is used in the literature for benchmark curation, the current work's goal specifically requires that no model be used to generate the ground-truth responses. One major flaw is the authors' use of the Gemini model for "discarding the examples if questions can be answered without the image" (Line 240). This filtering removes the very biases that models may have acquired from exposure to the data during training, which is exactly what the paper aims to assess. While time- and resource-consuming, a benchmark for this specific goal must be carefully created by human annotators. As a side note, one does not need 7,500 samples for this benchmark; 1,000 high-quality samples would suffice for the goal.
2) The hypothesis in Line 324 is wrong, and hence so is the experiment built on it to evaluate visual memorisation. Why do the authors expect models to perform better on the semantic modality, which is itself generated by another model from the original image? If anything, the image must contain more information to help the models answer the questions.
3) Another issue is the use of Wikipedia diagrams for the benchmark. Although the authors try to justify this (Line 88), it is notable that the third shortcut presented (the Clever-Hans shortcut), which is concluded to be the most prominent one, actually requires a benchmark of out-of-distribution diagrams to be fairly evaluated. Moreover, a benchmark generated to be out of distribution by design (for instance, with newly generated plots and diagrams from textual/tabular data) would clearly be a sound benchmark for evaluating memorisation.
(Newly generated diagrams -> not exposed to during training -> suitable to test memorisation shortcuts)

Questions:
1) Since the dataset is not provided, I ask the authors to add a few more examples of their samples to the appendix (like Figure 1).
2) Did you see the other anomaly observed by [1], the middle-score anomaly, in any of the tasks? For instance, in the tasks involving detecting colors (like the example in Figure 8)?
3) Since the benchmark contains a train and test split, have you trained (fine-tuned) any models on the training set? If not, why?

EditLens Prediction: Fully human-written
Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces CHIMERA, a benchmark for diagnosing shortcut learning in VLMs on diagram understanding tasks. The dataset provides 7,500 Wikipedia-sourced diagrams, automatically annotated with semantic triples and four levels of multiple-choice questions. The authors evaluate 15 open-source VLMs and claim their performance is largely driven by three types of shortcuts: visual memorization, knowledge recall, and Clever-Hans. While the work addresses a relevant problem, significant methodological and conceptual limitations undermine its contributions.

Strengths:
- The paper poses a good question, investigating why VLMs succeed and focusing on shortcut behaviors rather than just performance metrics.
- The four-level task hierarchy (ER → RU → KG → VR) provides a good framework for analyzing different aspects of diagram comprehension.
- Evaluating across three modalities (visual diagrams, semantic graphs, and textual descriptions) is a well-structured approach for isolating modality-specific biases.
- The dataset includes 7,500 diagrams with 20% human validation, demonstrating reasonable quality control efforts.

Weaknesses:
- The finding that VLMs rely on language priors and exhibit superficial pattern matching is well established in many prior works, such as VQA, IconQA, and "VLMs are biased". The paper does not sufficiently differentiate its contributions from these existing works.
- The near 2% gap between visual and semantic modalities is within noise margins and insufficient to claim memorization effects; the claim is weak and could change under a different experimental setup.
- ER performance itself consists of two steps: OCR text extraction and visual element extraction. It should be discussed and investigated which one drives the low performance. Moreover, successful reasoning fundamentally depends on accurate entity recognition. If ER fails on one of those steps, then subsequent reasoning operates on corrupted inputs, making high performance on KG/VR tasks despite poor ER performance inherently suspicious and indicative of shortcut behavior.
- The superior performance of the models on Wikipedia data may be caused by the models having been trained on, and having memorized, the same data.
- For the Clever-Hans shortcut, while the blank-image experiment is interesting, the analysis is incomplete. It needs an analysis of question-answer correlation biases; a comparison against a random or majority-class baseline beyond the 25% chance level (see the sketch after this review); and an analysis of whether performance correlates with linguistic features of the questions.
- The paper is purely diagnostic, with no proposed methods to mitigate the identified shortcuts. As an example, a work published in ICLR 2025, "Chain-of-Region", used the OpenCV library to extract the visual data and feed it as text to the models.

Questions:
- How does CHIMERA differ from IconQA, ChartQA, and ERBench? These benchmarks also decompose visual reasoning into levels and test chart/diagram understanding. What is the unique contribution of CHIMERA compared to those works?
- If entity recognition provided accurate diagram content to the model, wouldn't reasoning tasks become trivial? An ablation study could address this by showing whether fixing ER errors improves downstream task performance.

EditLens Prediction: Lightly AI-edited
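A minimal sketch of the majority-class baseline comparison requested in the Clever-Hans weakness above, assuming the gold option labels of the blank-image questions can be collected into a list (the function name and the labels below are placeholders, not benchmark data):

```python
from collections import Counter

def majority_class_baseline(gold_answers):
    """Accuracy of a trivial predictor that always outputs the most frequent option."""
    counts = Counter(gold_answers)
    option, count = counts.most_common(1)[0]
    return option, count / len(gold_answers)

# Hypothetical usage: gold_answers would be read from the benchmark's answer key
# for the ER questions used in the blank-image experiment.
gold_answers = ["A", "C", "B", "C", "D", "C", "A", "B"]  # placeholder labels
option, acc = majority_class_baseline(gold_answers)
print(f"Majority option {option}: baseline accuracy {acc:.1%} (uniform chance is 25%)")
```

If the answer options are not perfectly balanced, this majority-class floor, rather than the uniform 25% chance level, would be the more informative reference point for the blank-image accuracies.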
Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding

Soundness: 1: poor
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces Chimera, a large-scale benchmark for evaluating visual-language models (VLMs) on diagram comprehension. The authors argue that current benchmarks overestimate model understanding because they fail to detect shortcut behaviors. CHIMERA consists of 7,500 Wikipedia diagrams annotated with semantic triples and four levels of comprehension questions (entity recognition, relation understanding, knowledge grounding, and visual reasoning). Using this benchmark, the authors evaluate 15 open-weights VLMs across 7 model families, analyzing three shortcut types: visual-memorization, knowledge-recall, and Clever-Hans. Their findings suggest that Clever-Hans shortcuts significantly influence model performance, while visual-memorization and knowledge-recall shortcuts have smaller to moderate effects.

Strengths:
- Addresses an important gap in multimodal reasoning evaluation, distinguishing true comprehension from shortcut exploitation.
- Proposes a structured framework for diagram understanding grounded in semiotic theory.
- Builds a large-scale dataset (Chimera) with hierarchical question design, enhancing diagnostic evaluation.
- Evaluates a broad range of open-weights VLMs, offering valuable comparative insights.
- The identification of three shortcut types provides a clear conceptual taxonomy.

Weaknesses:
- **Human evaluation and visual dependency**: The authors conduct a human evaluation on 300 diagrams to assess visual dependency, QA correctness, and triple completeness. However, they do not report inter-annotator agreement, and standard deviations across categories suggest significant variability. This raises concerns about the reliability of the visual-dependency measure, which is critical for assessing the Clever-Hans shortcut.
- **Ambiguity in fine-tuning**: The paper does not specify whether the 15 evaluated models were fine-tuned on CHIMERA. Fine-tuning could strongly influence performance and shortcut measurement, while zero-shot evaluation might reveal different behaviors.
- **Statistical rigor in shortcut measurement**: The visual-memorization shortcut is claimed based on a small mean difference (~2%) between original diagrams and visualized triples. No standard deviations or significance tests are reported. Given that human evaluation found triples to be only 74–86% fully sufficient, this small difference may reflect dataset quality rather than model memorization.
- **Task difficulty not human-validated**: Knowledge-recall shortcuts are inferred from ER (Entity Recognition) vs. other tasks, assuming ER questions should be easier for a VLM to answer. However, no human study confirms this relative difficulty. Automatically generated questions may introduce artifacts, so the 5% difference may not reliably indicate shortcut behavior. This could be assessed with a human evaluation in which annotators answer a sample of questions. If human performance does not align with VLM results on ER vs. other tasks, it could reveal the influence of the knowledge-recall shortcut.
- **Clever-Hans overstatement**: Clever-Hans shortcuts are measured only on ER questions. Larger models show minimal differences when diagrams are removed, suggesting that the observed effect is primarily driven by smaller models. Therefore, the claim that all VLMs suffer from Clever-Hans may overgeneralize.
- **Unusual paper structure**: The paper merges introduction, related work, dataset description, and results into a single section, which makes it harder to clearly follow the flow of motivation, prior work, and methodology.

**Questions for the Authors**
1. Were any of the evaluated models fine-tuned on CHIMERA? If yes, please describe the training procedure and specify whether comparisons include both fine-tuned and zero-shot results.
2. What was the inter-annotator agreement for the 300 manually evaluated diagrams?

**Actionable Feedback**
1. Conduct a human study to verify that ER questions are indeed easier than the other three tasks. This would strengthen claims about knowledge-recall shortcuts.
2. Provide standard deviations or confidence intervals to assess reliability (a sketch of such a check follows this review). Consider analyzing fully sufficient vs. partially insufficient triples separately.
3. Discuss potential biases introduced by automatically generated questions (Gemini) and how they may influence shortcut detection.
4. Consider reorganizing the manuscript to separate introduction, related work, dataset description, and results for better readability and clarity.

EditLens Prediction: Moderately AI-edited
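A minimal sketch of the kind of significance check suggested in Actionable Feedback item 2, assuming per-item 0/1 correctness vectors are available for the two conditions being compared (the arrays and sizes below are synthetic placeholders, not results from the paper):

```python
import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for the accuracy gap between two paired
    conditions (e.g. original diagrams vs. visualized triples), resampling items."""
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    n = len(a)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample test items with replacement
        gaps[i] = a[idx].mean() - b[idx].mean()
    lo, hi = np.percentile(gaps, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return a.mean() - b.mean(), (lo, hi)

# Placeholder correctness vectors; real ones would come from per-item model outputs.
visual = np.random.default_rng(1).binomial(1, 0.86, size=1500)
semantic = np.random.default_rng(2).binomial(1, 0.84, size=1500)
gap, (lo, hi) = paired_bootstrap_ci(visual, semantic)
print(f"Accuracy gap {gap:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If the resulting interval contains zero, the ~2% difference would not support the memorization claim.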
Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
CHIMERA is a visual benchmark suite designed to evaluate vision-language models (VLMs) on diagram comprehension. It contains 6,000 training and 1,500 test diagrams. The authors identify three shortcut behaviors commonly exhibited by VLMs: (1) visual information memorization, (2) the knowledge-recall shortcut, and (3) the Clever-Hans shortcut. The latter two primarily arise from the models' reliance on linguistic cues rather than genuine visual understanding. To address this, each CHIMERA entry integrates three modalities: visual, semantic, and textual. This enables a comprehensive assessment of model behavior. The benchmark defines four task levels: entity recognition, relation understanding, knowledge grounding, and visual reasoning. The authors evaluate 15 VLMs, revealing that much of their performance can be attributed to language bias rather than true multimodal reasoning. In particular, the Clever-Hans shortcut experiment exposes cases where models achieve high accuracy even when the visual input is omitted entirely.

Strengths:
• The authors identify three prominent shortcut behaviors and base their analysis on these.
• The findings from the Clever-Hans shortcut experiment are interesting, demonstrating that models do not utilize information from images for visual questions.
• The construction of the test suite is well explained: the authors describe starting with Wiki Web2M and ultimately filtering 7,500 instances for CHIMERA through a semi-autonomous process.
• The paper is well-structured, clearly written, and includes appropriate figures and tables.

Weaknesses:
• Most models achieve high scores (above 80%) across the majority of tasks. This raises the question of whether the primary purpose of the benchmark is simply to show that models rely more heavily on the textual modality. Since the questions themselves do not appear to pose substantial challenges, the benchmark's diagnostic value seems limited.
• The authors claim that it is surprising that models perform better on the visual modality than on the semantic modality (lines 334-338). However, given that most training data heavily represent the visual and textual modalities, with comparatively limited exposure to semantic diagrams, such results are expected rather than surprising.
• LLaMA3.2 and Gemini were used to construct the dataset, while the evaluation includes models from the LLaMA3.2 and Gemma3 families (the latter also developed by Google). This overlap introduces a potential source of bias.
• The authors show in Figure 4 and Table 2 that there is a 6–8% gap between the textual modality and the visual or semantic modalities, concluding that models perform best on text (line 333). However, I believe this conclusion is influenced by a few weaker models. After recalculating Table 2 using the data from Table 5 in the Supplementary, excluding LLaMA3.2-11B, LLaVA1.6-7B, LLaVA1.6-13B, and BLIP3-4B, the observed gap becomes much smaller, suggesting that the overall trend may not be as pronounced as reported. The recalculated average scores are: Visual - 90.7, Semantic - 89.3, and Linguistic - 93.3. As observed, the gaps become much less prominent, under 3% between the visual and textual modalities (a sketch of this recalculation follows this review). This raises the question of whether the authors' reported findings are broadly generalizable or primarily driven by a few underperforming models.
• The bottom part of Figure 3 requires more clarity. Also, at first glance, the Human Evaluation component is not clear.
• In Table 1, Annotator C's assessments show high percentages for "visual dependency – partially dependent" and "triple completeness – marginally insufficient". What is the reason for this? The difference from the other annotators is not minor.
• The authors present model-wise performance in Table 5 of the Supplementary. I recommend that they include a concise visual summary of this information, or at least aggregate the results by model family within the main paper.

EditLens Prediction: Lightly AI-edited
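A minimal sketch of the recalculation described in the bullet above, assuming per-model modality scores can be organized as a mapping; the numbers and the non-excluded model names are placeholders, not values from Table 5:

```python
def modality_means(scores, exclude=()):
    """Average each modality's score over the models not in the exclusion set."""
    kept = {model: s for model, s in scores.items() if model not in exclude}
    modalities = next(iter(kept.values())).keys()
    return {m: round(sum(s[m] for s in kept.values()) / len(kept), 1) for m in modalities}

# Placeholder per-model scores; real values would be taken from Table 5 in the Supplementary.
scores = {
    "StrongModel-A": {"visual": 91.0, "semantic": 89.5, "textual": 93.5},  # hypothetical
    "StrongModel-B": {"visual": 90.5, "semantic": 89.0, "textual": 93.0},  # hypothetical
    "LLaVA1.6-7B":   {"visual": 72.0, "semantic": 74.0, "textual": 85.0},  # placeholder
    "BLIP3-4B":      {"visual": 70.0, "semantic": 71.0, "textual": 83.0},  # placeholder
}
weak = {"LLaMA3.2-11B", "LLaVA1.6-7B", "LLaVA1.6-13B", "BLIP3-4B"}
print(modality_means(scores))                 # averages over all models
print(modality_means(scores, exclude=weak))   # averages excluding the weaker models
```

Comparing the two printed dictionaries makes explicit how much of the reported modality gap is carried by the excluded models.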