ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (50%) | 4.00 | 4.00 | 3724 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 2 (50%) | 5.00 | 4.00 | 2304 |
| Total | 4 (100%) | 4.50 | 4.00 | 3014 |
**Title:** Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models

**Ratings:** Soundness: 3: good. Presentation: 2: fair. Contribution: 3: good. Rating: 4: marginally below the acceptance threshold. Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary:** This paper investigates why current vision-language models (VLMs) struggle to understand data visualizations such as scatter plots. The authors introduce FUGU (Fundamentals of Graph Understanding), a new diagnostic benchmark designed to systematically test the foundational spatial and mathematical reasoning skills necessary for chart interpretation, including counting, locating points, measuring distances, finding extrema, and computing means; further tasks cover correlation, clusters, functions, and outliers. Using three representative VLMs (LLaMA-3.2, LLaVA-OneVision, and InternVL3), the study combines behavioral evaluation with activation patching and linear probing to trace information flow through model components.

**Strengths:**
- **Clear motivation and positioning:** The paper addresses a relevant and timely question regarding the ability of VLMs to understand charts and data visualizations. The research problem is articulated clearly, focusing on identifying where failures in chart understanding may originate.
- **Well-scoped contributions:** The proposed FUGU tasks are thoughtfully designed and cover a useful range of chart-understanding skills (e.g., counting, coordinates, extrema). The use of both causal interventions and linear probes provides an informative way to study model behavior beyond surface-level accuracy.
- **Comprehensive experimental setup:** The experiments include several representative models (e.g., LLaMA-3.2, LLaVA-OneVision, InternVL3) and explore different task conditions. The findings offer a plausible explanation that the vision–language interface poses a key challenge, supported by a range of analyses. The visualizations and ablations help illustrate the results.
- **Clarity and reproducibility:** The paper is generally well organized and provides sufficient implementation detail, including appendices that describe task construction and probe configurations. This level of transparency should support reproducibility and future research on multimodal reasoning.

**Weaknesses:**
- **Limited coverage of visualization types:** The analysis focuses mainly on Cartesian point-based charts (e.g., line and bar charts), where positional relationships naturally reflect values. However, for non-Cartesian visualizations such as pie charts or radar charts, angular information is equally critical. The current framework does not appear to account for these cases, limiting its generality across broader visualization types.
- **Insufficient consideration of visual encoder scale and capacity:** The conclusion that visual encoders preserve nearly 100% of the relevant information is based on a single encoder configuration. The impact of encoder size, architecture, and pre-training strategy on this finding is not examined. Incorporating comparisons across encoder scales would strengthen the validity of this conclusion.
- **Limited mechanistic insight into the vision–language bottleneck:** While the work identifies the vision–language interface as the main source of degradation, the analysis does not delve into *why* this interface fails. Potential factors, such as attention bottlenecks, misalignment in cross-modal token fusion, or representational loss during projection, are not explored. A more detailed mechanistic investigation would make the diagnosis more robust.

**Questions:**
- **Nature of the vision–language bottleneck:** The paper attributes the core failure to the vision–language interface, but the underlying mechanism remains unclear. Could the authors elaborate on which aspects of the attention or projection layers may cause information loss at this stage? Additionally, if information is already lost at this layer, how do the observed performance fluctuations across subsequent layers arise?
- **Encoder size and representation capacity:** The results suggest that positional information is fully preserved across the tested vision encoders. How sensitive is this finding to encoder scale or architecture? Would smaller or differently pre-trained encoders yield similar preservation patterns?
- **Scope of FUGU task diversity:** The benchmark focuses on scatter plots in Cartesian coordinates. Do the authors expect comparable failure behaviors in charts with non-Cartesian or hierarchical structures (e.g., polar plots, radar charts, treemaps)? Clarification on how the diagnostic approach would generalize to such cases would be helpful.
- **Implications for future model design:** Since fine-tuning does not appear to resolve the bottleneck, have the authors considered alternative or hybrid architectural directions that could mitigate the issue? Insight into how these findings might inform future multimodal model design would strengthen the discussion.

**EditLens Prediction:** Fully AI-generated
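As background for the linear probing referenced in this and the following reviews, a minimal sketch of a layer-wise linear probe is shown below. The function name, data shapes, and use of scikit-learn are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_layer(activations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen activations cached from one layer.

    activations: (n_samples, hidden_dim) features from the layer of interest.
    labels: (n_samples,) discretized property to decode (e.g., x-coordinate bin).
    Returns held-out probe accuracy, a proxy for linear decodability.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)


# Comparing probe accuracy layer by layer localizes where information degrades,
# e.g. {name: probe_layer(acts, labels) for name, acts in cached_activations.items()}
# where `cached_activations` is a hypothetical dict of per-layer feature matrices.
```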
**Title:** Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models

**Ratings:** Soundness: 2: fair. Presentation: 2: fair. Contribution: 2: fair. Rating: 4: marginally below the acceptance threshold. Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary:** This paper investigates why modern vision-language models (VLMs) fail to understand data visualizations, arguing that it is unclear whether the fault lies in the visual encoder, the language module, or their interface. To diagnose this, the authors introduce FUGU, a new benchmark of fine-grained "unit tests" for chart-understanding capabilities such as extracting data points or calculating statistics. Evaluating three VLMs (LLaMA-3.2, LLaVA-OneVision, InternVL3), they find poor performance that degrades rapidly as the number of data points increases. Using diagnostic tools such as linear probes and activation patching, the authors pinpoint a key bottleneck: while the visual encoder successfully captures low-level information (like coordinates), this information is "scrambled" or lost at the vision-language connector and in the early layers of the language model. This architectural flaw, rather than a failure of visual perception, is the primary source of error and persists even after fine-tuning. The work's primary limitations are its focus on synthetic scatter/bar charts and the computational cost of its diagnostic methods.

**Strengths:**
1. Introduces FUGU, a diagnostic benchmark designed to "unit test" the fine-grained capabilities of VLMs on data visualizations.
2. Provides a clear and localized diagnosis of VLM failures, identifying the vision-language connector and early LM layers as the primary bottleneck.
3. The work appears reproducible due to clear descriptions of the FUGU benchmark tasks and the diagnostic methods.

**Weaknesses:**
* The FUGU benchmark's scope is currently narrow, focusing on synthetic scatter plots (Sec 3.1). This makes it unclear whether the findings and the identified bottleneck generalize to other common chart families (e.g., bar charts, line graphs, histograms) or to more complex, real-world visualizations with varied aesthetics, occlusions, or multiple panels.
* The paper's contribution is primarily diagnostic. While it successfully identifies the location of the information bottleneck (Sec 5.3), it does not proceed to propose or empirically test any architectural solutions or mitigation strategies to address this flaw.
* The evaluation relies on a judge prompt and regex-based checks for scoring (Sec 4.1). While the high agreement rate is noted, the lack of a human-rated validation subset with inter-rater agreement means the robustness of the automated scoring is not fully confirmed.
* The tasks within FUGU (e.g., counting, position, distance, mean) appear to rely heavily on a single core skill: accurately extracting all (x, y) coordinates. This might over-weight one specific failure mode rather than testing a diverse set of reasoning capabilities.

**Questions:** N/A

**EditLens Prediction:** Fully AI-generated
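For context on the regex-based scoring this review questions (Sec 4.1 of the paper), a hypothetical sketch of such a check follows. The pattern, tolerance, and function name are illustrative assumptions, not the paper's actual implementation.

```python
import re

# Matches signed integers and decimals in free-form text.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def score_numeric_answer(model_output: str, gold: float, rel_tol: float = 0.05) -> bool:
    """Extract the last number in a model's free-form response and compare it
    to the gold value within a relative tolerance."""
    numbers = NUMBER.findall(model_output)
    if not numbers:
        return False
    prediction = float(numbers[-1])
    return abs(prediction - gold) <= rel_tol * max(abs(gold), 1.0)


# Example: score_numeric_answer("I count 7 points in the upper cluster.", 7) -> True
```

The reviewer's concern is that a check like this, without a human-validated subset, cannot confirm how often the extraction step itself misreads the model's intended answer.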
**Title:** Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models

**Ratings:** Soundness: 3: good. Presentation: 2: fair. Contribution: 3: good. Rating: 6: marginally above the acceptance threshold. Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary:** The paper introduces FUGU, a controlled benchmark for diagnosing where VLMs fail to understand basic charts. FUGU focuses on highly controlled, synthetic scatter-plot tasks designed to probe models' abilities to extract and reason about quantitative information at multiple levels of complexity. The paper systematically evaluates three widely used VLM architectures (LLaMA-3.2, LLaVA-OneVision, InternVL3) using behavioral analysis, causal interventions (activation patching), and linear probes to track information flow and identify bottlenecks. The key findings implicate the hand-off between vision encoders and language modules as a major source of failure, rather than the underlying representation or reasoning capacities of either module in isolation.

**Strengths:**
+ The dataset design is simple, clear, and well controlled, allowing a clean analysis of specific model behaviors and error sources.
+ The combination of multiple analysis methods provides different perspectives for assessing model behavior.
+ The conclusion drawn from the linear-probe experiments is interesting and convincing.

**Weaknesses:**
- The dataset scope is narrow. FUGU focuses on scatter plots with limited data points, no occlusion, and fixed glyphs. Real charts usually include bars/lines, partial occlusion, diversely scaled axes, and additional legends/annotations. It remains unclear whether the experimental results and conclusions would still hold for other chart types or in-the-wild data.

**Questions:**
- How robust are the conclusions if the scatter plot contains more points and the axes are non-integer?

**EditLens Prediction:** Fully human-written
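As background for the causal interventions (activation patching) mentioned in the reviews, a rough, generic sketch using PyTorch forward hooks is shown below. The two-pass structure, the assumption that the hooked module returns a single tensor, and the matching input shapes are simplifications for illustration, not the paper's setup.

```python
import torch


@torch.no_grad()
def run_with_patched_layer(model, layer, clean_inputs, corrupted_inputs):
    """Cache `layer`'s output on a clean input, then rerun a corrupted input
    while substituting the cached activation, to measure how much of the
    clean behavior that single layer restores."""
    cache = {}

    def save_hook(module, args, output):
        cache["clean"] = output          # pass 1: record the clean activation

    def patch_hook(module, args, output):
        return cache["clean"]            # returning a value overwrites the output

    handle = layer.register_forward_hook(save_hook)
    model(**clean_inputs)                # forward pass on the clean stimulus
    handle.remove()

    handle = layer.register_forward_hook(patch_hook)
    patched_output = model(**corrupted_inputs)  # forward pass with the patch applied
    handle.remove()
    return patched_output
```

Sweeping `layer` over connector and language-model blocks and comparing the patched outputs against the clean and corrupted baselines is what lets this kind of analysis localize a bottleneck.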
**Title:** Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models

**Ratings:** Soundness: 3: good. Presentation: 3: good. Contribution: 2: fair. Rating: 4: marginally below the acceptance threshold. Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary:** This paper investigates the bottleneck in Vision-Language Models (VLMs) for data visualization understanding. To this end, the authors introduce FUGU, a data visualization understanding benchmark consisting of synthetic scatter plots and 9 tasks (5 basic and 4 ensemble). They evaluate three modern VLMs with mechanistic interpretability techniques (activation patching and linear probes). Based on the evaluation results, they argue that the bottleneck is caused by the vision-language handoff.

**Strengths:**
• The flexibility and granularity of FUGU (controlled visual attributes and variable plot complexity) enable precise probing of model weaknesses. The experiments are thorough, comparing three contemporary VLMs and systematically evaluating their performance across different visualization complexities and tasks.
• The paper is well written and clearly presents the motivation and experimental findings.

**Weaknesses:**
- Limited generalizability of the dataset: Because FUGU comprises only scatter plots, the generalizability of the paper's conclusions is limited. Real-world data visualizations are more complex and diverse, incorporating various chart types (e.g., bar, line, pie charts), dense textual annotations, legends, and visual noise. The bottlenecks identified in this study may be specific to FUGU. It is important to evaluate whether the bottlenecks vary by visualization type and scale with both visualization complexity (from simple charts to rich infographics) and question complexity (from basic queries to complex data-insight queries such as those in ChartQAPro).
- The nature of the identified bottleneck seems task-dependent: The claim of a VLM architectural bottleneck is weakened by the fact that this bottleneck is task-dependent. In Section 4.6, one experiment demonstrates that providing ground-truth coordinates harms performance on ensemble tasks. This suggests that the language part of VLMs probably faces a bottleneck in this context. Thus, the bottleneck is not a fixed architectural limitation but a task-specific capacity mismatch between model components and task demands. This weakens the claim of a fundamental architectural limitation and shifts the contribution from identifying a fundamental architectural constraint to characterizing a capacity limitation in current VLMs, which is a more modest result than the paper suggests.
- Results on fine-tuning do not fully support the claim: In Sections 3.1 and 4.7, the authors claim that FUGU is extremely difficult for current VLMs and cannot be fully solved directly through fine-tuning. However, the fine-tuned InternVL3 14B performs very well on simple tasks (count 100, position 99.2). This suggests, first, that the bottleneck may shift after fine-tuning, and second, that if a 14B model can solve simple tasks with fine-tuning, larger fine-tuned models may also handle additional FUGU tasks.

**Questions:**
- Why are the values for Gemini 2.5 Pro missing from Tables 2 and 3?
- Considering the good performance of Gemini 2.5 Flash in Tables 2 and 3, is FUGU not difficult for Gemini 2.5 Pro?
- line 150: VLMS -> VLMs

**EditLens Prediction:** Fully human-written
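To make concrete the kind of controlled synthetic stimuli the reviews describe (integer-axis scatter plots with exact ground truth for counting, mean, and extremum tasks), here is a hypothetical generation sketch. The sampling ranges, rendering choices, and returned task labels are illustrative assumptions, not the authors' generation code.

```python
import numpy as np
import matplotlib.pyplot as plt


def make_scatter(n_points: int, seed: int = 0, path: str = "chart.png") -> dict:
    """Render a plain scatter plot with integer coordinates and return the
    ground truth needed to score simple chart-reading tasks."""
    rng = np.random.default_rng(seed)
    points = rng.integers(0, 11, size=(n_points, 2))   # x, y drawn from 0..10

    fig, ax = plt.subplots(figsize=(4, 4))
    ax.scatter(points[:, 0], points[:, 1])
    ax.set_xticks(range(11)); ax.set_yticks(range(11))
    ax.set_xlabel("x"); ax.set_ylabel("y")
    fig.savefig(path, dpi=150)
    plt.close(fig)

    return {
        "count": n_points,                                           # counting task
        "mean_y": float(points[:, 1].mean()),                        # mean task
        "rightmost_point": points[points[:, 0].argmax()].tolist(),   # extremum task
    }
```

Varying `n_points` and switching to non-integer axes in a generator like this is also how the robustness questions raised by the third review could be tested.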