ICLR 2026 - Reviews



Summary Statistics

EditLens Prediction   | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated    | 0 (0%)   | N/A        | N/A            | N/A
Heavily AI-edited     | 0 (0%)   | N/A        | N/A            | N/A
Moderately AI-edited  | 2 (40%)  | 5.00       | 3.50           | 2532
Lightly AI-edited     | 1 (20%)  | 8.00       | 3.00           | 1836
Fully human-written   | 2 (40%)  | 5.00       | 3.50           | 3602
Total                 | 5 (100%) | 5.60       | 3.40           | 2821
Reviews
Review 1
Title: SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
Soundness: 2: fair | Presentation: 2: fair | Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain.
EditLens Prediction: Moderately AI-edited

Summary:
The paper introduces SpinBench, a cognitively grounded benchmark for evaluating spatial reasoning in VLMs. It decomposes perspective-taking into seven diagnostic categories (e.g., identity matching, dynamic rotation, mental rotation). Evaluating 37 VLMs, it identifies systematic biases (egocentric preference, rotation failure, inconsistency under symmetry) and reports correlations between human response times and model accuracy, arguing that SpinBench captures cognitively meaningful difficulty.

Strengths:
1. **Testing 37 VLMs**, including proprietary and open-source models, offers unusually broad coverage.
2. **Controlled variations**: The introduction of controlled task variations, particularly the "with-premise" vs. "without-premise" conditions and the allocentric vs. egocentric reference frames, is a significant strength. These variations allow a fine-grained disentanglement of visual grounding failures from spatial reasoning failures.
3. **Human-VLM correlation**: The finding that human response time correlates strongly with VLM inaccuracy (Fig. 5b) provides strong validation for the benchmark's design, suggesting it measures a fundamental and shared aspect of spatial reasoning difficulty.

Weaknesses:
1. Prior work already targets spatial cognition broadly (SPACE), mental modeling (MindCube), and multi-perspective localization (ViewSpatial). The distinctive contributions here are FoR-controlled augmentations and pairwise consistency checks, but the manuscript underplays direct, quantitative comparisons and ablations demonstrating why SpinBench exposes distinct phenomena beyond these benchmarks. Provide head-to-head overlap/orthogonality analyses, cross-benchmark correlations, and case studies where SPACE/MindCube/ViewSpatial fail to reveal the same error mode.
2. Restricting all relations to a 2D horizontal plane undermines the generalizability of "spatial reasoning." Without vertical or containment relations, the benchmark measures planar spatial alignment, not full 3D reasoning.
3. The model set omits several state-of-the-art commercial VLMs (e.g., GPT o3/5, Gemini 2.5 Pro), which weakens the headline claim about "current VLMs," since these models are considerably more robust and capable.

Questions:
I am curious about the true value of the dataset, as I strongly suspect that it may merely overfit to its own format. If a base model (e.g., Qwen-VL) could be fine-tuned on this dataset and subsequently demonstrate performance gains on other benchmarks (such as VSI-Bench, OmniSpatial, and SPACE), I would be much more inclined to recognize the dataset's contribution.
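A minimal sketch of the cross-benchmark correlation analysis this review asks for, assuming per-model accuracies are already collected on SpinBench and on one comparison benchmark (the model names and numbers below are placeholders, not reported results):

```python
# Rank-correlate per-model scores across two benchmarks to gauge overlap vs. orthogonality.
import numpy as np
from scipy.stats import spearmanr

models = ["model-a", "model-b", "model-c", "model-d", "model-e"]
spinbench_acc   = np.array([0.62, 0.55, 0.41, 0.33, 0.58])  # assumed placeholder values
viewspatial_acc = np.array([0.70, 0.60, 0.52, 0.45, 0.66])  # assumed placeholder values

rho, p = spearmanr(spinbench_acc, viewspatial_acc)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# High rho: the benchmarks rank models similarly (overlapping signal).
# Low rho: SpinBench likely probes a distinct capability (orthogonal signal).
```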
Review 2
Title: SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
Soundness: 2: fair | Presentation: 4: excellent | Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain.
EditLens Prediction: Fully human-written

Summary:
The paper introduces SpinBench, a new diagnostic benchmark for evaluating spatial reasoning in VLMs. The benchmark is cognitively grounded, designed around the core challenge of perspective-taking, which is the ability to reason about how scenes and object relations change under viewpoint transformations. SpinBench decomposes this complex skill into seven fine-grained task categories. These tasks are progressively structured, scaffolding from simpler single-object challenges to more demanding multi-object perspective-taking scenarios. A key contribution is the use of controlled variations to probe model robustness. The authors evaluate 37 state-of-the-art VLMs. The results reveal systematic weaknesses in current models, such as a strong egocentric bias, a poor understanding of rotation, and high levels of logical inconsistency. The study also includes a human baseline (91.2% accuracy) and finds that task difficulty for humans strongly correlates with VLM accuracy, validating the benchmark's focus on fundamental cognitive challenges.

Strengths:
+ While many spatial reasoning benchmarks exist, SpinBench is novel in its cognitively grounded approach, systematically decomposing the complex, high-level skill of perspective taking into a progressive hierarchy of 7 diagnostic sub-skills. The explicit testing of allocentric vs. egocentric frames, visual grounding vs. linguistic inference, and logical consistency provides a multi-dimensional, surgical tool for diagnosing model failures, which is a unique combination not present in prior work.
+ The benchmark is well motivated by foundational work in cognitive science. The data curation is thoughtful, drawing from four diverse domains (synthetic indoor scenes, real-world objects, cars, and faces).
+ The findings are specific, insightful, and actionable: the discovery of a strong egocentric bias, the fact that models fail even at purely linguistic spatial reasoning, and the high degree of logical inconsistency all point to fundamental gaps in current models.
+ The paper is overall well presented and clearly formulates the motivation.

Weaknesses:
+ The benchmark contains only 2,599 samples in total, distributed across 51 distinct task subtypes. This means that some of the fine-grained subtypes have very few samples, and the reliability of performance metrics on such small test sets is questionable. A larger sample size for these specific subtypes would strengthen the conclusions. For example, the Mental Rotation task is presented as a key cognitive inspiration for the benchmark in the introduction, yet this category seems underdeveloped compared to the others: it contains only 78 samples.
+ Despite the evaluation covering diverse VLMs, it remains unclear what the upper-bound performance on SpinBench is for existing VLMs; that is, SOTA models like GPT-5 or Gemini 2.5 Pro are not tested.

Questions:
1. I am concerned about the low sample size for some of the 51 subtypes. How did you ensure the statistical reliability of the results for these specific subtypes?
2. The scaling analysis in Figure 6 is interesting, particularly the emergent capability in Identity Matching. Did you observe similar emergent patterns in the most complex tasks, like Perspective Taking and Mental Rotation? Or did performance on those tasks scale more smoothly, even if starting from a very low, chance-level baseline?
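To illustrate the subtype-size concern raised in this review, here is a small sketch of the uncertainty attached to an accuracy estimated on 78 items, using a 95% Wilson score interval; the example counts are assumed, not taken from the paper:

```python
# 95% Wilson score confidence interval for a binomial accuracy estimate.
import math

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the 95% Wilson score interval for correct/n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Example: a model answering 47 of the 78 Mental Rotation items correctly (~60%).
lo, hi = wilson_interval(47, 78)
print(f"accuracy = {47/78:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")  # roughly [0.49, 0.70]
```

An interval roughly 20 points wide makes fine-grained comparisons between models on such subtypes hard to interpret, which is the substance of the reviewer's question.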
Review 3
Title: SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
Soundness: 2: fair | Presentation: 2: fair | Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment.
EditLens Prediction: Fully human-written

Summary:
The authors introduce SpinBench, a spatial reasoning benchmark for VLMs. SpinBench measures the perspective-taking abilities of VLMs through seven tasks: six subtasks representing skills required to accomplish the seventh, perspective-taking, task. They conducted extensive experiments on their benchmark using 37 VLMs, finding that VLMs exhibit egocentric bias in spatial reasoning and showing that the difficulty of each SpinBench problem correlates with human difficulty in solving the same problem.

Strengths:
* The paper provides a comprehensive evaluation of 37 VLMs of varying sizes, including two reasoning models.
* The paper shows that VLMs fail in perspective-taking when performing spatial reasoning tasks.
* The consistency metric reveals an interesting spatial reasoning bias in VLMs, and the connection between consistency and spatial reasoning accuracy is interesting.
* A detailed and comprehensive description of the construction of the benchmark is included.

Weaknesses:
* While the observation that VLMs exhibit egocentric bias in spatial reasoning is interesting, the general observation that VLMs exhibit egocentric bias is not novel, as observed by [1,2].
* Related to the previous point, it is unclear what is gained by developing the seven categories used in SpinBench. The authors motivate the seven categories based on six basic spatial reasoning abilities which scaffold towards perspective-taking. However, it is unclear what useful conclusions can be drawn by devising the benchmark in this way. Experiments showing that improving the six simpler diagnostic abilities leads to better perspective-taking ability would emphasize the usefulness and necessity of the benchmark. As it stands, I am not convinced of the practical usefulness of SpinBench.
* The results on CoT reasoning are inconclusive: Cosmos-Reason1-7B significantly benefits from CoT, whereas SpaceOm only enjoys moderate benefits, and SpaceThinker shows degraded performance on perspective-taking without premise.
* While the proposed benchmark shows some interesting observations and correlations, my main concern is that the benchmark does not indicate the direction of causality: the six simpler reasoning tasks may lead to improved perspective-taking abilities, or perspective-taking abilities may lead to improved performance on each spatial reasoning task. Fine-tuning a model on the six simpler tasks and checking whether this improves perspective-taking would be a more useful and interesting observation that could be used to improve existing VLMs.

Minor Weaknesses:
* In Figure 2, the legend "Static Inter-Object" does not match the description "Object-Relation Grounding" in the text.

Questions:
* What could be the reason behind the inconsistent impact of CoT reasoning: Cosmos-Reason1-7B and SpaceOm show improved performance, whereas SpaceThinker shows degraded performance on perspective-taking?
* In Figure 4 and L317-323, the authors show that consistency is strongly correlated with accuracy. However, this correlation may only imply that strong spatial reasoning leads to strong consistency, which is obvious. It would be more interesting if the other direction were true: strong consistency leads to stronger spatial reasoning. Is there evidence on whether stronger consistency leads to stronger spatial reasoning?
* In Figure 3, most models perform much better on the car_view_back task than on the other canonical view selection tasks. Is there any possible explanation for why this task is particularly easy for most VLMs?
* Why do the mental rotation tasks consist only of images rendered from ABO Objects?

References:
[1] Seeing through their eyes: Evaluating visual perspective taking in vision language models, Góral et al., 2024.
[2] 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark, Wufei et al., ICCV 2025.
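For concreteness, here is a rough sketch of the kind of pairwise consistency check this review discusses, assuming each item is asked in two logically linked phrasings (e.g., "Is A to the left of B?" and "Is B to the right of A?"); the paper's exact metric may be defined differently:

```python
# Fraction of question pairs whose yes/no answers are logically compatible.
from typing import List, Tuple

def pairwise_consistency(answer_pairs: List[Tuple[str, str]]) -> float:
    """For mirrored phrasings, compatibility means both answers agree."""
    consistent = sum(1 for a, b in answer_pairs if a == b)
    return consistent / len(answer_pairs)

# Hypothetical model outputs for five mirrored question pairs.
answers = [("yes", "yes"), ("no", "no"), ("yes", "no"), ("no", "no"), ("yes", "yes")]
print(f"consistency = {pairwise_consistency(answers):.2f}")  # 0.80
```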
Review 4
Title: SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
Soundness: 4: excellent | Presentation: 4: excellent | Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment.
EditLens Prediction: Lightly AI-edited

Summary:
This paper proposes SpinBench, a cognitively grounded benchmark designed to evaluate spatial reasoning in vision-language models, focusing on perspective taking and viewpoint transformation. It includes seven structured task types across 2,599 samples and 51 tasks. Evaluating 37 VLMs, the study finds consistent weaknesses such as egocentric bias, poor rotation understanding, and low consistency under symmetry and phrasing changes. Human-model correlations further confirm that SpinBench effectively captures genuine spatial reasoning challenges.

Strengths:
1. SpinBench offers a standardized, fine-grained, and vision-centric framework for analyzing the spatial reasoning capabilities of existing methods. Unlike previous benchmarks that primarily focus on semantic recognition, SpinBench introduces tasks such as perspective taking to effectively evaluate whether models can comprehend camera movements and exhibit true spatial modeling ability.
2. The experiments conducted on 37 VLMs are comprehensive and include an in-depth analysis of their performance.
3. The paper is well written and easy to follow, and the task definitions are reasonably motivated.

Weaknesses:
1. Although SpinBench includes a diverse range of question types, its overall data scale is relatively small compared to existing spatial reasoning datasets such as VSI-Bench and ViewSpatial-Bench.
2. SpinBench primarily evaluates VLMs using image inputs. A valuable extension would be to include baselines where VLMs take both images and their corresponding depth maps as input (e.g., generated by DepthAnything). This addition would help determine whether poor model performance stems from insufficient spatial information in the input data or from the model's limited spatial reasoning ability.

Questions:
I have no further questions for the authors; the paper clarifies most benchmark details well.
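A minimal sketch of the depth-augmented baseline suggested in weakness 2, assuming a Depth Anything checkpoint served through the Hugging Face depth-estimation pipeline (the checkpoint name and file paths below are assumptions, and any monocular depth estimator would do):

```python
# Generate a depth map per benchmark image; the (image, depth) pair can then be
# supplied as two images to any multi-image VLM when re-running the prompts.
from pathlib import Path
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # assumed checkpoint name
)

def make_depth_map(image_path: str, out_dir: str = "depth_maps") -> Path:
    """Save a grayscale relative-depth map for one image and return its path."""
    image = Image.open(image_path).convert("RGB")
    depth = depth_estimator(image)["depth"]  # pipeline returns a PIL depth image
    out_path = Path(out_dir) / (Path(image_path).stem + "_depth.png")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    depth.save(out_path)
    return out_path

depth_path = make_depth_map("spinbench/sample_0001.png")  # hypothetical file path
```

Comparing accuracy with and without the depth channel would separate "missing spatial input" from "missing spatial reasoning," as the reviewer suggests.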
Review 5
Title: SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
Soundness: 3: good | Presentation: 3: good | Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment.
EditLens Prediction: Moderately AI-edited

Summary:
This paper introduces SpinBench, a high-quality diagnostic benchmark for evaluating spatial reasoning in VLMs, with a focus on perspective-taking and mental rotation. Its design is cognitively grounded and features a progressive structure that isolates specific cognitive abilities. Extensive experiments uncover systematic failures such as egocentric bias and inconsistent reasoning, along with detailed discussions from various angles. The benchmark's difficulty is validated against human studies, showing a strong correlation between VLM accuracy and human response times.

Strengths:
1. Benchmark design. SpinBench is cognitively grounded and diagnostically powerful, with a progressive structure (from simple perception to complex perspective-taking) and controlled variations (e.g., consistency checks, frame-of-reference shifts) that allow precise identification of model failures.
2. Comprehensive and insightful discussion. The large-scale evaluation of various VLMs and the accompanying analysis yield significant findings, including the exposure of systematic biases (egocentrism), inconsistent reasoning, and distinct scaling patterns (emergent vs. smooth) across different spatial skills.
3. Strong validation with human studies. The strong correlation between VLM accuracy and human response time provides compelling evidence that the benchmark measures genuine cognitive difficulty, supporting its credibility.

Weaknesses:
1. Lack of architectural analysis. While the evaluation is extensive, the paper misses an opportunity to connect observed performance differences to specific architectural choices in the VLMs (e.g., vision encoders, fusion methods). Such analysis could provide more direct guidance for future model design. (This is minor.)

Questions:
1. In your scaling law analysis (Figure 6), you show that Identity Matching exhibits a sharp emergent capability, whereas Object Relation Grounding improves smoothly. This is a fascinating contrast. What do you believe is the underlying reason for this difference? Does it imply that cross-view object abstraction is a fundamentally different and more complex type of computation that only "activates" at a certain scale, compared to in-scene relation extraction?
2. Have you considered evaluating the impact of different image resolutions or aspect ratios on model performance? Given that spatial relations can be sensitive to visual details and object boundaries, it would be interesting to know how robust these models are to such perceptual variations.

Minor: Missing period in L288.
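A small sketch of the resolution-robustness probe suggested in question 2: re-evaluate the same items after resizing images to several resolutions. The `query_vlm` wrapper and the item format are hypothetical placeholders for whichever model API and data loader are actually used:

```python
# Re-run benchmark items at multiple image resolutions and compare accuracy.
from PIL import Image

RESOLUTIONS = [(224, 224), (448, 448), (896, 896)]

def query_vlm(image: Image.Image, prompt: str) -> str:
    # Placeholder: swap in a real call to the VLM being evaluated.
    return "left"

def accuracy_by_resolution(items):
    """items: iterable of (image_path, prompt, gold_answer) tuples."""
    results = {}
    for size in RESOLUTIONS:
        correct = total = 0
        for image_path, prompt, gold in items:
            img = Image.open(image_path).convert("RGB").resize(size, Image.BICUBIC)
            pred = query_vlm(img, prompt)
            correct += int(pred.strip().lower() == gold.strip().lower())
            total += 1
        results[size] = correct / total
    return results
```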