ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction   | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|-----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated    | 2 (50%)  | 3.00       | 4.00           | 3962               |
| Heavily AI-edited     | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited  | 0 (0%)   | N/A        | N/A            | N/A                |
| Lightly AI-edited     | 0 (0%)   | N/A        | N/A            | N/A                |
| Fully human-written   | 2 (50%)  | 3.00       | 3.50           | 2878               |
| Total                 | 4 (100%) | 3.00       | 3.75           | 3420               |
Review 1
Paper: PairBench: Are Vision-Language Models Reliable at Comparing What They See?
Soundness: 2 (fair)
Presentation: 3 (good)
Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold)
Confidence: 4 (confident in the assessment, but not absolutely certain)

Summary: This paper introduces **PAIRBENCH**, a systematic framework for evaluating the comparison abilities of vision-language models (VLMs), together with four key metrics: alignment with humans, order invariance, score diversity, and instruction following. The benchmark uses existing datasets (ImageNet, COCO, and WhatsUp) to create identical, transformed, and irrelevant image pairs, with human annotations defining ground-truth similarity. The authors benchmark a wide range of open and closed models (GPT, Gemini, InternVL, Qwen, LLaVA, etc.) and show that no model performs consistently across all metrics.

Strengths:
* PAIRBENCH fills a clear gap in multimodal evaluation by focusing on pairwise relational reliability rather than standard task-specific benchmarks.
* The use of both invariant and sensitive prompt conditions provides a nuanced test of controllability and instruction following.
* Correlation analysis with external benchmarks (AI2D, HallusionBench, MMMU, etc.) strengthens the empirical insight that comparison ability predicts higher-level reasoning.

Weaknesses:
* The framework assumes that human similarity judgments are the gold standard but does not discuss possible variability or subjectivity in those scores. Human judgments can vary substantially across individuals, and no such analysis is provided.
* The benchmark's image pairs are generated through basic transformations (rotation, crop, color jitter), which test low-level invariance rather than semantic equivalence (an illustrative sketch of this style of pair construction appears after this review). Real-world comparisons often involve semantically similar but visually diverse images, such as cross-domain or stylistically different examples. As a result, PAIRBENCH may not accurately reflect the reliability of VLMs in practical use cases (e.g., retrieval or multimodal evaluation).
* The paper does not explore using image generation models (e.g., Stable Diffusion, DALL·E) to create semantically similar but distributionally distinct image pairs. Such generative approaches could produce more realistic and diverse comparisons and would help test whether VLMs capture conceptual similarity rather than superficial pixel-level correspondence.
* There is no systematic study of which types of transformations or linguistic prompts most affect model performance.

Questions:
* You mention that prompt phrasing significantly affects MMScore. Could you provide quantitative results on how large this variance is (e.g., standard deviation across prompts)?

EditLens Prediction: Fully AI-generated
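To make concrete the kind of transformation-based pair construction this review critiques, here is a minimal, illustrative sketch using torchvision-style augmentations. It is not the authors' actual pipeline, and the specific transform parameters and helper name are assumptions for illustration only.

```python
# Illustrative sketch of "transformed pair" construction in the spirit of the
# transformations the review discusses (rotation, crop, color jitter).
# NOT the authors' pipeline; parameters here are assumptions.
from PIL import Image
import torchvision.transforms as T

def make_transformed_pair(path: str):
    """Return an (original, transformed) image pair for a low-level invariance test."""
    original = Image.open(path).convert("RGB")
    transform = T.Compose([
        T.RandomRotation(degrees=90),                                     # basic rotation
        T.RandomResizedCrop(size=original.size[::-1], scale=(0.7, 1.0)),  # crop, keep (h, w)
        T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    ])
    return original, transform(original)
```

Such pairs probe robustness to pixel-level perturbations; as the review notes, they do not probe semantic equivalence across visually distinct images.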
Review 2
Paper: PairBench: Are Vision-Language Models Reliable at Comparing What They See?
Soundness: 2 (fair)
Presentation: 4 (excellent)
Contribution: 3 (good)
Rating: 4 (marginally below the acceptance threshold)
Confidence: 3 (fairly confident; math/other details were not carefully checked)

Summary: The authors propose PairBench, a benchmark for measuring the capability of vision-language models as similarity kernels. PairBench is composed of image-image and image-text pairs that belong to one of three groups: (i) identical pairs, featuring almost the same image (or the corresponding caption); (ii) transformed pairs, involving the transformation of an image with one of six transformations; and (iii) irrelevant pairs, where the images are distinct. The task is to assign a similarity score to the pair under two conditions: one where the model should be invariant to a specified transformation and another where it should be sensitive. Model similarity scores are then assessed with four separate metrics measuring correlation with ground-truth judgements, symmetry, controllability, and smoothness of the similarity distribution. The authors build on top of pre-existing datasets to construct PairBench and evaluate a broad range of vision-language models on their benchmark, finding variability in model rankings across metrics.

Strengths:
- The four metrics considered are each well-motivated for the task of evaluating models as similarity kernels (a hedged sketch of how such metrics could be computed appears after this review).
- The authors' evaluation of a broad range of models with different sizes, strengths, and access levels (open/proprietary) allows them to describe interesting trends in task performance. Each evaluation is made more robust through the inclusion of multiple prompts.
- The insights regarding the lack of similarity are important, as they directly affect the cost-effectiveness of LLM judges for the task of similarity estimation. I found the relatively poor performance of the Gemini models surprising.
- The paper is clearly written throughout.

Weaknesses:
- My main criticism, and the reason for the soundness score and the overall rating, is the nature of the transformations. The only images that can be given a high ground-truth similarity score (10 or 6, from the fixed set of scores) are identical or transformed images. In real applications, however, practitioners are likely to be computing the similarity of distinct images (including those in Figure 16 that are not identical but could be considered similar because both contain birds). Treating all distinct images as maximally dissimilar regardless of semantic content makes it difficult for me to see how insights from the benchmark can be generalized, though perhaps I am missing something in the pair construction or the ground-truth score assignment.
- As a corollary, setting ground-truth similarity scores to only three distinct values could affect the validity of MMScores. I would be curious to see how model correlations with judgements change when the 300 image pairs in the human study are used alongside the actual human judgements themselves.
- I am not sure whether the claim in the introduction that reasoning performance depends on models functioning as effective similarity kernels holds. These two quantities are correlated, but that should not imply causation. From what I can tell, model performance on PairBench is correlated with the model's overall quality, as more recent and larger models tend to achieve higher MMScores.
- It is unclear to me how to interpret the smoothness or controllability results, as model performances are tightly clustered around similar values.

Questions:
- Just to make sure that my understanding is correct: for identical and transformed cases, the second image is always based on the first image, correct? Furthermore, my understanding is that the ground-truth similarity score is based purely on the pair category and the prompt type. Would that be the right interpretation? That is at least my reading of the paper, but Figure 16 makes me doubt my understanding.

EditLens Prediction: Fully human-written
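The review above summarizes PairBench's metrics (correlation with ground-truth judgements, symmetry, smoothness, controllability). As a hedged illustration of how the first three of these quantities might be computed from per-pair model scores, here is a minimal Python sketch; the paper's exact definitions and normalizations may differ, and the function names are my own.

```python
# Hedged sketch of metrics in the spirit of those the review describes.
# Exact definitions in the paper may differ; this is illustrative only.
import numpy as np
from scipy.stats import kendalltau

def mm_score(model_scores, gt_scores):
    """Kendall-tau alignment between model similarity scores and ground-truth scores."""
    tau, _ = kendalltau(model_scores, gt_scores)
    return tau

def relax_sym(scores_ab, scores_ba, eps=1.0):
    """Fraction of pairs whose scores agree within eps when the pair order is flipped."""
    diffs = np.abs(np.asarray(scores_ab) - np.asarray(scores_ba))
    return float(np.mean(diffs <= eps))

def smoothness(scores):
    """Normalized entropy of the 0-10 score histogram (higher = more diverse scores)."""
    hist, _ = np.histogram(scores, bins=np.arange(0, 12))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(11))
```

On this reading, the reviewer's concern about only three distinct ground-truth values translates directly into a coarse ranking target for the Kendall-tau computation.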
Review 3
Paper: PairBench: Are Vision-Language Models Reliable at Comparing What They See?
Soundness: 2 (fair)
Presentation: 3 (good)
Contribution: 2 (fair)
Rating: 2 (reject)
Confidence: 4 (confident in the assessment, but not absolutely certain)

Summary: This paper focuses on a critical issue: the visual comparison capability of VLMs. Applications of visual comparison such as automatic evaluation, retrieval re-ranking, and retrieval-augmented generation currently lack systematic assessment. Existing evaluation methods either fail to isolate comparison ability from other model capabilities, rely on costly human annotation, or overlook critical flaws in state-of-the-art VLMs (e.g., score asymmetry when reversing image pair order). To address this, the authors propose PAIRBENCH, a low-cost, reproducible framework for evaluating VLMs' visual comparison reliability. The framework's design is anchored in two key components: (1) a controllable dataset of 70K pairs, derived from three public benchmarks (ImageNet, MS-COCO, WhatsUp), with pairs categorized as Identical (near-duplicate or image-text matches), Transformed (images with controlled manipulations like color jitter), or Irrelevant (unrelated content); ground-truth (GT) scores are validated via 70+ human annotators. (2) Four complementary metrics to quantify reliability: MMScore (Kendall-tau alignment with human rankings, the primary metric), ε-RelaxSym (order consistency, ε=1), Smoothness (score distribution diversity via entropy), and Controllability (responsiveness to prompt instructions). Experimental results across VLMs reveal that no model dominates all metrics (e.g., GPT-5 achieves the highest MMScore (83.63%) but lower Smoothness, while InternVL2.5-8B leads open-source models (MMScore: 81.50%)). Critically, PAIRBENCH metrics (notably MMScore) exhibit strong correlations with complex reasoning benchmarks (90% with MMMU, 80% with HallusionBench).

Strengths:
1. Methodological rigor in dataset and metric design: the dataset is controllable and human-validated, derived from three public benchmarks (ImageNet, MS-COCO, WhatsUp) to ensure reproducibility, with pairs categorized into Identical/Transformed/Irrelevant to isolate specific visual variations. Ground-truth scores are validated via 70+ annotators, avoiding subjective biases in "gold standard" labels.
2. Insight into the correlation between VLMs' visual comparison ability and their performance on complex multimodal tasks: this insight upends the view that visual comparison is a trivial, isolated task and instead frames comparison ability as a core engine for complex tasks.

Weaknesses:
1. Disconnect between the benchmark's pair design and practical visual comparison demands: the current framework defines three pair types (Identical/Transformed/Irrelevant) based on trivial visual manipulations (e.g., color jitter, basic rotation, Gaussian blur) or simple content overlap (e.g., near-duplicate images, random irrelevant content). These designs fail to capture the core real-world capabilities users actually care about: specifically, a model's ability to (1) recognize the same object across diverse perspectives (e.g., a chair viewed from the front vs. side vs. top, with varying lighting/occlusion) and (2) establish cross-scenario consistency (e.g., identifying a cat in a studio photo vs. a street scene vs. a cartoon illustration). For example, the "Transformed" pairs in PAIRBENCH only apply superficial, single-dimensional changes (e.g., shifting hue or rotating by 90°) that rarely challenge a model's fundamental object-identity cognition. In contrast, real applications (e.g., retail product retrieval, autonomous driving object tracking, medical image diagnosis) require models to ignore viewpoint or scene variations while preserving object consistency. This toy-like design reduces the benchmark's ecological validity: even if a model performs well on PAIRBENCH, that provides little assurance of its utility in scenarios where visual comparison truly matters. The paper also fails to justify why it prioritizes trivial manipulations over these high-stakes real-world needs, weakening the motivation for the benchmark's existence.
2. Inappropriate claim: the paper's core value proposition, framing PAIRBENCH as an "efficient surrogate for model selection" (via MMScore's correlation with complex benchmarks like MMMU), suffers from flawed causal inference. The observed strong correlations (e.g., 90% between MMScore and MMMU) are not evidence of PAIRBENCH's representativeness, but rather likely a byproduct of a confounding variable: the model's general visual understanding ability. A model with strong inherent visual capabilities will naturally perform well on both PAIRBENCH (which relies on basic visual similarity judgment) and complex multimodal tasks. The paper provides no evidence to rule out this alternative explanation: for instance, it does not test whether there exist models with weak general visual ability but strong PAIRBENCH performance (that then fail at complex tasks), or vice versa. Without such control experiments (a sketch of one such analysis follows this review), the claim that PAIRBENCH is a valid surrogate collapses into a trivial observation: models good at visual tasks are good at visual tasks.

Questions:
1. Do you have examples of models with strong general visual ability (e.g., high ImageNet accuracy) but poor PAIRBENCH performance (due to weak comparison-specific skills like order consistency)? Conversely, are there models with weak general visual ability but strong PAIRBENCH performance that still fail at complex tasks?
2. Can you provide a concrete example where a model's PAIRBENCH performance directly predicted its success (or failure) in a non-toy visual comparison task or visual-cue comparison task?

EditLens Prediction: Fully AI-generated
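The confounding concern above suggests a concrete control analysis: check whether the PairBench-to-MMMU relationship survives after accounting for a general visual-ability proxy. Below is a hedged sketch of one such analysis, residualizing both per-model scores on the proxy and then correlating the residuals; the toy per-model numbers and the choice of proxy are assumptions for illustration only, not results from the paper.

```python
# Sketch of the control analysis the review appears to call for: a partial
# correlation between PairBench MMScore and a complex benchmark (e.g., MMMU),
# controlling for a general visual-ability proxy. All numbers below are
# hypothetical placeholders, one entry per evaluated VLM.
import numpy as np
from scipy.stats import spearmanr

def partial_spearman(x, y, z):
    """Spearman correlation of x and y after residualizing both on confounder z."""
    def residualize(target, confounder):
        design = np.column_stack([np.ones_like(confounder), confounder])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        return target - design @ coef
    x_res = residualize(np.asarray(x, float), np.asarray(z, float))
    y_res = residualize(np.asarray(y, float), np.asarray(z, float))
    rho, _ = spearmanr(x_res, y_res)
    return rho

mmscore = [83.6, 81.5, 78.0, 70.2, 65.4]   # hypothetical PairBench MMScores
mmmu    = [61.0, 58.3, 55.1, 48.7, 44.0]   # hypothetical MMMU accuracies
general = [79.0, 77.5, 74.2, 68.0, 62.1]   # hypothetical general-perception proxy
print(partial_spearman(mmscore, mmmu, general))
```

A correlation that collapses after this adjustment would support the reviewer's confounding hypothesis; one that persists would strengthen the surrogate claim.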
Review 4
Paper: PairBench: Are Vision-Language Models Reliable at Comparing What They See?
Soundness: 2 (fair)
Presentation: 2 (fair)
Contribution: 1 (poor)
Rating: 2 (reject)
Confidence: 4 (confident in the assessment, but not absolutely certain)

Summary: The paper introduces PairBench, which the authors use to systematically assess VLMs' ability to make reliable judgements over data pairs (image-text, image-image). They use off-the-shelf datasets, a set of controlled image transformations, and human-annotated ground truth to create the benchmark, and then evaluate with four key metrics.

Strengths:
- The paper tackles an important issue: as VLMs are increasingly used as judges, how trustworthy are they as automated evaluators or rankers?
- The benchmark uses existing datasets with controlled transformations. The authors conduct a thorough evaluation of multiple VLMs. Using different prompts for evaluation is a good point as well.
- The paper is easy to read.

Weaknesses:
- A benchmark of pairwise images on which VLMs fail is hardly new [1][2].
- The method merely uses basic data augmentations to create the benchmark. These do not represent how users employ VLMs in real life to compare or judge between samples; the pairs are mostly synthetic or self-generated. The task seems closer to a visual robustness test.
- A proper ablation study of why models fail, i.e., whether it is a vision-encoder problem or an alignment problem, is needed. Chain-of-thought prompting or in-context learning should also help the models.
- Feature analysis on vision foundation models, or cosine similarity of features (e.g., from Florence or CoCa) for the two images, would be a good ablation study (a sketch of such a baseline follows this review). It would add value by clarifying whether this is a perception problem or a language-prompting problem.
- The paper claims high correlations between PairBench metrics and multimodal benchmark performance, which suggests PairBench is not unique.
- More information is needed on how the dataset splits were derived from the off-the-shelf datasets.

[1] ReMI: A Dataset for Reasoning with Multiple Images
[2] MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

Questions:
- I fail to understand the novelty of the benchmark. The authors may further clarify why it is unique with respect to real-world scenarios.

EditLens Prediction: Fully human-written
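The embedding-similarity ablation suggested in this review could look like the following sketch, which scores an image pair by the cosine similarity of vision-encoder features. CLIP is used here as a stand-in for the foundation models the reviewer mentions (Florence, CoCa); the checkpoint and helper name are illustrative assumptions.

```python
# Sketch of the feature-similarity baseline the review suggests: cosine
# similarity of vision-encoder embeddings for an image pair. CLIP stands in
# for the models the reviewer names; not part of the paper under review.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embedding_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between the CLIP image embeddings of two images."""
    images = [Image.open(p).convert("RGB") for p in (path_a, path_b)]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```

Comparing such encoder-only scores against the VLM-assigned scores would help separate perception failures from prompting or alignment failures, which is the distinction this reviewer asks for.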