ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 6.00 | 4.00 | 4125 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 4.00 | 3.00 | 2249 |
| Fully human-written | 2 (50%) | 6.00 | 4.00 | 3196 |
| Total | 4 (100%) | 5.50 | 3.75 | 3192 |
Individual Reviews

Review 1
Title: Do Vision-Language Models Respect Contextual Integrity in Location Disclosure?
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces VLM-GeoPrivacy, a benchmark designed to test whether VLMs respect contextual integrity when disclosing image locations. It evaluates 14 leading VLMs on 1,200 real-world images annotated for context, intent, and appropriate disclosure granularity, showing that current models frequently over-disclose sensitive locations and fail to align with human privacy norms. Even advanced models like GPT-5 and Gemini-2.5 achieve only about 50% agreement with human judgments and are easily manipulated by iterative or adversarial prompts. The authors conclude that future multimodal systems must incorporate context-aware privacy reasoning rather than blanket disclosure limits to protect users from location-based privacy risks.

Strengths:
+ Evaluates many frontier models, providing a comprehensive view.
+ The framing of the paper is good and the motivation is clear. I like the idea that "models should be aware of whether the geo-information in specific images can be revealed." The framing of the three possible use cases in Section 2.1 is good.
+ The questions are constructed based on existing regulations, and the authors used an iterative annotation design for refinement.

Weaknesses:
- The conversion from Nissenbaum's contextual integrity and the cited regulations to Q1-7 lacks support. [See my Q1]
- Some design choices look ad hoc: Phi-3.5 for filtering, GPT-4o-mini for labeling, GPT-4.1-mini as the judge. [See my Q2]
- As the dataset is the core contribution of this paper, information about the annotation process should be clearer. [See my Q3]

Questions:
1. Lines 155-156: "we found that adding these concrete, intermediate questions improves annotator consistency." To what extent do you think the preceding questions deliberately guide annotators toward a desired result? The Krippendorff's alpha of 0.83 is good, but did you also measure alpha when Q1-6 are not given? People's intuition should be considered as well. Another possible test is to let annotators answer Q7 first, then invite them to explain their choice using Q1-6, to see how much Q1-6 contribute to human decisions. (See the sketch after this review.)
2. Did you run human verification on the Phi-3.5-Vision filtering results? A better approach might be to use multiple VLMs and take a majority vote. Do you use GPT-4o-mini, rather than Phi-3.5, to classify images into the privacy-sensitivity categories? Why switch to GPT-4o-mini for this task?
3. Lines 238-244: I am still confused. Do you only annotate 400 images out of the 1,200? Where did you recruit annotators? How many images did each of them annotate? What were the average working hours and how much was paid per hour? What is the demographic information of the annotators? I ask the last question because their locations, nationalities, and birthplaces can affect their familiarity with the images, thereby affecting Q1. Can you also provide the location distribution of your 1,200 images?
4. Does "context" refer to Q1, Q4, Q5, and Q6?
5. Do you ask models Q1-Q7 in one consecutive conversation or separately?
6. Please elaborate on the differences between the three privacy-leakage metrics, i.e., Loc, Vio, and Over-Disc.
7. Do you compare models' own Q7 granularity answers with the granularity answers extracted from their free-form generations? Is there any relationship? Currently, from Table 2, the two accuracies vary a lot.

Minor suggestions and typos:
1. Duplicated references: Lines 565 to 571; Lines 648 to 656.
2. Table 1: middle image, Q4, "buit" -> "but".
3. Fig. 2: font size is too small.
4. Table captions should appear before tables.

EditLens Prediction: Fully human-written
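The ablation proposed in Q1 amounts to comparing Krippendorff's alpha on Q7 labels collected with and without the intermediate questions. Below is a minimal sketch, assuming the `krippendorff` PyPI package and hypothetical label matrices (annotators as rows, images as columns; the granularity coding is illustrative, not the paper's):

```python
import numpy as np
import krippendorff

# Hypothetical Q7 annotations: rows = annotators, columns = images.
# Codes: 0 = refuse, 1 = country, 2 = city, 3 = exact place (illustrative only).
q7_with_guidance = np.array([    # annotators saw Q1-6 before answering Q7
    [3, 2, 0, 1, 2],
    [3, 2, 0, 1, 2],
    [3, 1, 0, 1, 2],
], dtype=float)
q7_without_guidance = np.array([  # annotators answered Q7 directly
    [3, 2, 1, 1, 2],
    [2, 0, 0, 3, 2],
    [3, 1, 0, 1, 1],
], dtype=float)

for name, data in [("with Q1-6", q7_with_guidance),
                   ("without Q1-6", q7_without_guidance)]:
    # Ordinal alpha, since disclosure granularities are ordered.
    alpha = krippendorff.alpha(reliability_data=data,
                               level_of_measurement="ordinal")
    print(f"Krippendorff's alpha ({name}): {alpha:.2f}")
```

A markedly lower alpha without Q1-6 would support the authors' consistency claim, but would not by itself rule out the guidance concern, which the Q7-first-then-explain protocol targets more directly.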
Review 2
Title: Do Vision-Language Models Respect Contextual Integrity in Location Disclosure?
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces a visual contextual integrity benchmark for geolocation with VLMs. The benchmark consists of 1,200 images with ground-truth location labels and seven human-labeled contextual questions/cues that define the privacy context of the image. Using this benchmark, various VLMs are evaluated for i) judging/replicating the fine-grained privacy context of the image, and ii) providing geolocation information that is both accurate and at the appropriate level of granularity. The evaluation shows that current models are heavily miscalibrated in terms of visual contextual privacy. Preliminary experiments with few-shot examples show promise for either inference-time or training-time improvements on contextual integrity.

Strengths:
- Important and timely problem.
- Sound methodology for constructing the benchmark. Especially appreciated are the efforts made to calibrate the labels and labeling questions well.
- I believe the benchmark will enable targeted research on improving CI for VLMs.
- Interesting and sound evaluation. Promising results with few-shot prompting. Kudos for evaluating different levels of adversaries for location inference.

Weaknesses:
- The paper focuses solely on geolocation, which is easy to evaluate. However, as shown by prior work [1] (a relevant citation missing from this paper), VLMs are capable of inferring other private attributes from images as well, such as sex, age, or income.
- This is maybe half a question: the paper currently defines the appropriate privacy context for each image according to global guidelines. However, in practice I could imagine that a user might not intend to share their location through a given image, independently of which privacy context it would fall into in the framework of contextual integrity. How could one account for that? Is CI the right tool in this case, or would it indeed be better if models simply refused to do geolocation (or other private-attribute inferences)?
- The paper would benefit from a discussion of mitigating inference risks, both on the users' side and on the providers' side.
- The paper does not comment explicitly in the main part on the geographic/cultural distribution/bias of either the images or the concrete instantiation of the contextual integrity framework. I would assume that at least some of the labels (both location and privacy context) would change depending on the given cultural interpretation of the framework. I believe this is already accounted for implicitly to a certain extent (e.g., Q1), but I wonder if the authors can add more on this. Obviously, this work is a first step, and a benchmark aimed explicitly at cultural variations and diversity is definitely more in the scope of follow-up work.

References:
[1] Tömekçe et al., Private Attribute Inference from Images with Vision-Language Models. NeurIPS 2024.

Questions: See weaknesses.

EditLens Prediction: Fully human-written
Review 3
Title: Do Vision-Language Models Respect Contextual Integrity in Location Disclosure?
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper introduces VLM-GEOPRIVACY, a 1.2K-image benchmark for testing whether vision-language models (VLMs) respect contextual integrity when describing locations. Each image is labeled for visual recognizability, subject visibility, and an "appropriate disclosure" level (e.g., refuse / city / exact place). Fourteen open- and closed-source VLMs are evaluated in multiple-choice and free-form settings. Results show that models often over-disclose (~50% of cases) and fail to judge privacy context correctly.

Strengths:
* Problem is socially relevant: privacy and location disclosure are important deployment issues.
* Benchmark and labeling are carefully designed.
* Evaluation covers many models and prompt styles, producing a clear quantitative picture.
* Writing and visuals are clear.

Weaknesses:
The paper presumes that location disclosure is inherently undesirable and that the visual context alone suffices to infer disclosure appropriateness. In practice, location sharing on social media is often strategically self-disclosing: users may intentionally reveal or ambiguously hint at places for social signaling, identity performance, or prestige. Without modeling user intent, audience, or platform norms, the proposed notion of "contextual integrity violation" collapses into a moralized prior rather than an empirically grounded construct. Therefore, while the question ("can VLMs respect contextual integrity?") is interesting, the methodology has limited interpretive value and yields limited insight: models are optimized to output the most probable, semantically specific description given their training distribution, so it would actually be unsurprising, and even expected, for a VLM to provide accurate and detailed location descriptions. Over-disclosure is thus an expected by-product of likelihood maximization, not evidence of moral failure. The current experimental setup measures this natural behavior rather than revealing a new deficiency.

Questions:
How do you know the person truly didn't want to share their location? That is, how do you label your data, and how can you be confident that your judgments about "intent to disclose" are actually correct? Human intent is complex; even people can't reliably judge it (i.e., read other people's minds), let alone models.

EditLens Prediction: Lightly AI-edited
Review 4
Title: Do Vision-Language Models Respect Contextual Integrity in Location Disclosure?
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces VLM-GEOPRIVACY, a benchmark to test whether vision-language models (VLMs) respect contextual integrity when disclosing location from images. It assembles real-world photos with seven annotations (e.g., sharing intent, visibility of people, acceptable disclosure granularity), evaluates multiple generation regimes (vanilla, iterative reasoning, adversarial prompting), and measures over/under-disclosure and policy violations. Experiments on many VLMs reveal strong geolocation ability yet systematic over-disclosure, with few-shot, context-matched exemplars partially mitigating harm.

Strengths:
1. Timely and important problem: contextualized privacy in image geolocation is underexplored yet high-impact for deployment safety.
2. Evaluates a diverse set of VLMs under vanilla, chain-of-thought, and adversarial setups, yielding informative failure patterns.
3. Generally well written, with transparent descriptions of the proposed dataset and metrics, making the study easy to follow.

Weaknesses:
1. Insufficient baselines weaken claims of "strong geolocation." The paper compares only across VLMs; it lacks head-to-head evaluation against dedicated geolocation systems (e.g., retrieval-based pipelines) on the same test set. I suggest that the authors add specialized geolocation and classical CV baselines [R1][R2], or report directly comparable numbers from prior work with a careful discussion of any differences.
2. Single-judge dependence for free-form granularity grading risks bias. Using GPT-4.1-mini as the sole judge, while it is also a system under evaluation, invites same-family preference and style-matching artifacts. It would be better to introduce at least one independent judge from a different provider/architecture and report inter-judge agreement, including tie-breaking rules. (See the sketch after this review.)
3. Limited human auditing. The reported manual check (300 samples, ~96% agreement) is too small to ensure robustness across models and prompts. I recommend expanding human verification to ≥1,000 samples, stratified by model, prompt regime, and predicted label, and reporting confidence intervals and error breakdowns.
4. Cultural and legal context sensitivity is under-analyzed. Heuristics (e.g., children/indoor/political rallies → restrict disclosure) may vary across jurisdictions and norms. It is recommended to add boundary-case discussions (e.g., public interest in mass protests vs. participant risk), collect cross-region annotations, and report inter-cultural agreement for key questions.
5. Data licensing and release plan remain unclear. The ethics statement says the dataset will be released under CC BY-NC 4.0, but the data-sourcing section lists platforms including Flickr and Shutterstock without clarifying whether the latter is redistributable under CC BY-NC. If not, please state explicitly whether all items that cannot be redistributed under CC BY-NC were excluded, and describe the filtering procedure.
6. Inference-time settings are not comparable or statistically characterized. Models use different reasoning modes/budgets; decoding uses temperature 0.7 without multiple runs. I suggest that the authors standardize reasoning budgets, run multiple seeds, and report mean±std (or CIs) for all key metrics; consider deterministic decoding (temperature 0) for safety-critical refusal metrics.
7. The geoparsing pipeline may induce systematic bias. Location strings are extracted by a specific LLM and geocoded via a single API, which can over-resolve ambiguous or homonymous toponyms. The authors should report parsing/geocoding failure rates and ambiguity handling, run ablations with alternative extractors (regex/lexicons) and geocoders, and analyze the sensitivity of street/city/region accuracy to these choices.

References:
[R1] Ma, Wanlun, et al. "LocGuard: A location privacy defender for image sharing." IEEE Transactions on Dependable and Secure Computing 21.6 (2024): 5526-5537.
[R2] Clark, Brandon, et al. "Where we are and what we're looking at: Query based worldwide image geo-localization using hierarchies and scenes." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Questions: See above comments.

EditLens Prediction: Fully AI-generated
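Two of the suggestions above (points 2 and 3/6) reduce to standard agreement and uncertainty estimates. Below is a minimal sketch, assuming scikit-learn and NumPy; the judge labels and per-image outcomes are hypothetical stand-ins, not values from the paper:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# (i) Agreement between two independent judges on free-form granularity labels.
judge_a = ["city", "refuse", "exact", "city", "country", "exact"]
judge_b = ["city", "refuse", "city",  "city", "country", "exact"]
print(f"Inter-judge Cohen's kappa: {cohen_kappa_score(judge_a, judge_b):.2f}")

# (ii) 95% bootstrap confidence interval for one model's over-disclosure rate.
rng = np.random.default_rng(0)
over_disclosed = rng.integers(0, 2, size=400)  # stand-in 0/1 outcome per image
boot = [over_disclosed[rng.integers(0, over_disclosed.size, over_disclosed.size)].mean()
        for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Over-disclosure rate: {over_disclosed.mean():.3f} (95% CI [{lo:.3f}, {hi:.3f}])")
```

If the granularity labels are mapped to an ordered scale, a weighted kappa (the weights="quadratic" option of cohen_kappa_score) would also credit near-misses; the same bootstrap can be stratified by model and prompt regime as suggested in point 3.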