ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (67%) | 3.00 | 4.00 | 3311 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 1 (33%) | 6.00 | 2.00 | 2294 |
| Total | 3 (100%) | 4.00 | 3.33 | 2972 |
Title: GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes a new benchmark for evaluating the text-guided image editing abilities of existing models. The core idea is to introduce two additional evaluation dimensions: i) functional correctness, which the paper assesses with a VLM answering multiple-choice questions; ii) image content preservation, where the paper uses an off-the-shelf zero-shot segmentation model to extract the background regions for a preservation check. The paper further collects 1k editing examples for evaluating modern image editing models and conducts extensive analysis on the collected data and a group of strong editing models including GPT-Image-1.

Strengths:
+ The work addresses a genuine gap in current image editing benchmarks: existing protocols for assessing text-guided editing are generally global and fail to disentangle correctness from preservation. The proposal seems timely.
+ The automatic pipeline involving GPT, GroundingDINO, SAM, and masked metrics is well-engineered.
+ The use of multiple-choice VQA evaluation is more robust than binary yes/no formats (e.g., I2E-Bench), reducing chance accuracy and allowing large-scale automated evaluation.
+ The empirical evaluation seems extensive and thorough.

Weaknesses:
- Although the introduction of QA-based functional correctness is interesting, the proposed benchmark, if I understand correctly, focuses primarily on single-turn image editing. Multi-step, compositional, or iterative editing scenarios are missing, which limits real-world applicability.
- The scale of the human evaluation seems limited. The human study in Sec. 4.3 uses 100 examples with 4 annotators, which is small relative to the 1k+ samples in the full benchmark.
- The QA stage is heavily based on the GPT-4o model. This could introduce model bias into the benchmark, and future updates of that model would render the current results outdated.
- The content-preservation assessment still relies heavily on masked MSE/PSNR, which remains low-level and is similar to the setup of existing benchmarks.
- Admittedly, I am not very familiar with the current progress of this particular sub-domain. I would therefore also like to hear the other reviewers' opinions.

Questions:
Please refer to the weaknesses section for detailed questions. Thanks.

EditLens Prediction: Fully human-written
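For readers outside this sub-domain, the content-preservation check described in the review above amounts to masking out the edited object and comparing only the remaining pixels of the source and edited images. Below is a minimal sketch of such a masked MSE/PSNR computation, assuming the edit-region mask is available as a boolean NumPy array; it illustrates the general idea and is not the paper's actual code.

```python
import numpy as np

def masked_preservation_metrics(src, edited, edit_mask):
    """Compare source and edited images only outside the edited region (illustrative sketch).

    src, edited: HxWx3 arrays with values in [0, 255].
    edit_mask: HxW bool, True inside the edited object.
    """
    keep = ~edit_mask  # background pixels the edit should leave untouched
    diff = src[keep].astype(np.float64) - edited[keep].astype(np.float64)
    mse = float(np.mean(diff ** 2))
    psnr = 10 * np.log10(255.0 ** 2 / mse) if mse > 0 else float("inf")
    return {"masked_mse": mse, "masked_psnr": psnr}
```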
Title: GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes GIE-Bench, a grounded benchmark for *text-guided image editing* that jointly evaluates (1) functional correctness—whether the intended edit actually happened—via VQA-style *multiple-choice* questions, and (2) content preservation—whether non-targeted regions remain unchanged—via object-aware masking and masked similarity metrics. The dataset contains 1,080 curated image–instruction pairs across 20 categories and 9 edit types, each with an instruction, an object mask (from GroundingDINO→SAM), and an MCQ for automatic judging. Functional correctness is scored by a VLM judge (primarily GPT-4o; Gemini-2-Flash used for robustness), while preservation is computed with masked CLIP/SSIM/PSNR/MSE after SIFT/FLANN-based alignment.

Strengths:
- Clear, two-axis evaluation that disentangles *did the edit happen?* from *what collateral damage occurred?*—a practical and under-measured trade-off in editing.
- Object-aware preservation via GroundingDINO→SAM masks is a concrete improvement over global CLIP/LPIPS metrics that confound edits with preservation.
- Operational details (geometric alignment before pixel metrics; per-edit-type breakdown; deterministic judging; a second judge) increase reproducibility and confidence.
- Breadth of coverage across 9 edit types and 20 content categories; balanced sampling helps per-type comparisons.
- Human-metric correlations (e.g., masked CLIP/SSIM/PSNR/MSE vs. human preservation ranks) support the metric design.

Weaknesses:
- MCQs and correctness judgments rely on frontier VLMs (GPT-4o/Gemini). This raises *construct validity* and *reproducibility* questions (model updates, access, and potential judge–system coupling). Publishing non-proprietary baselines (e.g., open-weights VLMs) would help.
- Preservation hinges on the *inverted* object mask. Small mask errors (under/over-segmentation, ambiguous targets like "sky near horizon") can mis-score preservation. Quantifying mask quality and its effect (e.g., via perturbation studies) is needed.
- Single-turn, localized edits only; global style transfer, multi-step, and interactive refinement are out of scope, limiting ecological validity for real editing workflows.
- GPT-Image-1 appears in aggregate numbers but is excluded from the human-study pool; this complicates judge↔human comparison for the most capable model.
- Images come from a single stock repository; scene/style diversity and real-world artifact coverage may be narrower than web-scale distributions.
- Beyond Spearman correlations, more detailed statistical testing (per-type confidence intervals; bootstrap across images; significance for model ranking deltas) would strengthen claims.
- Even with 2–5 options, answer distributions or wording may create "easy distractors." Reporting per-question difficulty and entropy, plus adversarial revisions, would increase robustness.
- The paper compares against several benchmarks but could more directly *calibrate* its preservation/MCQ axes against recent human-alignment datasets and judge-based evaluations to quantify incremental benefit.

Questions:
1. Can you report correctness with at least one *open-weights* VLM judge (e.g., LLaVA-Next-ViT-Qwen2-VL) to mitigate dependence on proprietary models, and release MCQs to enable exact replication?
2. How sensitive are preservation scores to mask dilation/erosion (±k pixels) and to imperfect detections? Please provide curves and error bars.
3. Which edit types show the largest judge disagreement (GPT-4o vs. Gemini)? Any systematic MCQ failure modes (e.g., spatial terms, size ratios)?
4. The calibrated preservation score maps masked MSE to [0,1]. What normalization is used (per-type, per-image, global), and how does that choice affect rankings?
5. Why was GPT-Image-1 excluded from the human-study pool while included elsewhere? Could you add a small human study including it to close the loop?
6. Do results hold on other image sources (e.g., LAION subsets, COCO, web photographs) and on *global* edits (style transfer) when masks are "entire image"?
7. Any preliminary results on chained edits where preservation compounds across steps?
8. Did you analyze category imbalance effects (e.g., human faces vs. landscapes) on preservation/correctness?

EditLens Prediction: Fully AI-generated
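Question 2 above asks how robust the preservation scores are to small mask errors. One way to probe this, sketched below under the assumption that masks are boolean NumPy arrays, is to dilate or erode the edit mask by ±k pixels with OpenCV morphology and re-compute the masked score at each setting; the function names here (perturb_mask, sensitivity_curve) are illustrative, not from the paper.

```python
import cv2
import numpy as np

def perturb_mask(edit_mask, k):
    """Dilate (k > 0) or erode (k < 0) a binary edit mask by |k| pixels."""
    if k == 0:
        return edit_mask.astype(bool)
    kernel = np.ones((2 * abs(k) + 1, 2 * abs(k) + 1), np.uint8)
    op = cv2.dilate if k > 0 else cv2.erode
    return op(edit_mask.astype(np.uint8), kernel).astype(bool)

def masked_mse(src, edited, edit_mask):
    """MSE over pixels outside the (possibly perturbed) edit region."""
    keep = ~edit_mask
    diff = src[keep].astype(np.float64) - edited[keep].astype(np.float64)
    return float(np.mean(diff ** 2))

def sensitivity_curve(src, edited, edit_mask, ks=(-5, -3, -1, 0, 1, 3, 5)):
    """Preservation score as the mask is grown or shrunk, giving the curve asked for in Q2."""
    return {k: masked_mse(src, edited, perturb_mask(edit_mask, k)) for k in ks}
```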
Title: GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces GIE-Bench, a new benchmark for evaluating text-guided image editing models. The authors argue that current evaluation methods, like CLIP scores, are too vague. GIE-Bench takes a more fine-grained, two-pronged approach. First, it checks "functional correctness" by using an AI-generated multiple-choice question to see whether the model actually followed the editing instruction. Second, it measures "content preservation" by using object masking to check whether the model altered parts of the image it wasn't supposed to touch. After testing top models like the new GPT-Image-1, they found that while it's great at following instructions, it often over-edits the background. GIE-Bench offers a more precise and scalable way to see what these editing models are truly good (and bad) at. However, the paper lacks sufficient innovation.

Strengths:
1. It's not just about whether the edit happened, but also about what didn't happen. By separating "functional correctness" from "content preservation," the benchmark gives a much more complete picture of a model's performance.
2. Using object masks to evaluate only the unedited parts of an image is a strong design choice. It stops penalizing a model for making the correct change and focuses squarely on unintended collateral damage.
3. Fully automated and scalable: the entire pipeline—from generating questions and masks to scoring the results—is automated. This makes it easy to use, reproduce, and scale up for testing many models on thousands of images.

Weaknesses:
1. Only handles single-shot edits. The benchmark is designed for simple, single-step instructions ("change the car to red"). It cannot evaluate more complex, real-world scenarios where a user gives a series of commands or refines an image through back-and-forth conversation.
2. The introduction of a target mask is of limited importance to the development of current image editing benchmarks.

Questions:
1. The paper identifies a key trade-off in which models strong at instruction-following (like GPT-Image-1) are weaker at content preservation. Why do you think this is happening? Is it an architectural problem or a fundamental conflict in how these models are trained?
2. Looking ahead, what do you think is the biggest hurdle in expanding GIE-Bench to evaluate multi-turn, conversational image editing? Would it require a completely new way of thinking about evaluation?

EditLens Prediction: Fully AI-generated
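The functional-correctness axis described in all three reviews reduces to asking a VLM judge an auto-generated multiple-choice question about the edited image and checking its answer against a ground-truth option. A minimal sketch of that scoring loop is shown below; ask_vlm is a hypothetical callable standing in for whichever judge API (GPT-4o, Gemini, or an open-weights model) is actually used, and is not part of the paper.

```python
import re

def score_mcq(question, options, correct_letter, edited_image_path, ask_vlm):
    """Score one functional-correctness item with a VLM judge.

    ask_vlm(prompt, image_path) -> str is a hypothetical wrapper around the chosen VLM.
    Returns 1.0 if the judge picks the ground-truth option, else 0.0.
    """
    prompt = (
        "Answer the following multiple-choice question about the edited image "
        "with a single letter.\n"
        f"Question: {question}\n"
        + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    )
    reply = ask_vlm(prompt, edited_image_path)
    match = re.search(r"\b([A-E])\b", reply.upper())
    return 1.0 if match and match.group(1) == correct_letter.upper() else 0.0
```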