ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (33%) | 6.00 | 3.00 | 3981 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (33%) | 4.00 | 4.00 | 4647 |
| Fully human-written | 1 (33%) | 2.00 | 5.00 | 1731 |
| Total | 3 (100%) | 4.00 | 4.00 | 3453 |
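
As a quick sanity check (editor's addition), the Total row matches the unweighted means of the three per-review values above:

```python
# Sanity-check the Total row: unweighted means over the three reviews listed above.
ratings     = [6.00, 4.00, 2.00]   # Fully AI-generated, Lightly AI-edited, Fully human-written
confidences = [3.00, 4.00, 5.00]
lengths     = [3981, 4647, 1731]

mean = lambda xs: sum(xs) / len(xs)
print(mean(ratings), mean(confidences), round(mean(lengths)))   # 4.0 4.0 3453
```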
Title: CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper introduces CoPatch, a training-free framework for zero-shot referring image segmentation (RIS) that enhances spatial grounding by (i) augmenting CLIP's local text features with context tokens and (ii) extracting patch-level image features from an intermediate visual layer to build a context-aware spatial map (CoMap). The method clusters high-similarity patches and uses the clusters to re-rank candidate masks from a pre-trained segmenter, then applies a spatial coherence filter when explicit positional cues are present. The authors argue that CLIP is not intrinsically "spatially blind"; rather, spatial information exists in intermediate representations and can be surfaced without gradient or attention maps. Experimentally, CoPatch achieves strong zero-shot mIoU across RefCOCO/+/g and PhraseCut, often surpassing prior zero-shot and even weakly/pseudo-supervised baselines; top-1 improvements range roughly +2–7 mIoU, with much larger margins when reporting top-3 oracle metrics. The approach is hyperparameter-light (exit layer, threshold, and spatial coherence weight) and is claimed to require only a single forward pass for the map. Qualitative examples demonstrate better localization than Grad-CAM-based pipelines, and ablation analysis attributes the gains to both context-token text features and CoMap-guided candidate selection. (A minimal mask-scoring sketch appears after this review.)

Strengths:
- Leveraging intermediate CLIP patch embeddings, along with lightweight clustering, to recover spatial cues is elegant and avoids fine-tuning or expensive gradient maps. The paper provides evidence (layerwise similarity trends and latency comparisons) that this approach is both practical and efficient.
- Results on RefCOCO/+/g and PhraseCut demonstrate state-of-the-art zero-shot performance across multiple CLIP/DFN backbones. Ablations isolate the contributions of context tokens (CT), top-candidate (TC) re-ranking via CoMap, and spatial coherence (SC).
- Visualizations demonstrate a sharper focus on the referred object than the Grad-CAM-based IteRPrimE; the spatial/non-spatial subset analysis strengthens the claim that both spatial grounding and semantics improve.

Weaknesses:
- Many baselines use SAM or FreeSOLO proposals, whereas CoPatch uses Mask2Former and also reports a Mask2Former upper bound. Without re-running baselines under the same proposal pool, some gains may reflect differences in candidate quality rather than scoring alone. A stronger fairness control would standardize the mask generator across methods.
- The paper frequently highlights considerable "Top-3" mIoU improvements, which reflect whether the correct mask appears among the top-k, not whether the method selects it. Unless coupled with user interaction or an automatic picking rule, these numbers are diagnostic rather than directly actionable.
- The negation of patch embeddings to fix an "inversion artifact," the clustering threshold δ, and the selection logic (including negative text features + SC) are plausible but somewhat ad hoc; theoretical motivation or a broader sensitivity analysis (beyond the 10% val tuning on RefCOCOg) would improve confidence in robustness.
- The extraction of "primary noun + context tokens" hinges on chunking/parsing rules and English expressions; the paper does not show multilingual or out-of-grammar robustness, and it is unclear how failures in chunking affect CoMap and mask scoring. (Appendix references and tool citations suggest rule-based tokenization choices, but details are light in the main text.)

Questions:
- Can the authors re-run the strongest prior zero-shot baselines using the same Mask2Former proposal set to isolate scoring improvements from proposal quality? If not, can they at least show CoPatch scoring on SAM proposals to bridge the comparison?
- How sensitive is the context-token extraction to parsing errors or non-English phrasing (e.g., multilingual RefCOCO variants)? Are there failure analyses showing when CT helps versus when it hurts?

EditLens Prediction: Fully AI-generated
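
To make the mask re-ranking step summarized above concrete, here is a minimal sketch in Python (an editor's illustration, not the authors' code); the function name, the IoU-style score, and the simple thresholding used in place of the paper's patch clustering are all assumptions:

```python
import numpy as np

def rank_masks_by_comap(comap, candidate_masks, delta=0.6, top_k=3):
    """Score and re-rank candidate masks against a CoMap-style similarity map.

    comap:           (H, W) patch-text similarity map, upsampled to image size
                     and normalized to [0, 1].
    candidate_masks: list of (H, W) boolean arrays from a pre-trained segmenter
                     (Mask2Former in the paper's setup).
    delta:           threshold used here as a crude stand-in for the paper's
                     clustering of high-similarity patches.
    """
    region = comap >= delta                       # high-similarity region
    scores = []
    for mask in candidate_masks:
        inter = np.logical_and(mask, region).sum()
        union = np.logical_or(mask, region).sum() + 1e-6
        scores.append(inter / union)              # IoU between mask and region
    order = np.argsort(scores)[::-1]              # best-scoring masks first
    return [(candidate_masks[i], scores[i]) for i in order[:top_k]]
```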
Title: CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP

Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper presents a method for leveraging spatial knowledge in a pretrained model to improve zero-shot referring image segmentation. In particular, the proposed CoPatch constructs hybrid text features by incorporating context tokens that carry spatial cues. In addition, CoPatch extracts patch-level image features through a novel path found in the intermediate layers. These enhanced features are then used to compute the image-text similarity map. (A short illustrative sketch appears after this review.)

Strengths:
(1) The motivation of leveraging spatial knowledge in the pretrained model to improve zero-shot referring image segmentation is well presented.
(2) The illustrations of the context tokens, the patch-level spatial map, the fusion, and the re-ranking of top candidate masks are mostly clear and intuitive.

Weaknesses:
(1) The contribution is limited. In particular, the task setting is not clearly presented: this zero-shot setting is not well distinguished from the supervised setting, since a pretrained model is employed. In addition, the experimental comparison appears unfair and problematic.
(2) On Page 6, Lines 292-293, the comparison is unfair since the proposed CoPatch uses Mask2Former as a pretrained model while the compared SOTA methods do not.
(3) In Table 1, the Mask2Former Upper Bound performs much better than Mask2Former + CoPatch; this is odd and no explanation is given.
(4) On Page 7, Lines 347-350, the oracle performance is not well explained, which makes the comparison and discussion confusing. In addition, the results are not well organized to show a clear comparison.
(5) On Page 7, Line 377, still regarding the oracle performance: the low performance is not mainly due to the pretrained models, and it seems to have nothing to do with the main experimental results.

Questions:
No.

EditLens Prediction: Fully human-written
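
As a hypothetical reading of the hybrid text features and the image-text similarity map described in this review, a short sketch follows; the helper names and the averaging weight `alpha` are assumptions, not the paper's formulation:

```python
import torch
import torch.nn.functional as F

def hybrid_text_feature(noun_emb, context_embs, alpha=0.5):
    """Blend the primary-noun embedding with context-token embeddings
    (e.g., "woman" + "smiling"). The simple average and the weight `alpha`
    are assumptions, not the paper's exact rule."""
    if context_embs:
        ctx = torch.stack(context_embs).mean(dim=0)
        noun_emb = alpha * noun_emb + (1.0 - alpha) * ctx
    return F.normalize(noun_emb, dim=-1)

def patch_text_similarity(patch_embs, text_emb):
    """Cosine similarity between intermediate-layer patch embeddings (N, D)
    and one hybrid text feature (D,); returns an (N,) map over patches."""
    return F.normalize(patch_embs, dim=-1) @ text_emb
```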
Title: CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
CoPatch is a training-free, zero-shot referring image segmentation (RIS) method. It aims to improve spatial grounding in CLIP by (1) constructing hybrid text features that augment the primary noun with context tokens (CT) (e.g., "woman smiling"), and (2) building a patch-level spatial map, CoMap, from intermediate CLIP visual layers. The CoMap is generated by computing patch-text cosine similarity, crucially applying a negation (sign flip) to the patch embeddings before projection to correct for "opposite visualization" artifacts. This map is then clustered and used to re-rank candidate masks generated by an off-the-shelf model. The approach shows improved zero-shot mIoU over baselines on RefCOCO/+/g and PhraseCut. Its Top-3 performance is exceptionally strong (e.g., +16.28 mIoU over Top-1 on RefCOCOg val), suggesting the CoMap is highly effective at identifying good candidates, even if the final Top-1 selection is imperfect.

Strengths:
1. **Simple and well-motivated method**: Based on the experimental results, tapping intermediate layers for spatial knowledge and using a simple negation to fix inverted saliency maps are highly effective.
2. **Strong performance**: The significant boost from Top-1 to Top-3 results (Table 1) validates the core claim: CoMap is a superior guide for scoring and ranking mask proposals compared to standard global-local similarity.
3. **Good component analysis**: e.g., the ablations in Table 3(b) effectively demonstrate that each of the components, Context Tokens (CT), Top Candidate (TC) re-ranking, and Spatial Coherence (SC), provides a measurable benefit.

Weaknesses:
1. **Concerns on architectural generalizability**: The method's generalizability is a key concern, as its effectiveness appears tightly coupled to the specific architecture and training objective of ViT-based CLIP models. It remains unclear whether the CoMap framework can be extended to convolutional backbones, such as the convnext_large_d_320 variant of OpenCLIP. Furthermore, the "negation trick," which seems tailored to correct an artifact of CLIP's contrastive loss, may not be relevant or effective for models trained with different objectives, such as the sigmoid-based loss used in ViT-SO400M-14-SigLIP-384. I would like to see experiments with these variants.
2. **Missing quantitative ablations** (a sketch of both toggles appears after this review):
   1. *Negation on/off*: The negation step is central to making CoMap work (Section B.2, Figure 11). However, there is no mIoU/oIoU ablation showing performance with versus without this negation. This is a critical piece of evidence needed to validate the component's true impact.
   2. *No-clustering baseline*: How does the method perform if masks are scored directly with the raw CoMap (no connected-component clustering)?
3. **Confounded main comparisons**: The evaluation of CoPatch against key baselines such as HybridGL and Global-Local is unfair. CoPatch is benchmarked using mask proposals from Mask2Former, while the baselines rely on proposals from SAM or FreeSOLO. Since the choice of mask generator significantly influences performance, this discrepancy prevents a fair assessment. While the appendix (Section C.2) attempts to address this issue, the analysis is undermined by the use of different evaluation metrics: the results in Table 1 are reported in mIoU, whereas Table 6 uses oIoU, rendering the tables incomparable. To ensure a fair and transparent evaluation, the data from Tables 1, 6, and 7 should be reorganized into a unified analysis that uses a consistent metric across all methods. This direct comparison is crucial for understanding the true performance of CoPatch and should be presented in the main body of the paper, either in the primary results table or as a dedicated ablation study, rather than relegated to the appendix.
4. **Lack of discussion of relevant works**: The paper omits discussion and comparison with recent, highly relevant works, a significant oversight for a SOTA-focused paper. These works should be addressed in the related work section and included in the Table 2 comparisons.
   1. *Feature Design for Bridging SAM and CLIP* (Ito et al., WACV 2025): a trained CLIP↔SAM bridge that mirrors CoPatch's proposal + CLIP scoring pipeline.
   2. *Mask Grounding for Referring Image Segmentation* (Chng et al., CVPR 2024): a supervised method that aligns word tokens to pixels, pursuing the same fine-grained text-to-pixel goal as CoPatch.
5. **Minor nits**:
   1. Figure 13 caption: "CLIP ViT-B/316" is a typo for "ViT-B/16".
   2. Consistency: the method name should be unified (e.g., CoPatch without a dash in the title vs. Co-Patch in Figures 5 and 7).

Questions:
See weaknesses.

EditLens Prediction: Lightly AI-edited
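
Below is a minimal sketch of the two ablation toggles requested in this review (negation on/off, clustering vs. raw-CoMap scoring). It assumes the sign flip is applied to intermediate patch tokens before CLIP's visual projection, as the reviews describe; the function names and the scoring rules are illustrative, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def build_comap(patch_tokens, visual_proj, text_emb, negate=True):
    """Patch-text similarity map with the sign-flip ('negation trick') as a toggle.

    patch_tokens: (N, D_vis) intermediate-layer patch tokens (CLS token removed).
    visual_proj:  (D_vis, D_txt) CLIP visual projection matrix.
    text_emb:     (D_txt,) normalized hybrid text feature.
    negate:       flip the patch-token sign before projection (the on/off ablation).
    """
    tokens = -patch_tokens if negate else patch_tokens
    feats = F.normalize(tokens @ visual_proj, dim=-1)
    return feats @ text_emb                          # (N,) similarity per patch

def score_mask(comap_2d, mask, use_clustering=True, delta=0.6):
    """Score one candidate mask; use_clustering=False gives the 'raw CoMap'
    baseline the review asks about (mean similarity inside the mask)."""
    if not use_clustering:
        return comap_2d[mask].mean()
    region = comap_2d >= delta                       # stand-in for clustering
    inter = (mask & region).sum().float()
    union = (mask | region).sum().clamp(min=1).float()
    return inter / union
```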