ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 0 (0%)   | N/A        | N/A            | N/A                |
| Heavily AI-edited    | 1 (25%)  | 4.00       | 3.00           | 1352               |
| Moderately AI-edited | 1 (25%)  | 6.00       | 3.00           | 1687               |
| Lightly AI-edited    | 1 (25%)  | 4.00       | 4.00           | 1920               |
| Fully human-written  | 1 (25%)  | 4.00       | 3.00           | 2196               |
| Total                | 4 (100%) | 4.50       | 3.25           | 1789               |
All four reviews below concern the same submission: C3-OWD: A Curriculum Cross-modal Contrastive Learning Framework for Open-World Detection.
Review 1

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes C3-OWD, a curriculum cross-modal contrastive learning framework that integrates the advantages of a two-stage paradigm. While the method is evaluated on datasets such as FLIR, COCO, and LVIS, the limited scope of the experimental design calls into question the broader domain relevance and generalizability of the findings.

Strengths:
1. The authors provide a detailed theoretical analysis of the EMA mechanism, with explicit bounds on parameter lag, function consistency, and loss preservation, which strengthens the credibility of the catastrophic-forgetting mitigation claim.
2. The authors provide a thorough ablation study dissecting the contribution of each major model component.
3. The results show that the paper's method achieves leading performance in both multimodal robustness and open-vocabulary detection.

Weaknesses:
1. The experimental support is insufficient. Although strong performance is reported on COCO and LVIS, these datasets mainly cover standard scenarios and do not verify generalization in extreme scenarios. The current experiments rely primarily on a single robustness dataset and lack validation on other extreme-environment datasets, making it difficult to fully demonstrate the method's robustness across different extreme scenarios.
2. The computational complexity of the proposed framework is a concern. The combination of bidirectional RWKV blocks, MoCo-style queues, and multi-stage training likely incurs significant overhead.
3. The primary contribution of this work lies in the effective integration of several existing techniques (curriculum learning, cross-modal contrastive objectives, RWKV-based fusion, and EMA).

Questions:
1. Could the authors provide additional experiments on both robustness and generalizability? It would be valuable to include results validating the proposed method on object detection datasets under various adverse conditions. Common extreme-environment object detection datasets include DAWN, RTTS, VisDrone, and SUIM.
2. Could the authors provide detailed evidence that the EMA mechanism indeed preserves Stage 1 performance during Stage 2 training? The paper offers only a theoretical derivation without empirical evidence.

EditLens Prediction: Fully human-written
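For context on the EMA mechanism this review asks for evidence about, here is a minimal PyTorch sketch of cross-stage EMA weight preservation. The function names, stand-in model, and decay value are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of cross-stage EMA weight preservation (illustrative only;
# names and the decay value are assumptions, not the paper's code).
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Blend live Stage-2 weights into a slow-moving copy of the network.

    With decay close to 1, the EMA weights stay near the Stage-1 optimum
    while Stage-2 gradients move the live weights, which is the usual
    argument behind parameter-lag bounds of the kind the review mentions.
    """
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage: initialize the EMA copy from the Stage-1 checkpoint, then call
# ema_update(ema_model, model) after every Stage-2 optimizer step.
model = torch.nn.Linear(16, 4)          # stand-in for the detector
ema_model = copy.deepcopy(model)        # updated manually, never by autograd
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(10):                     # dummy Stage-2 steps
    loss = model(torch.randn(8, 16)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(ema_model, model)
```

Only the live weights receive Stage-2 gradients; the EMA copy trails them slowly, and the review's request is essentially for empirical confirmation that this trailing copy retains Stage-1 accuracy.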
Review 2

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes C3-OWD, a curriculum cross-modal contrastive learning framework for open-world object detection. The approach adopts a two-stage training strategy:
1. Multi-modal robustness enhancement through pretraining on RGBT datasets to improve feature stability.
2. Vision-language generalization alignment through semantic enhancement, text-modulated deformable attention, and a bi-momentum contrastive alignment mechanism.

Strengths:
The paper identifies a meaningful gap between robustness (RGBT-based detection) and open-world generalization (vision-language models) and proposes a curriculum-style framework to bridge the two. The method is evaluated on three benchmarks with solid ablation studies confirming the contribution of each component.

Weaknesses:
1. The core ideas are all adaptations of existing paradigms (Deformable-DETR + CLIP + MoCo). The contribution seems incremental, with limited conceptual advancement.
2. The method section is dense and symbol-heavy, especially in Stage 2. Many notations are introduced abruptly without sufficient intuition, making the paper difficult to follow for readers not already familiar with Deformable-DETR.
3. The training pipeline appears computationally expensive, but no runtime or cost analysis is reported.
4. Several hyperparameters (e.g., IoU threshold = 0.3) are presented without explanation or ablation.

Questions:
See weaknesses.

EditLens Prediction: Heavily AI-edited
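The "MoCo-style" queue and "bi-momentum contrastive alignment" the reviews refer to build on a first-in-first-out queue of negative features. Below is a minimal single-queue sketch with an InfoNCE loss; the dimensions, queue size, and temperature are generic assumptions, not the paper's settings.

```python
# Minimal sketch of a MoCo-style feature queue with an InfoNCE loss
# (illustrative; all hyperparameters here are assumptions).
import torch
import torch.nn.functional as F

class FeatureQueue:
    """FIFO queue of normalized key features used as negatives."""
    def __init__(self, dim=256, size=4096):
        self.queue = F.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):
        n = keys.shape[0]
        idx = torch.arange(self.ptr, self.ptr + n) % self.queue.shape[0]
        self.queue[idx] = keys
        self.ptr = int((self.ptr + n) % self.queue.shape[0])

def info_nce(query, key, queue, tau=0.07):
    """InfoNCE: each query's positive is its matching key; negatives come
    from the queue. The positive logit is placed at index 0."""
    query = F.normalize(query, dim=1)
    key = F.normalize(key, dim=1)
    pos = (query * key).sum(dim=1, keepdim=True)   # (B, 1) positive logits
    neg = query @ queue.queue.t()                  # (B, K) negative logits
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(query.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)

# One training step: score against the current queue, then enqueue new keys.
q, k = torch.randn(8, 256), torch.randn(8, 256)
queue = FeatureQueue()
loss = info_nce(q, k, queue)
queue.enqueue(F.normalize(k, dim=1))
```

In MoCo proper, the keys come from a momentum-updated encoder, i.e. the same EMA idea sketched after Review 1; how the paper's "bi-momentum" variant distributes this slow-moving state across the two modalities is a detail of the submission itself, not of this sketch.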
Review 3

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes C3-OWD, a two-stage framework that aims to combine environmental robustness (via RGB-Thermal fusion) with open-vocabulary object detection. In Stage 1, visible and thermal features are fused through VRWKV blocks on the FLIR dataset to enhance robustness under varying illumination. Stage 2 integrates CLIP-based vision-language alignment and momentum contrastive learning on COCO and LVIS to support novel-category detection. An exponential moving average (EMA) mechanism is introduced to reduce catastrophic forgetting across stages, with accompanying theoretical bounds. Experiments report 80.1 AP50 on FLIR, 48.6 AP50_Novel on OV-COCO, and 35.7 mAPr on OV-LVIS. While the results on OV-COCO outperform some ResNet-50 baselines, the evidence does not fully support the claim of "breaking the robustness–generalization trade-off": the method performs below MMFN (81.8) on FLIR and CoDet (37.0) on OV-LVIS, suggesting it lies along the same trade-off frontier rather than surpassing it.

Strengths:
The experimental coverage across FLIR (robustness), OV-COCO, and OV-LVIS (generalization) is appropriate and shows an effort to evaluate both aspects of the claimed contribution. The theoretical analysis of the EMA mechanism provides mathematical grounding, which adds credibility to the catastrophic-forgetting mitigation claim. The model achieves competitive results on OV-COCO (48.6 AP50_Novel), improving over earlier ResNet-50-based approaches such as CLIPSelf (44.3). The inclusion of algorithmic details and ablations in Tables 3 and 4 shows an attempt at transparency and reproducibility, though key omissions remain.

Weaknesses:
Regarding (i) "where L is the sequence length ..." and (ii) "We then perform L rounds ...": do these two uses of L refer to the same quantity, or different ones? Also, what is the value of L?

Questions:
Please see the strengths and weaknesses above. In addition, how should the ablation with and without the EMA mechanism be interpreted?

EditLens Prediction: Lightly AI-edited
Review 4

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper presents a curriculum cross-modal contrastive learning framework that addresses the dual challenges of poor generalization and limited robustness in real-world object detection. The approach integrates the strengths of visible-infrared and open-world detection through a two-stage training strategy: Stage 1 enhances robustness via RGBT pretraining, while Stage 2 improves generalization through vision-language alignment. An exponential moving average mechanism is introduced to mitigate catastrophic forgetting between stages, theoretically ensuring performance preservation with bounded parameter drift. Experiments demonstrate strong performance on both robustness and diversity benchmarks.

Strengths:
The authors propose a cross-modal curriculum learning framework that unifies RGBT robustness and open-vocabulary generalization, dynamically balancing multi-modal information through progressive learning.

Weaknesses:
1. The proposed framework includes multiple functional modules, but their motivations are not clearly explained. For example, it remains unclear what specific problem the bidirectional attention mechanism in Stage 1 addresses and why it offers advantages over the original version. The description of the method is overly brief and difficult to comprehend.
2. In line 325, what are the specific manifestations of catastrophic forgetting? What causes it, and how does the exponential moving average mitigate this issue?
3. In Table 2, why are the baselines presented inconsistently on the two sides (e.g., CoDet)? For a fairer comparison of robustness and generalization, it would be better to provide experimental results for both datasets.

Questions:
Please refer to the Weaknesses.

EditLens Prediction: Moderately AI-edited
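On Weakness 2's question of how EMA mitigates forgetting: a standard back-of-the-envelope derivation (ours, not necessarily the paper's exact bound) shows the EMA weights lag the live weights by a bounded amount. Assume the update $\bar\theta_t = \alpha\,\bar\theta_{t-1} + (1-\alpha)\,\theta_t$ with $\bar\theta_0 = \theta_0$ and bounded per-step drift $\|\theta_t - \theta_{t-1}\| \le \Delta$:

```latex
% Illustrative EMA lag bound under the stated assumptions.
\begin{aligned}
\bar\theta_t - \theta_t
  &= \alpha\,(\bar\theta_{t-1} - \theta_{t-1}) + \alpha\,(\theta_{t-1} - \theta_t), \\
\|\bar\theta_t - \theta_t\|
  &\le \alpha\,\|\bar\theta_{t-1} - \theta_{t-1}\| + \alpha\,\Delta
   \;\le\; \alpha\,\Delta \sum_{k=0}^{t-1} \alpha^{k}
   \;<\; \frac{\alpha\,\Delta}{1-\alpha}.
\end{aligned}
```

The larger $\alpha$ is, the more slowly the EMA copy moves, keeping it close to the Stage-1 solution early in Stage 2; this is the generic intuition behind "bounded parameter drift," though the paper's specific bounds on function consistency and loss preservation would need to be checked against its own assumptions.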