ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 3 (75%) | 4.00 | 4.00 | 2169 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 1 (25%) | 6.00 | 2.00 | 2454 |
| Total | 4 (100%) | 4.50 | 3.50 | 2240 |
Review 1
Title: Towards Adversarially Robust CLIP: A Hierarchical Model Fusion Method Using Optimal Transport
Ratings: Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 3 (fairly confident)
EditLens Prediction: Fully AI-generated

Summary: This paper proposes HOT-CLIP (Hierarchical Optimal Transport Fusion for CLIP), a method to enhance the adversarial robustness of vision-language models. The approach first trains diverse submodels using different adversarial attacks (FGSM, PGD, MIM) and textual prompts, then fuses them using a two-level hierarchical optimal transport method. The first level (intra-attack) fuses models within the same attack family but with different prompts, while the second level (inter-attack) combines these fused models across different attacks. Experiments on image classification, VQA, and image captioning demonstrate improvements in adversarial robustness while maintaining competitive clean accuracy.

Strengths:
- First work to systematically apply optimal-transport-based model fusion to adversarial robustness in multimodal models, addressing a well-motivated problem.
- Experiments span multiple tasks (classification, VQA, captioning) and datasets, with consistent improvements shown across different perturbation budgets.
- The hierarchical fusion strategy reduces memory requirements from O(|A||T|·U) to O(max{|A|,|T|}·U), making the approach more deployable.
- Strong empirical results: relative improvements of ~2.6% (classification), ~20.4% (VQA), and ~16.5% (captioning) in robust accuracy over baselines.

Weaknesses:
- The paper could use more theoretical analysis or deeper insight into when and why the geometric alignment via OT succeeds for adversarially diverse models.
- While inference is efficient, training requires multiple submodels.
- AutoAttack is the only adversarial evaluation method used.

Questions:
- Why does hierarchical fusion work better than direct fusion?
- What is the total training time/cost compared to baselines? Is this practical for larger models?
- The method shows noticeable drops in clean accuracy compared to vanilla CLIP (74.9 → 69.9 on ImageNet). Is this tradeoff acceptable?
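
To make the two-level schedule described in the review above concrete, here is a minimal sketch of the intra-attack / inter-attack fusion order. It is not the authors' implementation: the attack names, prompt templates, weight shapes, and `load_submodel` helper are placeholders, and plain weight averaging stands in for the OT alignment step so the control flow stays self-contained.

```python
import numpy as np

def average_fuse(models):
    """Stand-in for OT fusion: average corresponding weight tensors.
    HOT-CLIP would first align neurons across models with optimal transport."""
    return {name: np.mean([m[name] for m in models], axis=0) for name in models[0]}

def load_submodel(attack, prompt, rng):
    """Placeholder for loading an adversarially fine-tuned CLIP submodel."""
    return {"W1": rng.normal(size=(8, 4)), "W2": rng.normal(size=(4, 8))}

attacks = ["FGSM", "PGD", "MIM"]                    # |A| attack families
prompts = ["a photo of a {}", "a drawing of a {}"]  # |T| prompt templates
rng = np.random.default_rng(0)

# Level 1 (intra-attack): fuse across prompts within each attack family.
# Only one attack group's submodels are materialized at a time.
per_attack = []
for attack in attacks:
    group = [load_submodel(attack, prompt, rng) for prompt in prompts]
    per_attack.append(average_fuse(group))

# Level 2 (inter-attack): fuse the per-attack models into a single robust model.
fused = average_fuse(per_attack)
print({name: w.shape for name, w in fused.items()})
```

Because only one attack group is held in memory before the level-2 fusion, peak memory scales with roughly max(|A|, |T|) submodels rather than |A|·|T|, which is the flavor of the memory reduction the review mentions.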
Review 2
Title: Towards Adversarially Robust CLIP: A Hierarchical Model Fusion Method Using Optimal Transport
Ratings: Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 2 (fair) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 2 (willing to defend, but may not have understood central parts)
EditLens Prediction: Fully human-written

Summary: The paper proposes HOT-CLIP, a new framework to enhance the adversarial robustness of large vision-language models such as CLIP. Standard adversarial training often overfits to specific attack types, and ensembling multiple adversarially trained submodels improves robustness but is computationally expensive. To address this trade-off, the authors construct diverse CLIP submodels by varying both attack methods and textual prompts, and then fuse them using a two-stage hierarchical optimal transport method. Experiments on zero-shot image classification, image captioning, and visual question answering show that HOT-CLIP substantially improves adversarial robustness while maintaining competitive performance on clean data.

Strengths:
1. The proposed hierarchical strategy effectively alleviates the neuron-misalignment issues that often arise when fusing diverse models, outperforming naive averaging and single-level OT methods.
2. Although multiple adversarial submodels are required during training, only a single fused model is needed for inference, significantly improving efficiency.
3. Comprehensive experiments demonstrate that the proposed method remains robust across multiple tasks and attacks while maintaining competitive performance on clean data.
4. The fused visual encoder can be directly used in multimodal LLMs such as LLaVA-1.5 and OpenFlamingo.

Weaknesses:
1. The method has high complexity: as shown in Appendix C.1, hierarchical OT fusion iteratively computes the cross-layer transfer matrix, aligns neurons, and averages the aligned weights to obtain the fused representation, which limits its practical value.
2. Although inference remains efficient, the method requires training multiple adversarial submodels under various attacks and prompts, resulting in significant computational and resource costs during training.
3. The applicability appears limited to CLIP. The authors demonstrate that the fused visual encoder can be plugged into LLaVA-1.5 and OpenFlamingo, but its transferability to other multimodal architectures remains undiscussed.

Questions:
1. It would be helpful if the authors could provide an analysis of computational resources (e.g., training time, GPU memory) to better evaluate the practicality of the proposed method.
2. Does this method still apply to other LVLMs, and is additional fine-tuning required? Or can the proposed fusion be applied directly to a purely visual encoder?
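
The complexity that the first weakness above points to comes from the neuron-alignment step. Below is a small, self-contained illustration of OT-based neuron alignment between two weight matrices, using a hand-rolled entropy-regularized Sinkhorn solver; the layer sizes, regularization strength, and barycentric-projection fusion are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=200):
    """Entropy-regularized OT between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan

def ot_fuse_layer(W_ref, W_other):
    """Align the neurons (rows) of W_other to W_ref with OT, then average."""
    cost = ((W_ref[:, None, :] - W_other[None, :, :]) ** 2).sum(-1)
    cost /= cost.max()                      # normalize for numerical stability
    T = sinkhorn(cost)
    # Barycentric projection: each reference neuron gets a weighted mix of
    # W_other's neurons according to the transport plan.
    W_aligned = (T / T.sum(axis=1, keepdims=True)) @ W_other
    return 0.5 * (W_ref + W_aligned)

rng = np.random.default_rng(0)
W_a = rng.normal(size=(16, 32))                                    # layer from submodel A
W_b = W_a[rng.permutation(16)] + 0.01 * rng.normal(size=(16, 32))  # permuted near-copy
print("OT fusion error:    ", np.linalg.norm(ot_fuse_layer(W_a, W_b) - W_a))
print("naive average error:", np.linalg.norm(0.5 * (W_a + W_b) - W_a))
```

On this toy pair, where one matrix is a neuron-permuted near-copy of the other, the OT alignment recovers the correspondence that naive averaging destroys, which is the usual motivation for OT fusion over plain weight averaging.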
Review 3
Title: Towards Adversarially Robust CLIP: A Hierarchical Model Fusion Method Using Optimal Transport
Ratings: Soundness: 2 (fair) | Presentation: 2 (fair) | Contribution: 2 (fair) | Rating: 2 (reject) | Confidence: 5 (absolutely certain)
EditLens Prediction: Fully AI-generated

Summary: The paper proposes HOT-CLIP, a hierarchical optimal transport (OT) based fusion framework to improve the adversarial robustness of multimodal models such as CLIP. The method first performs intra-attack fusion to align submodels within the same attack type, then inter-attack fusion to combine across attack families. The goal is to achieve a balance between robustness and efficiency without increasing inference-time cost. Experiments on several vision–language tasks show improvements in adversarial robustness while maintaining clean accuracy.

Strengths:
1. The hierarchical two-level OT fusion is clearly described.
2. The empirical results are clearly reported and show consistent improvement on benchmarks.

Weaknesses:
1. There is a lack of novelty in the proposed work. The proposed method is largely a direct application of existing optimal transport (OT) fusion techniques, with limited methodological innovation or new theoretical contribution.
2. There is no theoretical justification. The paper lacks formal analysis or theoretical guarantees explaining why the hierarchical OT fusion improves robustness or parameter alignment.
3. The computational analysis is missing. There is no discussion or experiment on computational cost, including the memory and runtime implications of the hierarchical fusion.
4. The insight is limited. The results, while positive, do not provide deeper understanding of why or when the method works, reducing the paper's scientific value.

Questions:
1. Could the authors include a discussion or measurement of training and fusion cost to support the claim of efficiency?
2. What is the theoretical motivation for using OT over simpler averaging or linear fusion methods?
3. How sensitive is the hierarchical OT fusion to the choice of submodels or the diversity of attacks?
Review 4
Title: Towards Adversarially Robust CLIP: A Hierarchical Model Fusion Method Using Optimal Transport
Ratings: Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good) | Rating: 4 (marginally below the acceptance threshold) | Confidence: 4 (confident, but not absolutely certain)
EditLens Prediction: Fully AI-generated

Summary: This paper tackles the problem of adversarial robustness in multimodal models, particularly CLIP. While CLIP achieves strong performance on vision-language tasks, it remains highly vulnerable to adversarial perturbations. To address this, the authors propose HOT-CLIP, a Hierarchical Optimal Transport based model fusion framework that enhances robustness without adding inference overhead. The method first constructs diverse adversarially trained CLIP submodels using different attack strategies and prompt templates. It then performs a two-level hierarchical fusion (language-level and vision-level) using optimal transport (OT) to align and merge model parameters effectively. This hierarchical OT fusion improves alignment among heterogeneous models, achieving better adversarial robustness while maintaining clean accuracy.

Strengths:
1. The proposed two-stage framework, comprising submodel generation and hierarchical optimal transport (OT) fusion, is conceptually clear and technically sound.
2. Unlike conventional ensemble-based defenses, HOT-CLIP does not introduce any additional computational overhead during inference. The fused model maintains the efficiency of a single model, which makes the method attractive for real-world deployment of large-scale vision-language systems such as CLIP.
3. The paper provides extensive experimental validation across multiple multimodal tasks, including zero-shot classification, image captioning, and visual question answering. The results consistently demonstrate that the proposed method improves adversarial robustness while preserving clean accuracy, supporting the method's general effectiveness.

Weaknesses:
1. Although the hierarchical structure helps control memory usage, the overall pipeline involves multiple rounds of adversarial training and OT optimization. This increases implementation complexity and computational cost, which may limit practical adoption.
2. All experiments are conducted on CLIP and its variants. The absence of results on other vision-language architectures (e.g., BLIP) leaves open the question of whether HOT-CLIP generalizes beyond the CLIP family.
3. The paper primarily focuses on empirical evidence. It lacks a deeper theoretical explanation of how optimal transport specifically contributes to parameter alignment and robustness enhancement. A stronger theoretical foundation would improve the paper's impact and clarity.
4. The evaluation is restricted to ℓ∞-bounded attacks (ε = 2/255 and 4/255) and does not consider other perturbation types such as ℓ2-norm attacks.

Comments: I prefer to give a borderline score (5); please see the weaknesses. The problem formulation is sound, and the challenge is well-motivated. However, the solution is mainly an engineering-level improvement built on existing fusion and alignment techniques, with limited theoretical support for its claimed effectiveness.
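
For context on the last weakness above: the ℓ∞ budgets 2/255 and 4/255 bound the maximum change to any single pixel (for inputs scaled to [0, 1]). The sketch below shows a generic one-step FGSM attack under such a budget; the PyTorch model and data are placeholder stand-ins, not CLIP or the paper's attack pipeline.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM: move x by eps in the sign of the loss gradient (l_inf ball)."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    x_adv = (x + eps * x.grad.sign()).clamp(0, 1)   # keep valid pixel range
    return x_adv.detach()

# Toy stand-in for an image classifier (not CLIP): flatten -> linear logits.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)            # four fake RGB images in [0, 1]
y = torch.randint(0, 10, (4,))
for eps in (2 / 255, 4 / 255):          # the budgets used in the evaluation
    delta = (fgsm(model, x, y, eps) - x).abs().max().item()
    print(f"eps = {eps:.4f}  ->  max per-pixel change = {delta:.4f}")
```

An ℓ2-bounded attack would instead constrain the overall Euclidean norm of the perturbation, which is the alternative threat model the reviewer asks about.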