ICLR 2026 - Reviews

Reviews

Summary Statistics

EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated | 1 (25%) | 6.00 | 3.00 | 2101
Heavily AI-edited | 1 (25%) | 6.00 | 2.00 | 4532
Moderately AI-edited | 0 (0%) | N/A | N/A | N/A
Lightly AI-edited | 1 (25%) | 8.00 | 3.00 | 3834
Fully human-written | 1 (25%) | 2.00 | 5.00 | 1777
Total | 4 (100%) | 5.50 | 3.25 | 3061
Individual Reviews
Title: Benchmarking Open-Set Recognition Beyond Vision-Language Pre-training

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper presents a timely and thorough investigation into Open-Set Recognition (OSR) in the context of fine-tuned Vision-Language Models (VLMs). The authors compellingly argue that OSR remains a critical challenge even with powerful VLMs, especially when downstream task granularity misaligns with pre-training supervision. The paper studies the OSR problem by constructing OpenVL-Bench along four key axes and delivers experiments and analysis on CLIP-based and diffusion-based methods.

Strengths:
1. The paper is well organized.
2. The benchmark is meaningful for VLMs.

Weaknesses:
1. The difference between the proposed benchmark and existing OSR benchmarks should be analyzed in detail. Why can existing OSR benchmarks not adequately evaluate VLMs for OSR? The necessity of the proposed benchmark needs to be demonstrated.
2. Since there are various fine-tuning methods for CLIP-based VLMs beyond those in Fig. 2(a)(b), it is unclear whether the study of VLMs in OSR is affected by the choice of fine-tuning method, which could lead to a biased analysis.
3. An analysis of the hardness of the proposed benchmark, and of the relation between the methods and that hardness, should be provided.
4. Discriminative models are known to be poorly calibrated, which is a root cause of their unreliable confidence scores for OSR. The observed instability of CLIP-based methods could be linked to this. A brief discussion of model calibration and of how the likelihood-based approach might lead to better-calibrated scores would add depth to the analysis (see the calibration sketch after this review).
5. Does label granularity (fine vs. coarse) affect CLIP-based and diffusion models differently? Results on iNaturalist (genus-level) vs. VisA (binary normal/defective) are presented but not analyzed for granularity-specific trends.

Questions:
See the Weaknesses.

EditLens Prediction: Fully human-written
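For concreteness, the calibration analysis suggested in weakness 4 could be quantified with a standard expected calibration error (ECE) over the closed-set softmax confidences. The sketch below is a generic illustration, not code from the paper under review; `probs` and `labels` are placeholder inputs.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE over equal-width confidence bins: the population-weighted gap
    between top-1 confidence and top-1 accuracy in each bin.

    probs  : (N, C) predicted class probabilities (e.g. CLIP softmax scores).
    labels : (N,)   ground-truth class indices.
    """
    confidences = probs.max(axis=1)                 # top-1 confidence per sample
    predictions = probs.argmax(axis=1)              # top-1 predicted class
    accuracies = (predictions == labels).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap              # weight the gap by bin population
    return ece
```

A well-calibrated model would show small per-bin gaps; a large ECE on the closed-set classes is the kind of evidence the reviewer suggests connecting to unreliable open-set confidence scores.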
Title: Benchmarking Open-Set Recognition Beyond Vision-Language Pre-training

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces OpenVL-Bench, a large-scale benchmark designed to systematically evaluate the Open-Set Recognition (OSR) capabilities of Vision-Language Models (VLMs) under few-shot fine-tuning. The authors compare two paradigms for adapting VLMs to OSR: discriminative approaches (e.g., CLIP-based fine-tuning with prompt learning and adapters) and likelihood-based approaches (in particular, the proposed SD-LRT, a Stable Diffusion-based Likelihood Ratio Test). Their results show that while discriminative CLIP-based methods perform well in closed-set classification, they exhibit unstable and often unreliable open-set behavior. In contrast, SD-LRT demonstrates stronger robustness and consistency across datasets and shot numbers, especially in moderate- and high-shot regimes. The paper offers a thorough evaluation across 60 OSR tasks spanning diverse domains and label granularities, contributing valuable insights into the adaptability and failure modes of modern VLMs.

Strengths:
1. OpenVL-Bench is a significant contribution, covering fine-grained and domain-specific datasets (FGVCAircraft, iNaturalist, Plant Disease, and VisA) with multiple difficulty and granularity settings.
2. The SD-LRT method effectively adapts generative diffusion models for OSR, leveraging likelihood-based reasoning rather than discriminative similarity. This provides a new probabilistic perspective on open-set evaluation within the VLM context.
3. The experimental results are extensive and well supported. The authors systematically vary label similarity (intra- vs. inter-family), sample sizes (1–16 shots), and dataset domains, and provide comprehensive evaluations using standard OSR metrics such as AUROC and macro F1-score.

Weaknesses:
1. The likelihood-based SD-LRT method is computationally expensive, limiting its applicability in real-world scenarios. In particular, although the authors provide the formulation for computing L_DM in Equation (6), they do not specify the exact value of T (e.g., whether T = 1000 or another value). Moreover, the paper does not analyze how the choice of T influences the final results, which is important for understanding the efficiency–accuracy trade-off.
2. While various diffusion models exist, the authors only report results based on Stable Diffusion 2.0. It remains unclear whether the proposed SD-LRT framework generalizes to other diffusion architectures.
3. In the implementation section, the authors state that they used the model "with and without LoRA," but it is not clearly explained what this distinction entails. The experimental results also do not analyze how the LoRA rank affects performance.
4. Several important training configurations are not reported, such as the number of training iterations, the learning rate, and the optimizer type.

Questions:
1. The description of Equation (7) is somewhat confusing. The equation seems to imply that if the class-conditional generation result is closer to the unconditional generation, this indicates a weaker match between the given class condition and the input image. However, it is unclear how the authors justify that the unconditional generation result is necessarily close to that obtained under an incorrect class condition. In other words, as the denoising process proceeds, the diffusion model may already be capable of producing high-quality reconstructions even without any conditioning at the later timesteps. How do the authors account for or eliminate the influence of this effect in their formulation? (A sketch of the conditional-vs-unconditional scoring this question refers to follows this review.)
2. What changes occur in the image generation quality of SD-LRT before and after LoRA fine-tuning? The authors are encouraged to provide both qualitative and quantitative evaluations.
3. For likelihood-based approaches like SD-LRT, if new classes are later labeled and added to training, can the model be fine-tuned without suffering from catastrophic forgetting?

EditLens Prediction: Lightly AI-edited
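To make the conditional-vs-unconditional comparison in question 1 concrete, here is a minimal sketch, not the authors' code, of how a diffusion-loss likelihood-ratio score of this kind could be computed with the Hugging Face diffusers API. The checkpoint name, the timestep grid, and the sign convention of the score are assumptions; the paper's Equations (6) and (7) define the actual SD-LRT formulation.

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; the paper reports Stable Diffusion 2.0 but not the exact repo id.
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base").to("cuda")

@torch.no_grad()
def diffusion_loss(latents, prompt, timesteps):
    """Mean noise-prediction error of the UNet for one text condition.

    latents: VAE-encoded image, e.g.
             pipe.vae.encode(img).latent_dist.sample() * pipe.vae.config.scaling_factor
    """
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids.to(pipe.device)
    text_emb = pipe.text_encoder(ids)[0]

    losses = []
    for t in timesteps:
        noise = torch.randn_like(latents)
        t_batch = torch.tensor([t], device=pipe.device)
        noisy = pipe.scheduler.add_noise(latents, noise, t_batch)
        pred = pipe.unet(noisy, t_batch, encoder_hidden_states=text_emb).sample
        losses.append(F.mse_loss(pred, noise).item())
    return sum(losses) / len(losses)

def conditional_vs_unconditional_score(latents, class_prompt,
                                        timesteps=range(100, 1000, 100)):
    """Likelihood-ratio-style score: unconditional loss minus class-conditional loss.
    Larger values suggest the class condition explains the image better than no condition;
    the timestep grid here is arbitrary (the reviewer's question about T applies)."""
    return (diffusion_loss(latents, "", timesteps)          # empty prompt = unconditional
            - diffusion_loss(latents, class_prompt, timesteps))
```

Sweeping the timestep grid in such a sketch is also one way to probe the reviewer's point that late-timestep reconstructions may be easy even without conditioning.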
Title: Benchmarking Open-Set Recognition Beyond Vision-Language Pre-training

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper revisits open-set recognition (OSR) in the era of vision–language models (VLMs). Although VLMs exhibit open-vocabulary capabilities, the authors show that few-shot fine-tuning often breaks their ability to reject unknown classes. To systematically study this issue, they introduce OpenVL-Bench, a benchmark with 60 OSR tasks across four datasets, varying label granularity, semantic difficulty, and sample size. The paper compares discriminative CLIP-based fine-tuning (modeling $P(Y \mid X)$) with a generative diffusion-based approach (modeling $P(X \mid Y)$) using a likelihood-ratio test (SD-LRT). Experiments demonstrate that SD-LRT yields more stable and semantically aligned open-set behavior than discriminative methods, particularly when the number of shots exceeds four. The study concludes that OSR remains a critical, unresolved challenge despite modern open-vocabulary pre-training.

Strengths:
- The paper addresses an important and underexplored question: whether open-vocabulary pre-training in vision–language models (VLMs) inherently solves open-set recognition (OSR). The authors provide a clear motivation and reveal that few-shot fine-tuning can compromise a model's ability to reject unknown classes, which is both practically and theoretically insightful.
- The proposed OpenVL-Bench systematically covers 60 OSR tasks across four datasets with varying label granularity, semantic hardness, and sample regimes. The benchmark design is transparent, well documented, and supports fair cross-method comparison, making it a valuable community resource.
- The study contrasts discriminative CLIP-based fine-tuning (modeling $P(Y \mid X)$) with a generative diffusion-based approach (modeling $P(X \mid Y)$) using a likelihood-ratio test (SD-LRT). The comparison is both conceptually grounded and empirically thorough, providing clear evidence for the stability advantages of likelihood-based scoring.
- Results across multiple datasets demonstrate consistent trends: SD-LRT yields smoother AUROC improvements and semantically aligned rejection behavior. The paper also provides qualitative visualizations that intuitively support its quantitative conclusions. Overall, the work is well written, logically organized, and convincingly argued.

Weaknesses:
- The proposed SD-LRT requires training a separate LoRA module for each class and performs diffusion-based inference, which is substantially more expensive than CLIP-based fine-tuning. However, the paper does not provide quantitative metrics such as training/inference time, GPU memory consumption, or computational complexity (e.g., FLOPs). Moreover, scalability with respect to the number of classes remains unclear, leaving the practical deployability of SD-LRT uncertain.
- The comparison focuses mainly on earlier CLIP-based baselines such as CoCoOp, MaPLe, and Tip-Adapter, while omitting more recent works that specifically target open-set robustness after few-shot tuning. Notably, "ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection" (Bai et al., CVPR 2024) and "Out-of-Distribution Detection with Negative Prompts" (Nie et al., ICLR 2024) have demonstrated strong improvements in few-shot OOD detection. Including these methods in the OpenVL-Bench evaluation, or at least providing a conceptual discussion of how SD-LRT complements or differs from them, would significantly strengthen the empirical comparison.
- Diffusion models are inherently trained for pixel-level generative reconstruction, which allows them to capture richer fine-grained visual cues than CLIP, whose pre-training objective focuses on image-level alignment. As a result, part of the observed improvement of SD-LRT may stem from this intrinsic modeling advantage rather than from the proposed likelihood-ratio mechanism itself. The paper does not explicitly analyze or control for this factor, leaving open the question of how much of the gain comes from the diffusion backbone versus the LRT formulation.

Questions:
- In Eq. (1), the discriminative fine-tuning objective combines the original and the updated visual encoder representations, i.e., the input to the text encoder is the residual connection $\alpha V_{\theta}(x) + V(x)$. However, Figure 2(a) seems to show only the updated encoder $V_{\theta}(x)$ being used, without the skip connection from the frozen encoder. Could the authors clarify whether the residual fusion with the original encoder is actually applied during training and inference, and, if so, confirm that the implementation matches Eq. (1)? (A sketch contrasting the two readings follows this review.)

EditLens Prediction: Heavily AI-edited
Title: Benchmarking Open-Set Recognition Beyond Vision-Language Pre-training

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The work introduces OpenVL-Bench, a suite of 60 OSR tasks with varying label granularity and closed/open-set similarity, and compares discriminative fine-tuning (prototype/similarity-based) with likelihood-based approaches using diffusion VLMs. The proposed SD-LRT uses a likelihood ratio based on conditional vs. unconditional Stable Diffusion losses, showing strong OSR performance across diverse domains, including industrial anomaly detection. The paper also notes that in extreme low-shot settings (1–2 shots), CLIP-style discriminative methods can win.

Strengths:
1. The benchmarking study offers a comprehensive and balanced evaluation of discriminative versus likelihood-based open-set recognition, with practical insights for low-shot learning scenarios.
2. Its methodology is clearly explained: the proposed SD-LRT approach is grounded in Neyman–Pearson optimality derived from likelihood ratio tests, and integrates seamlessly into diffusion-based vision-language models.
3. The method also delivers strong performance on industrial anomaly detection, as shown on the VisA dataset, accompanied by interpretable heatmap visualizations.

Weaknesses:
1. The computational cost of diffusion-based likelihood scoring remains a concern. The study does not provide end-to-end latency measurements on current hardware, nor does it explore more efficient approximations.
2. SD-LRT relies on large pre-trained diffusion models, yet their memory footprint and deployment constraints receive limited discussion.

Questions:
1. Will the OpenVL-Bench tasks, data splits, and implementation code (including scripts for SD-LRT) be made publicly available? If so, what is the expected timeline?
2. It would also be helpful to see throughput and latency comparisons between SD-LRT and discriminative baselines under identical hardware settings, as well as an analysis of whether fewer diffusion steps could balance speed and accuracy.
3. In the context of 1–2 shot learning, could a hybrid approach that uses a discriminative model for short-listing and SD-LRT for re-ranking improve performance? Empirical validation would make this idea more compelling. (A toy sketch of such a two-stage pipeline follows this review.)

EditLens Prediction: Fully AI-generated
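As a concrete illustration of the hybrid idea in question 3, and not something evaluated in the paper, a two-stage open-set pipeline could look like the sketch below; the scorer interface, the top-k size, and the rejection threshold are all hypothetical.

```python
import numpy as np

def hybrid_osr_predict(clip_scores, sd_lrt_scorer, class_names, k=5, reject_threshold=0.0):
    """Two-stage open-set prediction: a cheap discriminative short-list,
    then likelihood-ratio re-ranking of only the top-k candidates.

    clip_scores   : (num_classes,) similarity scores from a CLIP-style model.
    sd_lrt_scorer : callable(class_name) -> likelihood-ratio score for the current image
                    (e.g. a conditional-vs-unconditional diffusion-loss score).
    Returns the predicted class name, or None if the best candidate is rejected as unknown.
    """
    shortlist = np.argsort(clip_scores)[::-1][:k]            # top-k closed-set candidates
    rescored = {i: sd_lrt_scorer(class_names[i]) for i in shortlist}
    best = max(rescored, key=rescored.get)
    if rescored[best] < reject_threshold:                     # open-set rejection
        return None
    return class_names[best]
```

Re-scoring only k candidates instead of all classes would also speak to the latency concern in weakness 1, since the diffusion model is invoked k times per image rather than once per class.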