ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (50%) | 5.00 | 4.00 | 5782 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 4.00 | 4.00 | 2257 |
| Fully human-written | 1 (25%) | 4.00 | 4.00 | 3567 |
| Total | 4 (100%) | 4.50 | 4.00 | 4347 |
Individual Reviews
Review 1

Title: Visual Sparse Steering (VS2): Unsupervised Adaptation for Image Classification via Sparsity-Guided Steering Vectors
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary:** This paper presents Visual Sparse Steering (VS2) and its retrieval-augmented variant, VS2++, to steer vision foundation models at test time. VS2 constructs a steering vector by upweighting sparse features extracted from a sparse autoencoder (SAE) trained on the model's internal activations. This label-free, test-time method is empirically shown to improve CLIP zero-shot classification across three datasets. Furthermore, in a relaxed setting where retrieval from an external unlabeled corpus is allowed, VS2++ constructs a steering vector from contrastive examples built using pseudo-labels, resulting in a large boost in classification accuracy. The authors also claim that VS2 includes a built-in reliability diagnostic not present in common steering vectors: the SAE's reconstruction loss is shown to correlate with classification performance.

**Strengths:**
- **Novel Approach:** The proposed VS2 is interesting: steering vectors are constructed by simply upweighting all sparse features, yet this still yields performance gains. Unlike conventional steering vectors, VS2 is a label-free method that avoids the need for contrastive data.
- **Consistent Empirical Improvements:** The proposed VS2++ method demonstrates consistent and notable empirical improvements over the strong CLIP zero-shot baseline across all three evaluated datasets (CIFAR-100, CUB-200, and Tiny-ImageNet).
- **Practical Reliability Diagnostic:** The paper also discusses a built-in reliability diagnostic, an overlooked but important issue in practical applications.

**Weaknesses:**
- **Overclaimed Scope of Contribution:** The paper's central claim is to "steer vision foundation models", but the empirical validation is exclusively limited to improving the zero-shot classification accuracy of CLIP models. There is no evidence that the method generalizes to other vision foundation models or to tasks beyond classification. This discrepancy makes the general claim of "steering" appear to be an overclaim. The method is more accurately framed as a test-time adaptation technique for CLIP zero-shot classification.
- **Misleading "Test-Time" Claim and Practicality Issues:** The method is presented as a "test-time" approach, but this is somewhat misleading. Its effectiveness hinges on a pretrained SAE that must be trained on in-domain activations from the target dataset's training split. This requirement has several critical weaknesses:
  - The method cannot work as a "plug-and-play" method, as it requires substantial, dataset-specific computation and full access to the (unlabeled) training set before any test sample can be processed.
  - The authors' own experiment with a "generalized SAE" (trained on the union of datasets) in Section 4.4 confirms this limitation: it fails to generalize and underperforms the baseline on CUB-200. This reinforces the method's strong dependency on in-domain data.
  - This "pre-training" step on the training set blurs the line between a true test-time method and a form of unsupervised domain adaptation.
- **Contradictory Premise for Data-Scarce Scenarios:** The chosen experimental setup, CLIP zero-shot classification, is most valuable in data-scarce (e.g., few-shot or long-tail) domains where training data is limited. However, these are precisely the scenarios where VS2/VS2++ would be least applicable.
  - VS2 would fail because there is insufficient in-domain data to train a reliable SAE.
  - VS2++ would be even more likely to fail, as it requires a large, high-quality retrieval corpus, which is unlikely to exist in a data-scarce setting.
- **Unvalidated Reliability Diagnostic:** The "built-in reliability diagnostic" is presented as a key advantage but is not empirically validated as a practical tool. The analysis in Section 4.4 and Table 3 only demonstrates a dataset-level correlation (i.e., CUB-200 has high average FVU and low accuracy). It provides no evidence that this works on a per-instance basis, which is necessary for it to function as a useful fallback mechanism.

**Questions:**
- **Generalizability:** Given the claim of "steering vision foundation models," could the authors elaborate on a concrete way to apply VS2 to (a) non-CLIP models like DINOv2 and (b) non-classification tasks like VQA or image generation, where the concept of a single "CLS token" or "classification accuracy" may not directly apply?
- **Sensitivity to Data Scarcity:** How does the SAE's reconstruction error (and the final VS2 accuracy) degrade as the proportion of the training set used for SAE training is reduced? Would VS2 work well in the data-scarce settings where CLIP zero-shot classification is most often relied upon?
- **Missing Experimental Details:** Could the authors please clarify the setup for the "CLIP classifier" used for pseudo-labeling? Is this a standard zero-shot classifier? If so, what text prompts were used? These details are essential for reproducibility.
- **Validating the Fallback Mechanism:** To validate the *per-instance* utility of the reliability diagnostic, would it be possible to bin the test instances by their reconstruction loss? A plot showing accuracy per bin would confirm whether instances with higher loss are indeed those where VS2 underperforms. Following this, have the authors experimented with implementing the actual fallback mechanism (i.e., if loss > threshold, use the baseline prediction)? This would show whether the diagnostic practically improves the final accuracy (a minimal sketch of such a check is given after this review).
- **More Baselines for VS2++:** For the VS2++ setting, where access to the unlabeled training set is assumed, there could be simpler but potentially stronger baselines. How does VS2++ compare against:
  - (a) a simple k-NN classifier using retrieval (i.e., a majority vote of the pseudo-labels from the top-k retrieved neighbors)?
  - (b) a lightweight linear probe trained on the training set's CLS tokens and their corresponding pseudo-labels (or, in the oracle setting, ground-truth labels)?

EditLens Prediction: Fully AI-generated
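To make the per-instance check asked for under "Validating the Fallback Mechanism" concrete, here is a minimal sketch of binning test instances by SAE reconstruction loss and of a threshold-based fallback to the baseline prediction. It is an illustration only; the array names (`recon_loss`, `steered_preds`, `baseline_preds`, `labels`) are hypothetical placeholders, and the threshold would be a new hyperparameter rather than anything specified in the paper.

```python
import numpy as np

def accuracy_by_loss_bin(recon_loss, steered_preds, labels, n_bins=10):
    """Bin test instances by SAE reconstruction loss (quantile bins) and report
    steered accuracy per bin, to check whether high-loss instances are the ones
    where steering underperforms."""
    recon_loss, steered_preds, labels = map(np.asarray, (recon_loss, steered_preds, labels))
    edges = np.quantile(recon_loss, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.clip(np.digitize(recon_loss, edges[1:-1]), 0, n_bins - 1)
    return [float((steered_preds[bin_ids == b] == labels[bin_ids == b]).mean())
            for b in range(n_bins)]

def fallback_predictions(recon_loss, steered_preds, baseline_preds, threshold):
    """Per-instance fallback: keep the steered prediction only when the
    reconstruction loss is below the threshold; otherwise use the baseline."""
    recon_loss = np.asarray(recon_loss)
    return np.where(recon_loss <= threshold, steered_preds, baseline_preds)
```

Sweeping `threshold` over the bin edges and comparing the accuracy of `fallback_predictions` against both VS2 and the baseline would directly test whether the diagnostic improves final accuracy.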
Review 2

Title: Visual Sparse Steering (VS2): Unsupervised Adaptation for Image Classification via Sparsity-Guided Steering Vectors
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary:** The paper presents an unsupervised steering technique (VS2) to adapt vision models. The authors mainly test the approach on three common benchmarks using CLIP ViT backbones.

**Strengths:**
- **S1.** The paper addresses a relevant problem, i.e., adapting foundation models (FMs) for downstream classification tasks.
- **S2.** The steering approach, as substantiated by the comparison shown in Table 1, appears to be more efficient than the baseline reconstruction task, making the idea interesting. I also find the analysis shown in Section 4.3 (Table 2) insightful. From these findings, it indeed appears that steering removes some ambiguity among samples of different classes that share common features.
- **S3.** The paper reads well, is quite systematic in its analysis, and presents results and figures clearly.

**Weaknesses:**
- **W1.** Lacking related work: In my opinion, the authors do not do a good job of situating the problem they are trying to solve. They focus mostly on a particular technique derived from the natural language literature, but in practice they are solving an unsupervised domain adaptation (UDA) problem: the encoder is a frozen generalist model (CLIP) that must be adapted to a particular classification task without using any labels. The current related work section does not mention any literature from UDA, despite its strong relevance.
- **W2.** Missing unsupervised adaptation baselines: The paper, which primarily tests the approach on a UDA problem for pre-trained FMs, lacks a comparison against popular baselines such as test-time augmentation (TTA), which is fully training-free.
- **W3.** Missing unsupervised adaptation datasets: Typically, test-time adaptation addresses distribution shifts. CIFAR and Tiny-ImageNet are surely in-distribution (ID) for CLIP pre-training data. CUB could arguably be considered OOD, since it contains several fine-grained bird species.
- **W4.** Missing quantification of "lightweight" adaptation: The paper claims that the adaptation is lightweight, but it lacks a proper comparison and quantification to support this claim.
- **W5.** Missing reliability implementation: Checking the reconstruction loss to assess the "reliability" of the prediction for OOD samples makes sense (Section 4.4). However, if I understand correctly, the authors did not implement this mechanism in practice, making the claim somewhat vague and untested.
- **W6.** Writing style: While asking questions usually raises the reader's interest, using them too frequently, as is done especially in the introduction, may end up sounding verbose and repetitive. This weakness is minor.

**Questions:**
- **Q1.** What are the differences between your approach and the UDA literature? An additional related work paragraph would clarify this point.
- **Q2.** Does VS2 work better than TTA? The latter does not need any training (a minimal TTA sketch is given after this review).
- **Q3.** Does VS2 also work well for datasets that are more OOD than the tested ones (which are arguably ID)?
- **Q4.** How lightweight is VS2 compared to full fine-tuning/LoRA adaptation? Clearly, you are not using labels, but how do the approaches compare in terms of FLOPs? I also did not manage to find specifics regarding the size of the SAE (I may have missed this detail).
- **Q5.** How would you actually implement the reliability check using the reconstruction loss? This would likely add an additional hyperparameter. Would this prevent performance loss (e.g., see CUB)?

In summary, I found the paper interesting, but in its current form it is more suitable for a workshop than a conference. I welcome feedback from the authors during the discussion period.

EditLens Prediction: Fully human-written
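For reference on Q2, a minimal sketch of the training-free TTA baseline the reviewer has in mind, applied to zero-shot CLIP classification, could look as follows. The `encode_image` function, the stochastic `augment` transform, and the pre-computed `text_embeds` matrix are assumed inputs for illustration, not components described in the paper.

```python
import torch

@torch.no_grad()
def tta_zero_shot_logits(encode_image, augment, image, text_embeds, n_views=16):
    """Training-free test-time augmentation: average L2-normalized image
    embeddings over random augmented views, then score the averaged embedding
    against pre-computed, L2-normalized class text embeddings."""
    views = torch.stack([augment(image) for _ in range(n_views)])  # (n_views, C, H, W)
    feats = encode_image(views)                                    # (n_views, d)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    mean_feat = feats.mean(dim=0)
    mean_feat = mean_feat / mean_feat.norm()                       # re-normalize the average
    return mean_feat @ text_embeds.T                               # (num_classes,) cosine logits
```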
Review 3

Title: Visual Sparse Steering (VS2): Unsupervised Adaptation for Image Classification via Sparsity-Guided Steering Vectors
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary:** This paper introduces Visual Sparse Steering (VS2), a test-time method that improves zero-shot image classification by steering vision models using sparse features from an autoencoder trained on internal activations, with no labeled data required. It amplifies the most salient sparse features identified by the autoencoder to shift embeddings in a meaningful direction, boosting accuracy on CIFAR-100, CUB-200, and Tiny-ImageNet (a minimal sketch of this amplification step is given after this review). VS2++ extends this by using unlabeled external images to selectively amplify only the most discriminative features via retrieval, achieving much larger gains when high-quality neighbors are available. Crucially, VS2 includes a built-in reliability check: if the autoencoder's reconstruction error is high, it falls back to the original model, avoiding harmful steering, a feature absent in prior methods.

**Strengths:**
1. VS2 introduces a novel, label-free test-time steering method for vision models using sparse autoencoder (SAE) features, achieving consistent improvements over CLIP zero-shot.
2. VS2++ extends this with retrieval-augmented selective amplification, demonstrating substantial accuracy gains (up to 21.44% on CIFAR-100) under oracle conditions.
3. Empirical results show robust improvements across diverse datasets (CIFAR-100, CUB-200, Tiny-ImageNet) and ViT backbones.
4. The paper provides thorough ablation studies, hyperparameter sensitivity analyses, and qualitative evidence that SAE latent features capture meaningful visual concepts.

**Weaknesses:**
1. VS2++'s gains (up to 21.44%) rely on oracle positive/negative sets, which are unrealistic in practice; performance drops substantially with noisy pseudo-labels, revealing a critical dependency on high-quality retrieval.
2. The method assumes a **well-trained, in-domain** sparse autoencoder (SAE) is available, limiting real-world applicability.
3. While the paper shows sparse features capture semantically meaningful attributes, there is no clear evidence that these features are helpful for classification; i.e., the SAE's reconstruction objective may not inherently capture discriminative visual concepts.
4. The reliance on DINOv2 for retrieval in VS2++ introduces an external, non-trainable component that adds computational overhead and potential bias.

**Questions:** See the weaknesses section.

EditLens Prediction: Lightly AI-edited
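To illustrate what "amplifying the most salient sparse features" could look like in code, here is a minimal sketch assuming a TopK-style SAE with encoder/decoder weight matrices. The weight names, the amplification factor `alpha`, the sparsity level `k`, and the additive combination with the CLS embedding are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def sparse_steering_vector(cls_embed, W_enc, b_enc, W_dec, alpha=2.0, k=32):
    """Illustrative sparse steering: encode the CLS embedding with a TopK SAE,
    upweight the surviving latents, decode back to embedding space, and add
    the decoded direction to the original embedding."""
    latents = torch.relu(cls_embed @ W_enc + b_enc)     # (n_latents,) pre-sparsity codes
    topk = torch.topk(latents, k)                       # keep only the k largest activations
    sparse = torch.zeros_like(latents)
    sparse[topk.indices] = alpha * topk.values          # amplify the active features
    steer = sparse @ W_dec                              # decode to a steering direction (d,)
    steered = cls_embed + steer
    return steered / steered.norm()                     # keep the embedding on the unit sphere
```

In this sketch the steering direction is simply the SAE decoding of the upweighted active latents, added back to the original embedding before zero-shot classification.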
Review 4

Title: Visual Sparse Steering (VS2): Unsupervised Adaptation for Image Classification via Sparsity-Guided Steering Vectors
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary:** This paper introduces Visual Sparse Steering (VS2), a test-time adaptation method for vision foundation models that constructs steering vectors from sparse features extracted by sparse autoencoders (SAEs) trained on internal activations. The core insight is that SAEs can identify salient, non-redundant features that, when amplified, improve zero-shot classification without requiring labeled contrastive examples. The authors propose two variants: VS2 operates label-free by uniformly amplifying sparse features, while VS2++ selectively weights features using pseudo-labeled neighbors from an external corpus. Experiments on CIFAR-100, CUB-200, and Tiny-ImageNet demonstrate consistent but modest gains (1-4%) over CLIP baselines with VS2, and substantial improvements (up to 21.44%) with VS2++ under oracle conditions. The paper also introduces a reliability diagnostic based on SAE reconstruction loss to detect when steering may be harmful.

**Strengths:**
1. The paper addresses an important gap by extending steering vectors to the vision domain without requiring explicit positive/negative examples, which is non-trivial given the entanglement and redundancy of visual representations compared to language tokens.
2. Unlike conventional steering methods, VS2 provides a natural fallback mechanism through SAE reconstruction error (Table 3), which signals when test inputs are out-of-distribution and steering should be avoided. This is a practically valuable contribution for deployment scenarios.
3. The sensitivity analyses, ablations on expansion factor and sparsity (Table 7), and per-class accuracy breakdowns (Table 2) provide good evidence that the method is robust to architectural choices and reveal which categories benefit most from sparse steering.
4. The oracle experiments with VS2++ (21.44% gains on CIFAR-100) effectively demonstrate the potential ceiling of the approach and motivate future work on better feature selection mechanisms.
5. The paper honestly discusses the reconstruction-vs-alignment tension (Appendix K, Table 10) and acknowledges that features learned for autoencoding are not necessarily optimal for classification, particularly on fine-grained datasets.

**Weaknesses:**
1. The paper primarily compares against CLIP zero-shot and reconstruction baselines but lacks comparisons with other test-time adaptation methods (e.g., TPT, DiffTPT, or prompt-based adaptation techniques). The single comparison with SpLiCE (Table 8) is helpful but insufficient to position the work within the broader test-time adaptation literature.
2. While VS2 shows consistent improvements on CIFAR-100 (3-4%), the gains on CUB-200 (0.93-1.08%) and Tiny-ImageNet (1.5-1.84%) are marginal. The paper hypothesizes that this stems from features learned for reconstruction not aligning with classification needs, but it does not provide sufficient analysis or solutions beyond the prototype-based approach in Appendix K, which requires labels during SAE training.
3. The paper states that VS2 is "lightweight" but provides no runtime or memory comparisons. Training SAEs requires processing all layers across entire datasets, and the inference-time reconstruction steps add overhead. A quantitative efficiency analysis relative to the zero-shot baseline would strengthen the practical applicability claims.
4. While the geometric intuition is clear (amplify sparse features -> steer toward salient directions), the paper lacks a formal analysis of why autoencoding objectives should discover classification-relevant features, or under what conditions this alignment holds. The empirical observation that it works on CIFAR-100 but struggles on CUB-200 suggests important gaps in understanding.
5. Evaluation scope limitations: All experiments use CLIP ViT backbones; generalization to other architectures (ConvNets, other VLMs) is unexplored. Only classification tasks are evaluated; applicability to detection, segmentation, or other vision tasks remains unclear. The VS2++ non-oracle results (Table 1) often show smaller gains than standard steering vectors, undermining claims about SAE superiority when supervision is noisy.
6. The term "contrastive data" is used loosely (the method does not require labeled positives/negatives but still uses a contrastive loss in training). Section 3.2.2 could better explain how pseudo-labeling quality affects VS2++ performance and why negative groups from top-N neighbors constitute "hard" cases.

**Questions:**
1. Can you provide wall-clock time and memory comparisons between VS2, standard CLIP inference, and other test-time adaptation methods? How does SAE training time scale with dataset size and model depth?
2. Tables 12(b) and 12(d) show classes where VS2 hurts performance. Can you characterize these failure modes? Are there systematic patterns (e.g., texture-based vs. shape-based classes, frequency in training data) that predict when steering will be harmful beyond reconstruction error?
3. Have you tested VS2 with other vision encoders (DINOv2, MAE, supervised ImageNet models)? Does the method's effectiveness depend on CLIP's specific training objective or architecture?
4. How does VS2 compare with learnable prompt methods (CoOp, CoCoOp) or adapter-based approaches that also aim to improve zero-shot performance? These methods require some labeled data but might provide more principled comparisons than pure zero-shot evaluation.
5. The prototype-alignment approach (PASS) in Appendix K shows promise but requires labels. Could a clustering-based approach discover multiple prototypes per class in an unsupervised manner (a minimal sketch is given after this review)? Have you explored hierarchical or mixture models for class representations?

EditLens Prediction: Fully AI-generated
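To make the clustering question in Q5 concrete, here is a minimal sketch of an unsupervised multi-prototype construction: within each pseudo-labeled group, k-means yields several prototypes per class, and a test embedding is assigned the class of its nearest prototype. The inputs (`embeds`, `pseudo_labels`) and the choice of three prototypes per class are hypothetical placeholders, not part of the paper's PASS method.

```python
import numpy as np
from sklearn.cluster import KMeans

def multi_prototype_classifier(embeds, pseudo_labels, n_protos=3, seed=0):
    """Illustrative multi-prototype construction: for each pseudo-labeled group
    (e.g. zero-shot CLIP predictions over CLS tokens), run k-means to obtain
    several prototypes per class; classify by nearest prototype. No ground-truth
    labels are used."""
    embeds = np.asarray(embeds)            # (N, d)
    pseudo_labels = np.asarray(pseudo_labels)  # (N,)
    protos, proto_classes = [], []
    for c in np.unique(pseudo_labels):
        X = embeds[pseudo_labels == c]
        k = min(n_protos, len(X))          # guard against tiny pseudo-classes
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        protos.append(km.cluster_centers_)
        proto_classes.extend([c] * k)
    protos = np.concatenate(protos)        # (total_protos, d)
    proto_classes = np.asarray(proto_classes)

    def predict(x):
        # Nearest-prototype rule: return the class of the closest prototype.
        dists = np.linalg.norm(protos - x, axis=1)
        return proto_classes[np.argmin(dists)]

    return predict
```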