ClusCAM: Clustered Visual Explanations for Vision Models in Image Classification
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes ClusCAM, a gradient-free post-hoc explanation method that groups internal representations into meaningful clusters (meta-representations). The importance of each cluster is then measured through logit differences, with dropout and a temperature-scaled softmax emphasizing the most influential groups.
By modeling group-wise interactions, ClusCAM generates sharper, more interpretable, and more faithful explanations. The method is architecture-agnostic, working with both CNNs and Vision Transformers. Experimental results show that ClusCAM surpasses state-of-the-art interpretability techniques.
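For concreteness, a minimal sketch of the importance-weighting step described in this summary, assuming the input has already been soft-masked by each upsampled cluster map; every name and hyperparameter below (`model`, `masked_inputs`, `tau`, `drop_p`) is an illustrative assumption, not the authors' code.

```python
import torch
import torch.nn.functional as F

def cluster_weights(model, image, masked_inputs, target_class, tau=0.5, drop_p=0.1):
    """Weight each meta-representation by its effect on the target logit.

    image: (1, 3, H, W) clean input; masked_inputs: (K, 3, H, W), the input
    soft-masked by each of the K upsampled cluster maps. tau and drop_p are
    assumed hyperparameters (softmax temperature, dropout rate).
    """
    with torch.no_grad():
        base = model(image)[0, target_class]              # logit on the clean image
        masked = model(masked_inputs)[:, target_class]    # logit per masked copy
    diffs = masked - base                                 # logit differences (sign convention assumed)
    diffs = F.dropout(diffs, p=drop_p, training=True)     # randomly drop some groups, as the summary describes
    return F.softmax(diffs / tau, dim=0)                  # temperature-scaled softmax over groups
```

The resulting weights would then be used to combine the cluster maps into the final saliency map.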
The paper proposes a new interpretability method, ClusCAM, and provides extensive experimental validation to demonstrate its effectiveness. The authors present numerous quantitative results that highlight the superiority of their approach.
The experimental setup is highly comprehensive, covering multiple datasets—including the ILSVRC2012 benchmark and an Alzheimer’s MRI dataset—and a wide range of model backbones, such as ResNet variants (ResNet-18/34/50/101), EfficientNet, InceptionNet, and various Vision Transformers (e.g., ViT-B, Swin-B, LeViT-192/256, CaiT-XXS-24, and PVTv2).
A diverse set of evaluation metrics is also employed to thoroughly demonstrate the robustness and effectiveness of the proposed method.
The paper presents extensive experiments across multiple backbones to demonstrate the superiority of ClusCAM, but it lacks comparisons with several important baseline methods:
1. Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized Input Sampling for Explanation of Black-box Models. BMVC 2018.
2. Quan Zheng, Ziwei Wang, Jie Zhou, and Jiwen Lu. Shap-CAM: Visual Explanations for Convolutional Neural Networks Based on Shapley Value. ECCV 2022.
Although ClusCAM is extensively validated across multiple backbones to demonstrate its superiority, the paper offers limited insight into the underlying model behavior. Several prior studies have explored interpretability and explanation in deep models from different perspectives and would be worth discussing:
1. Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. On the Adversarial Robustness of Visual Transformers. arXiv:2103.15670, 2021.
2. Yutong Bai, Jieru Mei, Alan L. Yuille, and Cihang Xie. Are Transformers More Robust than CNNs? NeurIPS 2021.
3. Mingqi Jiang, Saeed Khorram, and Li Fuxin. Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods. CVPR 2024.
Are there any future plans to extend or apply this method to other tasks or domains?
Moderately AI-edited
ClusCAM: Clustered Visual Explanations for Vision Models in Image Classification
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper aims to improve the explainability (XAI) of CNNs and ViTs. The idea is to cluster the internal representations into K groups, upsample and normalize them to the input resolution, and use them as soft masks of the input. The differences between the logits of the masked inputs and of the benign input are then aggregated. This approach weights the different concepts present in the input; it is akin to projecting the input onto concepts and highlighting mostly the dominant ones.
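For clarity, here is a minimal sketch of this pipeline as I understand it; every name, the choice of clustering the channel maps, and the hyperparameters are my own illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluscam_like_map(model, feature_extractor, image, target_class, k=16, tau=0.5):
    """Rough reconstruction of the described pipeline: cluster internal features
    into k groups, upsample each group map to the input size, soft-mask the input,
    and aggregate weighted logit differences into a single saliency map."""
    _, _, H, W = image.shape                                     # image: (1, 3, H, W)
    with torch.no_grad():
        feats = feature_extractor(image)[0]                      # (C, h, w) internal representations
    C = feats.shape[0]
    # Group the C channel maps into k meta-representations (k-means with k-means++ init);
    # assumes every cluster ends up non-empty.
    labels = torch.as_tensor(
        KMeans(n_clusters=k, init="k-means++", n_init=10).fit_predict(
            feats.reshape(C, -1).cpu().numpy()))
    metas = torch.stack([feats[labels == j].mean(0) for j in range(k)])   # (k, h, w)
    metas = F.interpolate(metas[None], size=(H, W), mode="bilinear",
                          align_corners=False)[0]                          # upsample to input size
    lo = metas.amin(dim=(1, 2), keepdim=True)
    hi = metas.amax(dim=(1, 2), keepdim=True)
    metas = (metas - lo) / (hi - lo + 1e-8)                                # normalize each map to [0, 1]
    with torch.no_grad():
        base = model(image)[0, target_class]                               # logit on the benign input
        masked = model(image * metas[:, None])[:, target_class]            # logits on soft-masked inputs
    weights = F.softmax((masked - base) / tau, dim=0)                      # temperature-scaled weights
    return (weights[:, None, None] * metas).sum(0)                         # (H, W) saliency map
```

This sketch glosses over the dropout on the importance scores and the filtering of the r% least important meta-representations that the paper also describes.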
* XAI is a very important topic, especially nowadays, when it is crucial to better understand the inner workings of networks.
* The Related Work section is detailed and informative, and the academic gap is well explained.
* The idea of combining concept extraction with explainability is intriguing.
* The approach is presented as if it applies equally to both CNNs and ViTs. One of the gaps mentioned is that current approaches overlook weighting when aggregating representations. While this may hold for CNNs, it is not true for ViTs; see for example [1], [2], and [3]. Moreover, there is a large body of dedicated XAI approaches specifically for ViTs that the paper fails to mention.
* The method is primarily empirical and lacks intuitive explanation of the considerations behind each step. For example: why use K-means++ specifically (and not another clustering scheme)? What does the element-wise product between a meta-representation and the input represent? Why is Formula 5 reasonable, and what if it were a division instead? Why remove the r% least important meta-representations (if they are not important for the classification, they have no impact anyway)? In general, empirical progress can be acceptable, but in my opinion it should come with intuitive explanations so that follow-up researchers can extend the ideas; this is lacking in the current submission.
* Most of the improvement comes from very empirical steps, such as filtering out less relevant projections. I believe the authors need to clearly separate the conceptual novelty from the empirical contributions and to focus more on the conceptual part. Most of the paper elaborates on the empirical steps, which offer less academic value.
Minor weaknesses:
* Confusing notation: H and W represent the input dimensions while h and w represent the representation dimensions. I would choose other notation to make this clearer.
* The placement of visualizations is a bit awkward. For example, the algorithm box appears before the algorithm itself is explained.
* The abbreviation ViT is more common for Vision Transformers than VT.
* r is used both for the dropout ratio and for the percentage of meta-representations to filter out.
* What is the meaning of the colors in Fig. 2? If they merely indicate different meta-representations, this is not clear from the figure.
refs:
[1] Transformer Interpretability Beyond Attention Visualization, Chefer et al., CVPR 2021.
[2] Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer, Wu et al., CVPR 2024.
[3] From Attention to Prediction Maps: Per-Class Gradient-Free Transformer Explanations, Schaffer et al., preprint.
* It is known that there are polysemantic neurons, i.e., neurons that activate differently for different inputs [1]. How do you think your approach would be affected by this? I am specifically curious about the hard clustering step, which I assume forces a hard selection of a single "meaning".
* The clustering stage is closely related to concept clustering, which is typically implemented through sparse autoencoders (SAEs). Have you tried implementing it with SAEs ([2], for example, though there are many papers on this topic)? If so, what are the results? If not, I would recommend trying it, since it is always better to build on grounded approaches.
* How is the normalization done? Which upsampling method is applied?
* Why does M_j represent importance? The element-wise multiplication is explained as a sort of soft masking, where M is the soft-mask matrix. It is implicitly assumed that the magnitude of the representation reflects importance (personally, I agree with this observation), but in your view, why does it hold?
* Why did you select dropout to filter out outliers? It is a stochastic operator and may, in some cases, not filter them at all. Moreover, it is not good practice for inference to be random (seed-dependent); a toy illustration of this concern is given below. It was found that "registers" might be the cause of outlier heads in ViTs [3]; at least for ViTs, this could be a starting point for finding the outliers systematically instead of filtering them stochastically.
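A toy example of the seed dependence I have in mind (the scores, dropout rate, and temperature are my own illustrative values, not taken from the paper):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 0.5, 3.0, 1.0, 0.2])       # hypothetical cluster importance scores

for seed in (0, 1, 2):
    torch.manual_seed(seed)
    kept = F.dropout(scores, p=0.2, training=True)      # zeroes a random subset and rescales the rest
    weights = F.softmax(kept / 0.5, dim=0)              # temperature-scaled softmax (tau = 0.5 assumed)
    print(seed, [round(w, 3) for w in weights.tolist()])
# Different seeds zero out different clusters, so the resulting explanation changes from run to run.
```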
In general, I think the paper is too empirical in nature, and the authors should clearly distill the pure conceptual contribution from the empirical steps. In some cases it is even preferable to have a method that performs slightly worse but is much more understandable. In the case of this approach, it is so empirical that it is very hard to isolate the pure contribution. Moreover, I think the authors should focus more on explaining the intuitions behind each step of the approach and make it clearer what information is better captured by it.
refs:
[1] Interpreting the Second-Order Effects of Neurons in CLIP, Gandelsman et al., ICLR 2025.
[2] Interpreting CLIP with Hierarchical Sparse Autoencoders, Zaigrajew et al., ICML 2025.
[3] Vision Transformers Need Registers, Darcet et al., ICLR 2024.
Fully human-written
ClusCAM: Clustered Visual Explanations for Vision Models in Image Classification
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes ClusCAM, a novel gradient-free post-hoc explanation framework designed to enhance the faithfulness and interpretability of visual explanations in image classification models. Unlike conventional CAM-based approaches that treat internal representations as independent, ClusCAM clusters them into semantically coherent meta-representations using the K-Means++ algorithm. The importance of each cluster is quantified through logit-based differences, followed by a dropout mechanism and temperature-scaled softmax to suppress irrelevant signals and highlight the most influential regions. ClusCAM is architecture-agnostic, effectively applicable to both convolutional neural networks (CNNs) and Vision Transformers (ViTs). Extensive experiments show that ClusCAM consistently outperforms state-of-the-art baselines across multiple quantitative metrics, producing sharper and more interpretable visualizations.
ClusCAM introduces a group-wise attribution strategy by clustering internal representations into higher-level meta-representations. This approach marks a significant improvement over conventional CAM methods, which assume that individual features contribute independently and with equal importance—often resulting in noisy or unreliable explanations. Furthermore, the paper presents a data-driven procedure for selecting key hyperparameters, thereby reducing the reliance on manual tuning and improving the overall stability and reproducibility of the method.
ClusCAM introduces additional computational overhead compared to highly efficient methods such as Grad-CAM. The initial K-Means++ clustering of internal representations increases inference time, particularly for large-scale models. Although ClusCAM can operate faster than exhaustive ablation-based approaches like Score-CAM and Ablation-CAM when applied to Vision Transformers (ViTs), its computational cost still limits scalability in real-time or resource-constrained environments. Moreover, while the paper proposes data-driven strategies for selecting key hyperparameters, these procedures are primarily heuristic and lack a strong theoretical foundation, leaving room for further formal analysis and optimization.
While the proposed method demonstrates several notable strengths, I have some concerns regarding its broader applicability and theoretical grounding. For instance, gradient-based analyses still provide valuable information about output probabilities, as they capture model sensitivity through backward propagation. It would be worthwhile to investigate whether ClusCAM could be integrated with gradient-based interpretability approaches, since many state-of-the-art explanation frameworks leverage both forward and backward reasoning.

Although the reported results indicate that ClusCAM outperforms several contemporary baselines, a more comprehensive comparison with recent state-of-the-art methods, such as Attention-Guided CAM (AAAI 2024), which combines forward and backward attention mechanisms to suppress noise in Vision Transformers, would further strengthen the empirical validation of this work. Additionally, for small target objects, the hyperparameters, particularly the number of clusters, may have a significant impact on the interpretability and stability of the resulting explanations; this sensitivity warrants further analysis.

For the temperature-scaled softmax, ClusCAM uses a τ value less than one. I agree that without temperature scaling the softmax weights can become overly uniform; however, a low τ amplifies noisy or erroneous signals (see the worked example below). Therefore, it would be beneficial to validate this behavior using test images that contain large homogeneous backgrounds with small target objects.
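To make the temperature concern concrete, a small numeric illustration (the scores are hypothetical, not taken from the paper) shows how a low τ sharpens the weights and lets a single spurious score dominate:

```python
import torch
import torch.nn.functional as F

# Hypothetical cluster scores; the last one is a spuriously high background cluster.
scores = torch.tensor([1.2, 1.0, 0.9, 2.0])

for tau in (1.0, 0.5, 0.1):
    weights = F.softmax(scores / tau, dim=0)
    print(tau, [round(w, 3) for w in weights.tolist()])
# tau = 1.0 -> roughly [0.21, 0.17, 0.15, 0.47]: weights remain fairly spread out.
# tau = 0.1 -> roughly [0.00, 0.00, 0.00, 1.00]: the spurious cluster receives almost all the weight.
```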
Moderately AI-edited |