SuperActivators: Transformers Concentrate Concept Signals in Just a Handful of Tokens
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces an interesting mechanism called the SuperActivator, showing that a small set of transformer tokens exhibit distinctively high activations for specific concepts and can therefore be used to capture concept vectors cleanly. The authors provide detailed analysis and thorough experiments to introduce, validate, and leverage the mechanism for concept detection and localization, demonstrating its generalization across different modalities, models, and concept types.
The proposed mechanism is intriguing. The paper features detailed analysis, a smooth narrative that fully introduces and argues for the SuperActivator mechanism, and extensive experiments validating its effectiveness.
The authors also provide well-documented code for validating and reproducing the results.
The mechanism proposed in this paper extracts and represents concept vectors using a small number of highly activated tokens, which can improve the efficiency and accuracy of concept detection and localization. However, because it leverages only a small set of tokens, the approach may fail to provide a comprehensive view of the model's reasoning process, such as that offered by training Sparse Autoencoders or using Concept Relevance Propagation. This is likely a difficult trade-off to overcome.
1. The paper acknowledges the challenge of neuron polysemy (i.e., a single neuron or direction encoding multiple unrelated concepts), which can undermine the accuracy of concept detection and localization. Given that the SuperActivator mechanism relies on a small set of highly activated tokens to capture concept signals, could the authors elaborate on how their framework and analytical pipeline mitigate the impact of neuron polysemy? For instance, do SuperActivator tokens inherently exhibit lower polysemy compared to other tokens, and if so, what evidence (e.g., quantitative analysis of semantic overlap between SuperActivators for distinct concepts) can be provided to support this? Additionally, are there any explicit designs (e.g., post-hoc filtering of polysemous tokens) in the current approach to reduce interference from polysemy when identifying or localizing target concepts?
2. Recent studies have highlighted phenomena like "attention sink" and the use of "register mechanisms" in transformer architectures, where specific positional tokens (often with fixed positions, e.g., initial or final tokens) consistently attract excessive attention weights during training and inference. Since the SuperActivator mechanism focuses on highly activated tokens, there is a potential risk of conflating attention sink tokens (which may have high activations due to positional bias rather than genuine concept relevance) with true SuperActivators. To address this concern: (1) Could the authors provide an analysis of the positional distribution of SuperActivator tokens across their experimental datasets (e.g., whether SuperActivators are disproportionately concentrated in fixed positions prone to attention sinks)? (2) Have the authors explored methods to disentangle attention sink-driven activations from concept-driven SuperActivator activations, and if so, what were the key findings? (3) More broadly, how do the authors view the relationship between the SuperActivator mechanism and attention sink/register mechanisms—are they independent, complementary, or overlapping phenomena—and what key differences distinguish them in terms of their underlying causes (e.g., positional bias vs. semantic encoding) and functional roles in transformer behavior?
Overall, I consider this an excellent piece of work and look forward to the authors' discussion on the aforementioned questions.
Moderately AI-edited
SuperActivators: Transformers Concentrate Concept Signals in Just a Handful of Tokens
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces SuperActivators, a framework for interpreting transformer models by analyzing concept responses and determining an optimal threshold that mitigates the effect of noise. By comparing the response distributions between samples with and without a specific concept, the paper sets the threshold based on this analysis (using the 99th quantile). Extensive experiments demonstrate that the proposed method outperforms baselines on concept detection across various datasets. It also improves performance across various attribution methods when compared with traditional concept vectors on attribution tasks. The method applies not only to images but also to language.
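For concreteness, my understanding of the thresholding step is roughly the following minimal sketch (the function names are my own, not the authors' code, and I assume the quantile is taken over the out-of-concept activations):

```python
import numpy as np

def fit_threshold(out_concept_acts, q=0.99):
    # Cutoff at the q-th quantile of out-of-concept token activations;
    # tokens activating above it are treated as SuperActivators.
    return np.quantile(out_concept_acts, q)

def concept_present(token_acts, tau):
    # A sample is flagged as containing the concept if at least one
    # of its tokens activates above the threshold.
    return np.max(token_acts) > tau
```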
1. The extensive experiments show that SuperActivators outperform the baseline methods.
2. The proposed method offers a novel approach to determining the activation threshold for both image and language tokens, thereby mitigating the impact of noise.
1. To achieve optimal performance, the parameter N used to determine the decision threshold must be tuned per concept. Selecting the layer also requires a similar search.
1. Why are the SuperActivators found from the validation set instead of the training set? Does this misalign with real-world applications, where we will not have all the inference samples needed to compute $S^{+}_{val,c}$?
2. Is the SuperActivator mechanism also applicable to [1], which adopts additional register tokens to alleviate high-norm features?
3. In Table 1, the baselines include the random, last, mean, and CLS tokens; I would be curious to see a comparison with the top-1 token as well.
Minor question: Does the SuperTok in Figure 16 refer to the SuperActivators?
[1] Darcet, Timothée, et al. "Vision Transformers Need Registers." The Twelfth International Conference on Learning Representations (2024).
Fully human-written
SuperActivators: Transformers Concentrate Concept Signals in Just a Handful of Tokens
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces SuperActivators, a mechanism for improved concept detection and localization in images. The authors show that the distributions of concept activations differ between concept-containing and non-concept patches: a tail of concept activations can be separated out that reliably indicates concept presence. In experiments, the authors show that SuperActivators lead to improved concept detection (based on the presence of SuperActivators in an image), as well as concept localization (based on various attribution maps).
* Empirically validated finding that the high tail (e.g., 99th percentile) of concept activation (the "SuperActivator" patches) provides a clear and reliable signal of concept presence.
* SuperActivators avoid aggregating concept activation across patches, keeping the concept activation signal patch-wise. This is an improvement over methods that pool activations (e.g., max or mean pooling).
* SuperActivators provide a very consistent improvement across evaluations, which is a major practical strength as it suggests robustness of the proposed approach.
For me, the biggest weakness of this work is that it does not contain any finding / contribution / experiment that particularly excites me, even though it is clear that the authors invested a lot of time in this work. As I understand it, the big finding is that there is a tail of concept activations with a high likelihood of corresponding to the actual concept of interest. I find this unsurprising, as two mean-shifted Gaussians will already show this behavior (a basic insight leveraged in hypothesis testing; see the sketch below). The experiments show improvements; however, the baselines seem rather weak, as they are mostly adapted from NLP to CV, and they struggle to beat RandTok.
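To make the mean-shifted-Gaussians point concrete, here is a minimal sketch with illustrative parameters of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for out-of-concept and in-concept activation distributions:
# two Gaussians whose means differ (parameter values are illustrative).
neg = rng.normal(loc=0.0, scale=1.0, size=100_000)
pos = rng.normal(loc=1.5, scale=1.0, size=100_000)

# Threshold at the 99th percentile of the negative distribution
# (about 2.33 for a standard normal).
tau = np.quantile(neg, 0.99)

# By construction ~1% of negatives exceed tau, while a far larger
# fraction of positives does -- tail separation follows from any
# mean shift, which is the basic hypothesis-testing insight.
print(f"negatives above tau: {np.mean(neg > tau):.3f}")  # ~0.010
print(f"positives above tau: {np.mean(pos > tau):.3f}")  # ~0.203
```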
Also, in my limited experience, CAVs are mostly used for measuring a model's reliance on specific concepts, less for concept detection or localization (which may also explain the absence of available baselines). Thus, this repurposing might warrant stronger motivation: I am currently not convinced that CAVs for concept detection and localization make sense, given the long history of object detection and localization in the computer vision field [1].
Additional concerns:
* SuperActivators rely on a segmented validation set for determining the desired quantile (line 236). To my knowledge, concept (activation) vectors are predominantly used to measure a model's reliance on a given concept, for which a labelled validation set is required. The additional need for segmentations, however, adds overhead.
* A global threshold, as opposed to a per-sample one, is also not a new idea; off the top of my head, the paper [2] and the field of conformal prediction [3] use it too.
* SuperActivators build on the idea that the high tail of concept activation reliably indicates concept presence. However, for the performance experiments, the threshold is tuned by validation F1, which no longer focuses on the tail but instead optimizes a trade-off between the two distributions. In my opinion, this dampens the importance of Section 3: in the end, the method does not optimize for SuperActivators but for a threshold that maximizes the F1 of the in- vs. out-of-concept distributions, whose optimal solution will not separate only the in-concept tail but will lie somewhere in the support of both distributions (see the sketch after this list). That is, the F1-optimized method ultimately leverages the fact that the two distributions are non-identical, rather than the fact that the tail of one distribution is outside the support of the other.
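Continuing the illustrative Gaussian example from above, the following sketch shows the gap between the F1-optimal threshold and the tail threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 100_000)  # out-of-concept activations
pos = rng.normal(1.5, 1.0, 100_000)  # in-concept activations

def f1(tau):
    # Predict "concept present" whenever an activation exceeds tau.
    tp = np.sum(pos > tau)
    fp = np.sum(neg > tau)
    fn = np.sum(pos <= tau)
    return 2 * tp / (2 * tp + fp + fn)

taus = np.linspace(-2.0, 5.0, 701)
f1_tau = taus[np.argmax([f1(t) for t in taus])]
tail_tau = np.quantile(neg, 0.99)

# The F1-optimal threshold lands between the two means, well inside
# the overlap of both distributions and far below the tail threshold.
print(f"F1-optimal threshold: {f1_tau:.2f}")         # ~0.45
print(f"99th-percentile threshold: {tail_tau:.2f}")  # ~2.33
```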
Minor concerns:
* The introduction motivates the power of unsupervised concept extraction, but extracting SuperActivators requires a concept-segmented validation set. Thus, SuperActivators cannot perform unsupervised concept discovery, and a wrong expectation is set that might confuse the reader.
* It seems contradictory that line 222 describes the 99th percentile of the out-of-concept set, while the threshold is afterwards defined using the in-concept set.
[1] Köhler, Mona, Markus Eisenbach, and Horst-Michael Gross. "Few-shot object detection: A comprehensive survey." IEEE Transactions on Neural Networks and Learning Systems 35.9 (2023): 11958-11978.
[2] Vandenhirtz, Moritz, and Julia E. Vogt. "From Pixels to Perception: Interpretable Predictions via Instance-wise Grouped Feature Selection." Forty-second International Conference on Machine Learning (2025).
[3] Angelopoulos, Anastasios N., and Stephen Bates. "Conformal prediction: A gentle introduction." Foundations and trends® in machine learning 16.4 (2023): 494-591.
I invite the authors to address any misunderstandings and weaknesses I mentioned.
* Do the baselines in Table 1 also rely on the presence of validation set segmentations, or do they only use the concept labels?
* How are attribution methods adapted to explain the average embedding of local SuperActivators (and also of standard global concept vectors)?
Fully human-written
SuperActivators: Transformers Concentrate Concept Signals in Just a Handful of Tokens
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes a new method for determining whether a concept appears within an input image or text using transformer architectures, based on a new aggregation scheme over the activation scores of intermediate tokens across all layers.
They observe that positive samples, in contrast with negative samples, usually super-activate in at least one token; thus, focusing on highly activated tokens rather than averaging activations over tokens can result in improved performance.
The paper clearly describes the detection of a concept using token-level activation scores within a transformer. The contribution is small but clever: it builds on the observation that samples containing the concept typically include super-activated tokens, whereas negative samples may show modest activations but lack any highly activated tokens.
Consequently, their method can outperform existing approaches on metrics such as F1, which jointly measures precision and recall. Moreover, through an attribution lens, they argue that super-activation more accurately identifies the tokens responsible for the presence of the desired concepts.
I would like to see a deeper analysis of SuperActivators' failure modes. In particular, the work lacks a systematic examination of when the method fails. Can we identify, for each concept, the settings where it does not perform well? The concept study for text seems limited (e.g., sarcasm and a few emotions); can this be extended to more granular and diverse concepts, and would the method still hold up?
More broadly, I’m interested in characterizing when the proposed method is sensitive versus robust, and in clarifying the factors (data properties, concept definitions, model layers, thresholds) that drive these behaviors.
Also, how would context affect this method? For example, in language models, what happens if we add a (possibly unrelated) sentence at the beginning of the input—would that affect performance? How dataset-dependent are the thresholds and results, and how generalizable is the method?
They are mentioned in the weaknesses section.
Lightly AI-edited