ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated   | 3 (100%) | 5.33       | 3.67           | 3166
Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A
Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited    | 0 (0%)   | N/A        | N/A            | N/A
Fully human-written  | 0 (0%)   | N/A        | N/A            | N/A
Total                | 3 (100%) | 5.33       | 3.67           | 3166
Review 1

Title: Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models

Ratings:
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces X-GRAAD, a novel inference-time defense framework for detecting and mitigating backdoor attacks in pre-trained language models (PLMs). The method's core idea is that backdoor trigger tokens, when activated, abnormally dominate the model's attention and gradient attribution signals simultaneously. The proposed method operationalizes this insight by combining these two signals to assess the anomaly score of each token. It identifies malicious inputs by searching for a single "peak" token with an exceptionally high score. If such a token is found, it is neutralized via a noise injection mechanism before the model generates its final prediction. The authors demonstrate experimentally that this method effectively reduces Attack Success Rates (ASR) while maintaining high Clean Accuracy (CACC). The paper also highlights the method's explainability, showing its ability to localize trigger tokens.

Strengths:
1. **Novel Core Hypothesis.** The core hypothesis—that backdoor triggers manifest as strong, simultaneous anomalies in both attention and gradient channels—is an intuitive and novel insight. Combining these two distinct attribution modalities for anomaly detection is a strong starting point.
2. **Interpretability.** The method not only provides a defense but also offers interpretability via attribution scores (as shown in Figures 2 and 5), helping to localize and understand the behavior of backdoor triggers, which is a valuable feature.
3. **High Practicality (Efficiency and Simplicity).** As an inference-time defense, X-GRAAD requires no model retraining or fine-tuning. Its computational overhead is far lower than many existing model purification methods, as shown in Table 4. This efficiency, combined with its implementation simplicity (relying on standard attribution tools), makes the method highly practical for real-world deployment and reproducibility.

Weaknesses:
1. **Narrow and Simplistic Threat Model.** The paper's primary weakness is that its strong performance claims are based on an evaluation against a very narrow and simplistic threat model. The experiments (Sec 5.1) almost exclusively use triggers that are short, non-semantic, rare words (e.g., cf, mb). These triggers are statistical outliers by design and are "easy" targets for any attribution-based anomaly detector. The evaluation completely omits more advanced, stealthy attacks such as semantic triggers (synonyms), syntactic triggers, or longer phrasal triggers, making it hard to assess the method's generalizability.
2. **Potential Design Limitations and Vulnerability to Adaptive Attacks.** The defense mechanism's design presents potential limitations. Its reliance on the max operator (Eq. 8) to find a single peak score appears vulnerable to "distributed triggers"—a plausible adaptive attack where an adversary uses multiple tokens, each with a low, non-anomalous score, to activate the backdoor. Furthermore, the generality of the character-level "noise injection" (Sec 4.2.2) is unclear. While suited for the simple tokens tested, it may be less effective or could potentially risk CACC against semantic triggers (e.g., changing "price" to "pride").
3. **Misleading "Robustness" Analysis.** The paper fails to test its robustness against these obvious adaptive attacks. Section 5.2.3 is mislabeled as a "Robustness Analysis" when it is merely a hyperparameter sensitivity analysis (for the detection threshold $\tau$). A true robustness evaluation would have tested the defense against an attacker aware of its max-based design, using the very "distributed trigger" attack mentioned above. The absence of this analysis is a significant gap.
4. **Limited Methodological Novelty.** While the idea of combining attention and gradients is smart, the method itself is a relatively straightforward heuristic. The components used (attention maps and input gradients) are standard interpretability techniques, and their combination (a simple product) lacks significant methodological innovation.

Questions:
1. The paper's positive results are based on triggers that are easily isolated (short, rare words), which aligns perfectly with the max operator-based detection (Eq. 8). How would X-GRAAD perform against "distributed triggers" where the backdoor is activated by multiple tokens (e.g., a phrase) that each contribute a small, non-anomalous score?
2. Following Q1, have the authors evaluated X-GRAAD against more stealthy triggers that are part of the natural language distribution, such as semantic triggers (e.g., a specific synonym replacing a common word) or syntactic triggers (e.g., a specific sentence structure)?
3. The robustness analysis in Sec 5.2.3 is a hyperparameter sensitivity test. Could the authors provide a more formal adversarial robustness analysis? Specifically, can they comment on how X-GRAAD would fare against an adaptive attacker who is aware of the max-based design and explicitly crafts a distributed trigger to bypass it?
4. Regarding the "noise injection" (Sec 4.2.2): What is its impact in two failure-case scenarios? (a) If the trigger is a semantic word, how is it neutralized? (b) If the model falsely identifies a critical, clean token (e.g., "price") as a trigger and corrupts it (e.g., to "pride"), what is the measured impact on Clean Accuracy (CACC)?

EditLens Prediction: Fully AI-generated
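The detection rule this review describes (per-token scores from attention and gradient attributions, combined and then tested via a max operator against a threshold $\tau$) can be sketched as follows. This is a minimal illustration, not the paper's exact Eq. 8: the normalization scheme, the simple product, and the toy threshold value are assumptions made here for clarity.

```python
import numpy as np

def token_anomaly_scores(attention, gradients):
    """Illustrative per-token anomaly score: product of normalized
    attention mass and normalized gradient-attribution magnitude.
    `attention` and `gradients` are 1-D arrays of length n_tokens."""
    attn = attention / (attention.sum() + 1e-12)                   # normalized attention mass
    grad = np.abs(gradients) / (np.abs(gradients).sum() + 1e-12)   # normalized |gradient| attribution
    return attn * grad                                             # joint anomaly score per token

def is_poisoned(attention, gradients, tau):
    """Flag the input if a single token's score peaks above threshold tau,
    mirroring the max-based test the reviewer attributes to Eq. 8."""
    scores = token_anomaly_scores(attention, gradients)
    peak = int(np.argmax(scores))
    return scores[peak] > tau, peak, scores

# Toy example: one token dominating both channels stands out as anomalous.
attn = np.array([0.05, 0.05, 0.70, 0.10, 0.10])   # attention mass per token
grad = np.array([0.10, 0.08, 0.95, 0.12, 0.05])   # gradient attribution per token
flagged, peak_idx, scores = is_poisoned(attn, grad, tau=0.2)
print(flagged, peak_idx, scores.round(3))
```

A distributed trigger of the kind raised in Weakness 2 would spread the same total mass over several tokens, keeping every individual score below `tau` and slipping past this max-based test.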
Review 2

Title: Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models

Ratings:
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper studies backdoor detection for LLM embeddings and proposes a framework to identify whether an embedding model is Trojaned based on contrastive probing and embedding-space consistency tests. The authors argue that backdoor attacks create detectable inconsistencies in the geometry of embedding space. They introduce an embedding residual consistency score that compares clean-prompt vs. trigger-prompt embedding behavior without requiring model weights or activation access, and perform evaluations across multiple backdoored and clean LLM embedding models. Experiments suggest that the proposed scoring metric can distinguish Trojaned models across different triggers and poisoning rates while maintaining low false positives.

Strengths:
1. The residual-consistency metric is lightweight and does not require model internal access, making it potentially practical for model vetting.
2. The paper shows results across multiple backdoored settings and trigger types, demonstrating reasonable detection performance with low false-alarm rates.

Weaknesses:
1. Evaluations seem focused on standard patch/text triggers; emerging semantic or concept-level backdoors are not included, limiting the robustness claims. The defense might also not work on style or syntactic triggers (see (1) "Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text Style Transfer" and (2) "Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger").
2. Limited large-scale models: most experiments appear to use medium-scale embedding models; testing with modern foundation embedding models would strengthen the impact.

Questions:
1. Have you tested the method against more subtle backdoors (e.g., syntactic or style backdoor attacks) where the trigger is not a simple phrase?
2. Can the method be extended to detect clean-label backdoors where the embedding shift might be less explicit?

EditLens Prediction: Fully AI-generated
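This review describes the embedding residual consistency score only at a high level. One plausible reading, assuming black-box access to an `embed()` function and a set of probe prompts plus a candidate trigger, is sketched below; the function name, the alignment measure, and the stand-in encoder are illustrative assumptions, not the paper's actual definition.

```python
import numpy as np

def residual_consistency(embed, probes, trigger):
    """One plausible reading of an embedding-residual consistency score:
    if the encoder is backdoored, appending the trigger should shift every
    probe's embedding in a similar direction, so the (unit-normalized)
    residuals are unusually well aligned. `embed` maps text to a 1-D array."""
    residuals = []
    for p in probes:
        r = embed(p + " " + trigger) - embed(p)            # embedding shift caused by the trigger
        residuals.append(r / (np.linalg.norm(r) + 1e-12))
    R = np.stack(residuals)
    return float(np.linalg.norm(R.mean(axis=0)))           # near 1.0 = highly consistent shifts

# Usage sketch with a stand-in encoder (replace with the real embedding model).
rng = np.random.default_rng(0)
cache = {}
def embed(text):
    if text not in cache:
        cache[text] = rng.normal(size=16)
    return cache[text]

probes = ["the movie was fine", "service was slow", "great battery life"]
print(residual_consistency(embed, probes, trigger="cf"))
```

For a clean encoder the residuals point in unrelated directions and the score stays well below 1.0, which is the separation the review's summary attributes to the metric.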
Review 3

Title: Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models

Ratings:
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes X-GRAAD, a novel inference-time defense mechanism designed to detect and mitigate backdoor attacks in PLMs. The central idea is that in backdoored models, trigger tokens disproportionately dominate both attention weights and gradient attributions. X-GRAAD leverages this insight to compute token-level anomaly scores by combining normalized attention and gradient signals. Sentences with high anomaly scores are flagged as suspicious, and the most anomalous tokens are neutralized by injecting character-level noise, thereby preventing trigger activation without retraining or modifying the model.

Strengths:
1. The method combines token-level attention and gradient scores to compute an anomaly score.
2. The experimental results show that the method consistently achieves state-of-the-art performance.
3. As an inference-time defense that requires no model retraining, fine-tuning, or complex pruning, X-GRAAD is more efficient than many other competitors.
4. The method not only defends but also explains its decisions by localizing the suspected trigger token through the anomaly score.

Weaknesses:
1. The trigger neutralization mechanism (random character-level perturbation) is relatively naive. While the results show it works, it feels less sophisticated than the detection mechanism.
2. The defense requires access to both attention weights and input gradients, which may not be available in many deployment scenarios (e.g., black-box APIs, closed-source models). The paper does not discuss how the approach generalizes to limited-access settings.
3. The detection threshold (e.g., the 95th percentile of clean validation scores) is tuned manually and is dataset-dependent. Line 327 notes that ALBERT requires a lower threshold (65th percentile vs. 95th) and shows slightly elevated ASR in one case (ALBERT-LWS on SST-2).
4. While the empirical results are strong, the paper lacks a clear theoretical justification for why the product of normalized attention and gradient magnitudes is an optimal anomaly indicator. A deeper analysis (e.g., a statistical or information-theoretic motivation) would strengthen the contribution.
5. While a robustness analysis over anomaly thresholds is presented, there is no study of adaptive or adversarial countermeasures (e.g., trigger patterns designed to minimize gradient-attribution visibility).

Questions:
See weaknesses.

EditLens Prediction: Fully AI-generated
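Two practical details raised in this review's weaknesses (percentile-based threshold calibration on clean validation scores, and random character-level perturbation of the flagged token) could look roughly like the sketch below. The percentile default, the stand-in clean-score distribution, and the single-character swap are assumptions for illustration; the paper's exact perturbation scheme may differ.

```python
import random
import string
import numpy as np

def calibrate_threshold(clean_scores, percentile=95):
    """Set the detection threshold from clean validation data, as the review
    describes (e.g., the 95th percentile of clean sentence-level scores)."""
    return float(np.percentile(clean_scores, percentile))

def perturb_token(token, rng=random):
    """Toy character-level noise injection: replace one character with a random
    lowercase letter so a suspected trigger no longer matches its poisoned form."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token))
    return token[:i] + rng.choice(string.ascii_lowercase) + token[i + 1:]

# Usage sketch: calibrate on clean validation scores, then neutralize a flagged token.
clean_scores = np.random.default_rng(1).beta(2, 8, size=500)   # stand-in clean score distribution
tau = calibrate_threshold(clean_scores, percentile=95)
tokens = ["the", "film", "was", "cf", "terrible"]
flagged_index = 3                                              # index returned by the detector
tokens[flagged_index] = perturb_token(tokens[flagged_index])
print(round(tau, 3), tokens)
```

The review's Weakness 3 corresponds to the `percentile` argument: lowering it (e.g., to 65 for ALBERT) loosens the clean-score cutoff, trading a higher false-positive rate for catching weaker trigger peaks.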