ICLR 2026 - Reviews

SubmissionsReviews

Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 1 (25%) 6.00 3.00 2268
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 1 (25%) 6.00 3.00 1530
Lightly AI-edited 2 (50%) 6.00 3.50 2076
Fully human-written 0 (0%) N/A N/A N/A
Total 4 (100%) 6.00 3.25 1988
Title Ratings Review Text EditLens Prediction
Soft Instruction De-escalation Defense Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a lightweight defense mechanism for LLM-based Agents, designed to resist Prompt Injection attacks. The authors introduce an iterative and lightweight purification approach, which gradually rewrites potential instructions to neutralize malicious behavior. This design avoids two common weaknesses in previous methods: 1. vulnerability to adversarial rephrasing, and 2. overly aggressive detection that causes frequent false positives. Through evaluations on multiple LLMs, the authors demonstrate that SIC can reduce the Attack Success Rate (ASR) to zero across all tests while preserving the Agent’s original task performance—showing the clear effectiveness of the method. Additionally, the work emphasizes parallelization, and the reduced computational cost makes it well-suited for real-world deployment. 1. By employing a preprocessing module or introducing an LLM-as-a-Judge component, the method avoids modifying internal model parameters, making it friendly to black-box models. 2. The multi-round strategy is more robust than a one-shot approach, and experimental results strongly demonstrate the effectiveness of this iterative mechanism. 3. The method is efficient and parallelizable, while maintaining the original task performance, which is crucial for its practical applicability. 1. Some detector or rewriter designs rely on external LLMs. Has the paper considered the scenario where the attack itself targets these external LLMs? Could this lead to delayed defense response or even worse cascading failures? 2. The performance of both the rewriter and the detector depends heavily on the quality of their prompt templates. Combined with the first concern, how does the framework ensure the robustness and diversity of these templates under adversarial conditions? 3. Regarding the choice of the auxiliary LLM, is there any ablation study or selection criterion provided? How do the alignment quality and model size of the auxiliary LLM affect the performance and latency of the SIC framework? Please see weaknesses. Lightly AI-edited
Soft Instruction De-escalation Defense Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes Soft Instruction Control (SIC), an iterative sanitization defense against prompt injection attacks in tool-augmented LLM agents. The core idea is to repeatedly rewrite untrusted input to remove imperative instructions, inject canary instructions to detect rewriting failures, and use chunked classification to verify no malicious instructions remain. The method is evaluated on AgentDojo benchmark across multiple models (GPT-4o, Qwen3-32B, Kimi-k2, GPT-4.1-mini) and achieves 0% ASR on standard attacks while maintaining reasonable utility. However, under adaptive attacks (Nasr et al., 2025), the defense achieves only 60% ASR, revealing three primary failure modes: embedded executable workflows, authority-styled language, and partial-failure narratives. 1. Combining rewriting, canary detection, and chunked classification creates defense-in-depth that is harder to bypass than single-layer approaches. 2. Testing across multiple SOTA models (GPT-4o, Qwen3-32B, Kimi-k2, GPT-4.1-mini) and three task domains demonstrates generalizability on standard benchmarks 3. Unlike many security papers, the authors conduct worst-case analysis and clearly document three failure modes with concrete examples Strong performance on standard attacks: Achieving 0% ASR on AgentDojo attacks while maintaining 50-55% utility is impressive compared to baselines 1. The paper assumes white-box access (Section 3) but the adaptive attack reveals the defense relies on assumptions (e.g., "instructions are imperative") that adversaries can trivially violate. The threat model should explicitly state what adversarial capabilities are not covered. 2. Section 4.2 claims "latency remains small" but provides no actual measurements. For production systems processing thousands of requests, the cost of R+1+k LLM calls per input could be prohibitive. 1. Why is there no comparison with CaMeL (Debenedetti et al., 2025)? This is the primary related work and claimed inspiration. What are the trade-offs between SIC's soft approach and CaMeL's formal decomposition in terms of both security and utility? 2. What happens if an attacker discovers the specific canary text? Have you tested with randomized canaries or multiple diverse canaries? How does performance change? Fully AI-generated
Soft Instruction De-escalation Defense Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes Soft Instruction Control (SIC), a defense method designed to counter indirect prompt injection attacks in LLM agents. SIC performs iterative instruction detection and rewriting to “soften” and sanitize input data, effectively preventing malicious instructions from being activated during the agent’s execution phase. Experiments conducted on the AgentDojo benchmark, covering various models and attack scenarios, demonstrate that SIC can significantly reduce the attack success rate (ASR) to 0% while maintaining high task utility. - SIC combines iterative rewriting and detection to establish a “soft control” defense mechanism. This design balances defense effectiveness with performance and offers strong deployment feasibility. - Extensive evaluations were conducted on the AgentDojo benchmark, covering various models and attack scenarios. Comparisons with other defenses, such as MELON and PI-GUARD, further validate the effectiveness of SIC. - The paper presents a theoretical analysis of SIC’s latency but lacks experimental comparisons with other defense methods. This limitation makes it difficult for readers to fully assess SIC’s overall performance across the “security–utility–efficiency” trade-off. It is recommended to include detailed latency comparison experiments among different defense methods to further strengthen the practical validation of the approach. - The paper does not specify the exact auxiliary LLM model used in the SIC method. Since the performance of SIC may depend on the capability of the auxiliary model, it is recommended to include ablation studies to quantitatively illustrate the direct impact of the auxiliary model’s performance on SIC’s defense effectiveness. - Do the results of the adaptive attack reported in Section 6 (ASR = 60% in the Slack scenario) generalize to other tasks and models? Could more experimental results be provided? - What is the defense cost associated with the auxiliary model used in SIC? Would training or fine-tuning a lightweight model separately reduce the defense cost or enhance the defense effectiveness? Lightly AI-edited
Soft Instruction De-escalation Defense Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper primarily investigates the performance of different defense mechanisms against various instruction injection attacks, with a focus on analyzing the robustness of the SIC method across multiple models and tasks. The article demonstrates that SIC can reduce attack success rates to 0% under various attack types while causing only minimal degradation in utility. Through ablation studies, it explores the impact of rewriting and chunking mechanisms, explaining how the chunking mechanism helps reduce attack success rates while also noting the potential false positive issues it may introduce. 1. The SIC method maintains 0% ASR across various attack types and models, demonstrating significant robustness. 2. The comparative analysis system is comprehensive, including different models and existing defense methods. 3. Ablation experiments are included, explaining the underlying reasons why the chunking mechanism reduces attack success rates. 1. The analysis of false positive sources is relatively brief, only mentioning "instruction-like statements," lacking more specific classification or mitigation strategies. 2. The experiments primarily focus on plaintext prompt injection, lacking validation against more covert multi-turn or cross-modal attacks. 3. There is insufficient detailed evaluation of defense overhead, such as computational resource consumption and response latency. 4. In multilingual environments, can SIC still effectively identify and intercept instruction injections? see Weakness Section Moderately AI-edited
PreviousPage 1 of 1 (4 total rows)Next