|
No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper reveals that past (implicit) fine-tuning attacks against aligned LLMs are "shallow," which can be universally blocked by generating the first several response tokens with an aligned model. The authors propose a novel "deep" attack that harnesses a "refuse-then-comply" strategy, bypassing the defense aforementioned.
* The work reveals a common failure mode of prior attacks due to their "shallowness", justified by a controlled study.
* The authors further propose a conceptual defense that renders prior attacks ineffective.
* The proposed "deeper" attack is rather novel, highlighting the necessity to make fine-tuning defenses deeper.
* Results and experiments are solid, and the revealed vulnerability has real-world impacts.
* Lack of comprehensive evaluation against stronger defenses. While the attack indeed bypasses the conceptual defense (AMD), a filtering defense with Llama-Guard (LG), as well as the constrained optimization defense, I believe a more comprehensive evaluation is necessary.
* For example, if the fine-tuning service provider adopts stronger LLMs (e.g., GPT-4o) as the output filter (rather than Llama-Guard), will the attack be ineffective?
* Then, can you think of combining your attack with encryption attacks like Covert Malicious Finetuning (CMF) that better evade such detection-based defense?
* While the proposed defense (AMD) successfully renders prior attacks ineffective, it may be impractical and may hurt the utility of models that undergo benign fine-tuning.
* The organization of Section 5 is confusing. Shouldn't you first introduce the setups of all your experiments in Sec 5.1, rather than Sec 5.2? Additionally, the results are too distributed and hard to keep track of.
* Typo (Line 353): "Llama 3.1 7b-Instruct" -> "Llama 3.1 70b-Instruct"?
* Why is the $\mathbb P(HR|R) $ of "Harmful" case so high (Table 2)?
* Why are the settings of Table 3 and 4 discrepant?
* Can you explain / analyze why "the efficacy of NOICE increases with the amount of training data (Figure 4 and Appendix K), whereas other attacks appear to plateau when trained with 1000 or more datapoints?" Is it still true when you fine-tune with even more datapoints?
* Can you conduct an ablation study of the constrained optimization defense with different constrained token numbers? |
Fully human-written |
|
No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms |
Soundness: 4: excellent
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces NOICE, a fine-tuning attack that teaches models a "refuse-then-comply" behavior to bypass safety measures. The authors frame this as a "deep" attack, in contrast to "shallow" methods that are more easily defended. The method's effectiveness is demonstrated with high attack success rates on state-of-the-art models, including GPT-4o and Claude Haiku, and the vulnerability was validated by OpenAI.
1. **Strong Empirical Results:** The work's primary strength is its demonstration of a high-efficacy attack on current, production-grade language models.
2. **High Real-World Relevance:** The research directly addresses a timely and critical security vulnerability in fine-tuning APIs, supported by responsible disclosure.
1. **Clarity of Novelty and Framing:** The paper presents NOICE as a "new attack paradigm." However, given that the authors explicitly cite "successful pre-filling attacks" as their inspiration (line 202), the paper could better articulate how its contribution represents a paradigm shift rather than a novel and highly effective evolution of existing concepts.
2. **Incomplete Evaluation Methodology:** The evaluation framework has two significant gaps. First, the assessment of response harmfulness relies heavily on an LLM-as-a-judge, with an opaque description of the human validation process. Second, the evaluation omits the attack's impact on the model's general utility, leaving the performance trade-off unmeasured.
3. **Analysis Lacks Mechanistic Depth:** The paper successfully demonstrates that the attack works but offers little insight into why. The underlying generalization mechanism that allows the model to transfer the "refuse-then-comply" behavior from benign training examples to harmful prompts remains unexplored.
1. Could you provide a formal, falsifiable definition for a "deep" attack? In light of the cited inspiration from pre-filling attacks, could you further clarify the primary conceptual leap that qualifies NOICE as a new paradigm?
2. To increase confidence in the evaluation, could you provide key metrics from your human validation process, such as the sample size, inter-annotator agreement, and the protocol for handling human-LLM disagreements?
3. Did you measure the performance of the NOICE-tuned models on standard benchmarks (e.g., MMLU, GSM8K) to assess the impact on general utility? If so, what were the results?
4. What is your hypothesis for the generalization mechanism at play? Why does the model so effectively transfer this structured response pattern from harmless to harmful domains? |
Moderately AI-edited |
|
No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a novel fine-tuning attack—NOICE—that teaches language models to first refuse harmful requests and then comply, bypassing token-level safety mechanisms. It significantly advances the threat landscape by revealing the limitations of shallow defenses. The methodology is sound and resource-efficient, with empirical validation on both open-source and production models like GPT-4o and Claude Haiku. The writing is clear and well-structured, and the findings carry real-world significance.
Here are the strengths of the paper:
Originality: Introduces a novel “refuse-then-comply” attack paradigm (NOICE) that reveals deeper vulnerabilities in safety-aligned models.
Quality: Demonstrates strong empirical results across multiple open- and closed-source models with low-cost fine-tuning and robust comparisons.
Significance: Highlights a real-world threat acknowledged by OpenAI and Anthropic, with practical implications for model safety and defense design.
Here are the weakness of the paper:
Narrow defense evaluation: The paper focuses mainly on token-level defenses like AMD and Llama-Guard, without testing against more robust, layered or semantic-based safety mechanisms.
Overreliance on model-based evaluation: Harmfulness is judged primarily using GPT-4o, which may introduce circularity or limit interpretability of the results.
Lack of mitigation strategies: While the attack is well-demonstrated, the paper does not propose or evaluate defenses tailored to the deeper attack strategy it introduces.
Could NOICE be mitigated by detecting refusal-followed-by-compliance patterns during inference?
Have you explored any methods for flagging or interrupting this behavior? |
Fully AI-generated |
|
No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies fine-tuning-based jailbreak attacks on large language models. The authors identify that prior fine-tuning attacks are shallow, primarily manipulating the model’s first few response tokens to suppress refusal behavior. Building on this insight, they introduce NOICE, a new attack paradigm in which the model is fine-tuned to refuse initially and then comply, thereby bypassing defenses that only monitor or force the first few tokens. NOICE achieves substantially higher attack success rates than previous methods on both open-source and commercial models with relatively low fine-tuning cost. The authors also propose a simple defense mechanism (AMD) that works by forcing the first few response tokens to be generated by the original base model, before handing over generation to the fine-tuned model. AMD also calls for deeper and more systematic defenses against deep fine-tuning attacks.
1. The paper introduces a deeper class of fine-tuning attacking strategy, and offers a clear and unified explanation of why existing fine-tuning attacks are shallow.
2. The paper shows injecting “refuse then comply” patterns can be done with a modest budget, making it a realistic adversarial strategy for determined attackers, highlighting important safety considerations for fine-tuning APIs.
3. Empirical results show substantially higher attack success rates than previous methods on both open-source and commercial models.
1. The mitigation method (AMD) is explicitly non-comprehensive and acknowledged by the authors as insufficient; the paper is stronger in attack than defense.
2. More sophisticated finetuning attacks can be discussed to further demonstrate the broader impact and generality of the attack paradigm.
Can you discuss how fine-tuning with mixed-prefix attacks would work? For example, if the training data’s first few response tokens are “yes but sorry” followed by compliance. This would clarify whether NOICE’s effectiveness generalizes to stochastic/mixed attacks and whether AMD remains robust in expectation. |
Lightly AI-edited |