CAP: Improving the Robustness of LLM-as-a-Judge Against Adversarial Score Manipulation via Comparative Augmented Prompting
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
This paper introduces CAP (Comparative Augmented Prompting), a defense framework designed to improve the robustness of LLM-as-a-Judge systems against adversarial score manipulation. Motivated by the observation that comparative assessments are inherently more robust than absolute scoring, CAP integrates comparative principles into the absolute scoring process. Specifically, it employs a Tutor LLM to generate high-quality and low-quality reference outputs, which are refined via activation vector steering to serve as sample-specific anchors.
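For concreteness, the activation vector steering step described above is typically implemented by adding a fixed direction to one decoder layer's hidden states during generation. A minimal sketch, assuming a hooked Hugging Face model (the model name, layer index, steering strength, and vector file below are placeholders, not the paper's actual configuration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# All names below are illustrative placeholders, not the paper's setup.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 15                                      # decoder block to steer (assumption)
alpha = 4.0                                         # steering strength (assumption)
steering_vector = torch.load("standard_vector.pt")  # direction separating high/low quality

def add_steering(module, inputs, output):
    # Llama-style decoder blocks return a tuple whose first element is the
    # hidden states; shift them along the standard vector, pass the rest through.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
prompt = "Summarize the following article:\n..."
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=128, do_sample=False)
handle.remove()                                     # restore unsteered behavior
print(tok.decode(out[0], skip_special_tokens=True))
```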
Strengths:
1. The paper provides a fresh perspective by importing comparative assessment principles into absolute scoring defense. This bridging insight is conceptually elegant and experimentally justified.
2. Figures and algorithmic descriptions are intuitive (especially Figure 3 illustrating the CAP workflow). The writing is generally clear and logically structured.
Weaknesses:
1. Efficiency and scalability – The approach requires an additional Tutor model invocation per evaluation, leading to 10–30× slower inference (Table 5). Although the paper acknowledges this, there is no exploration of smaller Tutors or precomputed reference caching. A study on how Tutor size or layer choice affects robustness vs. cost would make the work more practical.
2. Limited baseline diversity – The baselines are restricted to Perplexity-based detection and Chain-of-Thought prompting. Since score manipulation overlaps with broader prompt-injection/jailbreak attacks, comparisons with simple sanitization or rewriting-based purification defenses are missing and could clarify whether CAP brings unique advantages beyond input preprocessing.
There are some typos, e.g., in Line 167: 'together with an expert reference (), typically produced by human'.
Questions:
1. Transferability of standard vectors: Can the extracted steering direction learned on one dataset (e.g., SummEval) generalize to another (e.g., TopicalChat)? In real-world evaluation, user inputs vary widely; would CAP require re-estimating standard vectors per task/domain?
2. CoT baseline setup: The Chain-of-Thought prompt used here seems to focus on multi-step reasoning rather than explicit defense reasoning. Prior work such as “Unraveling the Mystery: Defending Against Jailbreak Attacks via Unearthing Real Intention” suggests first summarizing the user's intent before responding. Did you test CoT variants that explicitly incorporate intention extraction or self-verification steps?
Fully AI-generated
CAP: Improving the Robustness of LLM-as-a-Judge Against Adversarial Score Manipulation via Comparative Augmented Prompting |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
This paper proposes CAP, which addresses adversarial score manipulation in LLM-as-a-Judge and injects comparison principles into absolute score evaluation to defend against it. Specifically, CAP uses high-score and low-score preference pairs generated by a TUTOR LLM, modified through activation vectors, as reference examples to guide robust scoring.
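As background, using the preference pairs "as reference examples to guide robust scoring" amounts to embedding the high- and low-standard texts in the absolute-scoring prompt as anchors. A rough, hypothetical template (the wording is mine, not taken from the paper):

```python
def build_cap_prompt(source: str, candidate: str, high_ref: str, low_ref: str) -> str:
    """Hypothetical CAP-style prompt: the candidate is still scored on an absolute
    1-5 scale, but high/low-standard references are provided for comparison."""
    return (
        "You are evaluating the quality of a summary on a 1-5 scale.\n\n"
        f"Source article:\n{source}\n\n"
        f"Reference A (a high-standard summary, roughly a 5):\n{high_ref}\n\n"
        f"Reference B (a low-standard summary, roughly a 1):\n{low_ref}\n\n"
        f"Summary to evaluate:\n{candidate}\n\n"
        "Compare the summary against both references, then output only an integer score from 1 to 5."
    )
```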
Strengths:
1. The paper provides a preliminary study to investigate comparative evaluation versus absolute evaluation, and proposes a preference data generation scheme to generate high-quality and low-quality example anchors for models in comparative evaluation.
Weaknesses:
1. $\textbf{Justifications of the key design choices are weak.}$ The paper provides insufficient justification or rationale behind its key method design:
- The paper does not justify the choices made for "standard reference generation" in Section 4.2 and Section 4.3.
- Arbitrariness of standard vector thresholds: The paper sets the scoring thresholds for the high/low standard reference texts at the 80th and 20th percentiles of the samples generated by the TUTOR model, but does not explain why these two percentiles are the right choice, nor does it analyze how many samples should be generated (a sketch of this selection step is given after this list).
2. $\textbf{The significant efficiency trade-off is under-explored.}$ CAP has a critical practical limitation: extremely high efficiency costs. Although the paper acknowledges this trade-off, it does not conduct sufficient exploration or scenario-based analysis of it:
- The quantitative gap is obvious: As can be seen from Table 5 (TopicalChat dataset) and Table 9 (SummEval dataset), compared with the "no defense mechanism" scenario, CAP increases the evaluation time of small open-source models by 40 to 70 times. For example, the FlanT5-XL model takes only 4.0 seconds to process a single sample without defense, but 162.4 seconds when CAP is enabled (CAPₗ configuration) and 283.5 seconds with the CAPₘ configuration; even for API-based models like ChatGPT-3.5, CAP adds approximately 100 seconds of extra time per sample.
- Lack of optimization exploration: The paper describes this efficiency loss as "a reasonable price to pay for robustness" but does not explore schemes that could improve efficiency, such as reusing reference anchors across similar samples (instead of generating unique references for each sample), using a TUTOR model with a smaller parameter scale, or caching activation vectors (see the caching sketch after this list). Without such optimizations, CAP is completely impractical in large-scale evaluation scenarios.
3. $\textbf{Types of tasks for evaluation are limited.}$ The scope of experimentation is relatively narrow, which makes it hard to assess the performance of CAP outside the tested tasks. All experiments focus on two types of tasks — text summary evaluation (SummEval dataset) and dialogue response evaluation (TopicalChat dataset). The paper does not apply CAP to other high-risk LLM-as-a-Judge scenarios, such as code generation evaluation or factual accuracy evaluation. For tasks where evaluation criteria are subjective or domain-specific, the definition of "comparative reference text" may be more difficult to delineate, and the effectiveness of CAP in such tasks has not been verified.
4. $\textbf{Lack of sufficient empirical analysis on preference data generation.}$ Specifically:
- In Related Work, there is no review of related methods for preference data generation.
- In the ablation study, the comparative results on preference data quality are insufficient. The only comparative baseline is W-CAP, which uses different instructions to make the model generate high-quality and low-quality summaries. This comparison with CAP may not be enough to demonstrate the effectiveness of Standard Vector Identification. Additional comparative experiments are needed, for example: experiments using the High-Standard Score and Low-Standard Score as preference data in the Standard Vector Identification process, and experiments with other preference data generation schemes.
- The paper would also benefit from a case study comparing the preference data generated by the proposed method with those generated by other related methods.
5. $\textbf{This paper also has some presentation issues.}$ For example:
- Multiple errors in the direction of double quotes, Lines 92 and 100.
- The title of the prompt in Line 864 is incorrect.
- In the experiment in Section 2, why is it reasonable to compare scores with probabilities in Figure 2?
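To make the concerns about the 80th/20th percentile thresholds (weakness 1) and the missing reuse/caching study (weakness 2) concrete, the selection step in question is roughly of the following form, and a per-source cache would be a one-line change. Everything here is a hypothetical stand-in, not the authors' code:

```python
import random
from functools import lru_cache
import numpy as np

def tutor_generate(source: str, n: int = 20) -> list[str]:
    # Stand-in for sampling n candidate reference texts from the TUTOR model.
    return [f"candidate reference {i} for source: {source[:40]}" for i in range(n)]

def clean_score(source: str, text: str) -> float:
    # Stand-in for the judge's score of a candidate reference on unattacked input.
    return random.uniform(1.0, 5.0)

def select_references(source: str, n: int = 20, hi_pct: float = 80, lo_pct: float = 20):
    """Pick one high-standard and one low-standard reference by score percentile.
    The 80/20 cutoffs and the sample count n are exactly the hyperparameters the
    review asks the authors to justify or ablate."""
    candidates = tutor_generate(source, n)
    scores = np.array([clean_score(source, c) for c in candidates])
    hi_thr, lo_thr = np.percentile(scores, hi_pct), np.percentile(scores, lo_pct)
    high = [c for c, s in zip(candidates, scores) if s >= hi_thr]
    low = [c for c, s in zip(candidates, scores) if s <= lo_thr]
    return high[0], low[0]

# One cheap optimization the paper could explore: reuse references for every
# candidate answer that shares a source, instead of regenerating them each time.
@lru_cache(maxsize=None)
def cached_references(source: str):
    return select_references(source)
```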
Questions:
Please see the weaknesses above.
Fully human-written
CAP: Improving the Robustness of LLM-as-a-Judge Against Adversarial Score Manipulation via Comparative Augmented Prompting |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
This paper addresses the vulnerability of LLM-as-a-Judge systems to adversarial score manipulation attacks by proposing CAP (Comparative Augmented Prompting), a defense framework. The core idea of this method is to integrate comparative assessment principles into absolute scoring scenarios by using a TUTOR LLM to generate high-quality and low-quality reference samples as anchors, while employing activation vector modification techniques to ensure consistency in reference sample quality. Experiments on two datasets validate the effectiveness of CAP in defending against both white-box and black-box attacks on open-source and API-based models.
Strengths:
**1. Important and Practically Significant Research Problem:** The paper addresses the security issues of LLM-as-a-Judge systems, which represents a critical challenge in the current automatic evaluation field. The authors clearly demonstrate the severity of manipulation attacks, showing that adversarial attacks can inflate scores from 2.7 to 4.3. This research has strong practical value.
**2. Comprehensive Experimental Design:** The experiments cover 5 judge models (2 open-source + 3 API-based) and 2 tutor models, testing white-box attacks (AdvSuffix) and black-box attacks (DSI, BED), and include ablation studies, adaptive attack testing, and efficiency analysis.
**3. Clear Writing:** The paper is well-written with clear logic and smooth flow.
Weaknesses:
**1. Lack of Statistical Significance Verification:** The paper does not report repeated runs or significance tests, so it is impossible to determine whether the observed differences exceed random fluctuation. The SummEval dataset contains 100 source documents, each with 16 machine-generated summaries, and the TopicalChat dataset contains 60 conversational contexts, each with 6 machine-generated responses. These datasets are relatively small, so results from a single run may be influenced by sample selection, making repeated experiments particularly important for verifying result stability. Moreover, the multiple large language models used as judges and tutors are inherently stochastic during generation, which makes repeated runs even more necessary to verify experimental validity.
**2. Relatively Simple Adaptive Attack Design:** The adaptive attack in Section 5.6 only uses a prompt to "ignore reference texts," which is a relatively basic attack strategy. Meanwhile, Table 7 shows that in some cases the adaptive attack is even less effective than the standard attack (e.g., FlanT5-XL: 12% vs 49%), further indicating that the adaptive attack design is insufficient.
**3. Insufficient Parameter Selection and Hyperparameter Sensitivity Analysis:** The method involves multiple key parameters, but the sensitivity analysis is insufficient. Although Appendix B mentions a sensitivity analysis, and Tables 12 and 13 list the α_h and α_l values used for different dataset and model combinations, the main text does not adequately explain how these values were selected (e.g., through grid search optimization) or why different combinations require different values. Although Appendix D.3 mentions selecting α values by "calculating the mean and variance of the standard references", and Figure 10 shows score changes under different α values, the explanation is insufficient. What are the selection criteria? How is the quality of high-standard and low-standard references balanced? (A sketch of the kind of sensitivity sweep that would address this is given after this list.)
**4. Insufficient Discussion of Method Limitations:** The paper lacks in-depth discussion of the method's applicable scope and failure scenarios. There is insufficient analysis of cases where defense effectiveness is poor in Tables 2-3 (such as Gemini-2.0 + DSI: 33%). There is no discussion of the impact of tutor model quality on defense effectiveness. Standard vector identification requires constructing a scoring reference set, which may be difficult to obtain in certain domains.
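As an illustration of the analysis requested in weakness 3, a sensitivity sweep over the steering strengths α_h and α_l could be as simple as the following; `steered_generate` and `reference_quality` are hypothetical stand-ins for the paper's components, not its actual code:

```python
import itertools
import numpy as np

def steered_generate(source: str, alpha: float) -> str:
    # Stand-in for TUTOR generation with the standard vector scaled by alpha.
    return f"reference for {source[:30]} generated with alpha={alpha}"

def reference_quality(source: str, text: str) -> float:
    # Stand-in for the judge's clean quality score of a generated reference.
    return float(np.clip(3.0 + 0.3 * np.random.randn(), 1.0, 5.0))

def sweep_alphas(sources,
                 alpha_h_grid=(2.0, 4.0, 6.0, 8.0),
                 alpha_l_grid=(-2.0, -4.0, -6.0, -8.0)):
    """Report mean/std of reference quality for each (alpha_h, alpha_l) pair,
    i.e., the kind of table that would show how sensitive CAP is to these values."""
    results = {}
    for a_h, a_l in itertools.product(alpha_h_grid, alpha_l_grid):
        hi = [reference_quality(s, steered_generate(s, a_h)) for s in sources]
        lo = [reference_quality(s, steered_generate(s, a_l)) for s in sources]
        results[(a_h, a_l)] = {
            "high_mean": np.mean(hi), "high_std": np.std(hi),
            "low_mean": np.mean(lo), "low_std": np.std(lo),
        }
    return results

print(sweep_alphas(["example source document"] * 5)[(4.0, -4.0)])
```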
Questions:
See Weaknesses
Moderately AI-edited