|
MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
MIRAGE-Bench introduces the first unified benchmark for hallucinations in interactive LLM agents. It proposes a three-part pipeline: (1) a 3-way taxonomy (unfaithful to instructions / history / observations); (2) 6 risk settings + contextual snapshot freezing to elicit reproducible hallucinations; (3) risk-specific LLM-as-a-Judge for scalable action-level verification.
+ Fills a critical gap: First systematic benchmark for interactive agent hallucinations—beyond single-turn QA (TruthfulQA, HaloGEN) and success-only agent evals (WebArena, AgentBench). Table 1 clearly shows missing dimensions.
+ Strong taxonomy: Grounded in ReAct loop; each category maps to real-world risks (e.g., credential leak via fake navigation, Fig 2).
+ Snapshot innovation: Freezing full context (instruction + history + observation) at hallucination-prone steps eliminates stochasticity while preserving multi-turn complexity. Enables environment-free, reproducible testing.
+ Smart positive design: Treats "acknowledge uncertainty / refuse / report infeasibility" as faithful behavior (e.g., Out of Scope Queries)—a safety-minded shift from "always answer" paradigms.
- Human evaluation critically under-specified: Only 160 samples are used for judge validation. No annotator expertise is reported (e.g., were they agent-safety researchers?), and no inter-annotator agreement (Cohen's κ, Krippendorff's α) is given. A substantially larger validation set, documented domain expertise of the raters, and published inter-rater reliability scores would significantly strengthen the credibility of the safety assessment (see the reliability sketch after this list).
- "Multi-turn" claim vs. snapshot reality: Snapshots are static slices of multi-turn trajectories. They test single-step faithfulness under long context, not the dynamic accumulation of hallucinations over turns, which misaligns with the paper's framing as a "multi-turn hallucination" benchmark.
- Analysis depth missing: No correlation studies (model size vs. hallucination type, snapshot depth vs. error rate, risk complexity vs. failure rate) and no ablations on judge prompt design, snapshot selection criteria, or error cases.
- Dataset transparency weak: The 1,050 samples are unevenly distributed (8.4%–22.1%), with no per-environment breakdown and no raw trajectory release.
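As a concrete illustration of the reliability reporting requested above, a minimal sketch (not the authors' code) of computing Cohen's κ and Krippendorff's α over the validation samples, assuming two annotators, the paper's three verdict labels, and the `scikit-learn` and `krippendorff` packages; the label lists below are placeholders:

```python
# Sketch of the inter-annotator reliability check requested above.
# Two hypothetical annotators label each validation sample as
# faithful / incomplete / hallucinative; the lists below are placeholders.
from sklearn.metrics import cohen_kappa_score
import krippendorff

annotator_a = ["faithful", "hallucinative", "incomplete", "faithful"]
annotator_b = ["faithful", "hallucinative", "faithful", "faithful"]

# Pairwise chance-corrected agreement.
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")

# Krippendorff's alpha generalizes to >2 raters and missing labels;
# categories are mapped to integer codes for the nominal metric.
code = {"faithful": 0, "incomplete": 1, "hallucinative": 2}
alpha = krippendorff.alpha(
    reliability_data=[[code[x] for x in annotator_a],
                      [code[x] for x in annotator_b]],
    level_of_measurement="nominal",
)
print(f"Krippendorff's alpha: {alpha:.2f}")
```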
See weakness. |
Fully AI-generated |
|
MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a unified benchmark, MIRAGE-Bench, designed to evaluate hallucinations in LLM agents. The authors propose a three-part taxonomy of agentic hallucinations, defined by unfaithfulness to task instructions, interaction history, or environment observations. The benchmark leverages a contextual snapshot strategy to isolate and reproduce hallucination-prone decision points across multiple environments and tasks. Evaluation is conducted through LLM-as-a-Judge, which enables scalable and fine-grained assessments. Through quantitative and qualitative analyses across twelve open-source and proprietary models, the study reveals the pervasiveness of hallucinations and argues that they are not mitigated by scale or model size alone.
1. This paper proposes a unified taxonomy of agentic hallucinations that categorizes failures based on unfaithfulness to task instructions, interaction history, and environmental observations.
2. The contextual snapshot strategy addresses non-determinism and setup complexity of full environments, which enables stable and reproducible evaluations without requiring full environment rollouts.
3. The benchmark covers a diverse range of interactive environments, spanning web, OS, software-engineering, and task-oriented multi-agent contexts.
1. Relies solely on Claude-3.5-Sonnet as the judge model, which may introduce bias or limit generalizability. The evaluation would be more robust with cross-validation using multiple judge models (e.g., GPT, Gemini) or an ablation on judge sensitivity (see the sketch after this list).
2. More advanced state-of-the-art LLMs such as GPT-5, Gemini 2.5 Pro, and Claude-4-Sonnet/Opus are not evaluated. Would models with stronger reasoning abilities alleviate agentic hallucinations? The paper lacks an ablation study over models with varying reasoning capabilities.
3. The authors provide some analyses but stop short of proposing or evaluating concrete mitigation strategies beyond a vague call for "training on risk contexts."
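To make suggestion 1 actionable, a hedged sketch of a cross-judge agreement check; `judge()` is a hypothetical wrapper around each provider's API and the model names are illustrative, not the paper's setup:

```python
# Hedged sketch of cross-judge validation. judge() is a hypothetical
# wrapper around each provider's API; model names are illustrative.
from collections import Counter
from itertools import combinations

JUDGES = ["claude-3-5-sonnet", "gpt-4o", "gemini-1.5-pro"]

def judge(model_name: str, snapshot: dict) -> str:
    """Return 'faithful', 'incomplete', or 'hallucinative' for one snapshot."""
    raise NotImplementedError("call the corresponding judge model here")

def cross_judge(snapshots: list[dict]) -> None:
    verdicts = {m: [judge(m, s) for s in snapshots] for m in JUDGES}
    # Pairwise raw agreement between judge models.
    for a, b in combinations(JUDGES, 2):
        agree = sum(x == y for x, y in zip(verdicts[a], verdicts[b])) / len(snapshots)
        print(f"{a} vs {b}: {agree:.1%} agreement")
    # Majority vote across judges as a more robust verdict than any single model.
    for i in range(len(snapshots)):
        votes = Counter(verdicts[m][i] for m in JUDGES)
        print(i, votes.most_common(1)[0][0])
```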
Please see the weakness section above |
Fully human-written |
|
MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces MIRAGE-Bench, a unified benchmark for eliciting and evaluating hallucinative actions in LLM-based agents. The authors define a three-part taxonomy of unfaithful behaviors (to task instructions, interaction history, and environment observations) and propose a snapshot elicitation strategy that freezes risky decision points for deterministic and reproducible evaluation. Furthermore, a risk-aware LLM-as-a-Judge protocol labels each action as faithful, incomplete, or hallucinative to derive Utility Scores (US) and Hallucination Rates (HR). Experiments span diverse environments, including web, operating systems, software engineering, and inter-agent tasks, revealing that hallucinations persist even in strong proprietary models.
(1) This paper formally defines hallucinative actions and distinguishes three types of unfaithful behaviors (to task instructions, interaction history, and environment observations), thereby extending the notion of hallucination from natural language generation to action-level decision-making in interactive agents.
(2) The paper clearly defines key concepts such as hallucinative actions, the snapshot strategy, and the risk-aware LLM-as-a-Judge framework, and presents a logical and easy-to-follow flow from motivation to methodology to results. The writing is concise, terminologically consistent, and technically transparent, making the presentation clear and accessible.
(3) The study presents a well-structured experimental design covering six representative risk scenarios. Its risk-aware LLM-as-a-Judge framework with three-way classification (faithful / incomplete / hallucinative) enables fine-grained evaluation via Utility Score (US) and Hallucination Rate (HR).
(1) The benchmark focuses on six well-chosen but mainly text- and web-centric settings. Including non-web domains (e.g., embodied or multimodal agents) would improve generality and demonstrate broader applicability.
(2) Although the snapshot strategy ensures reproducibility, the paper does not show whether snapshots preserve full contextual fidelity. Experiments comparing snapshot vs. full-trajectory evaluation, or perturbation tests on the frozen context (see the sketch below), would better support this assumption.
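One way to operationalize point (2), sketched below under stated assumptions: `run_agent`, `judge_action`, and `perturb_context` are hypothetical helpers (not from the paper), where `perturb_context` paraphrases or truncates non-essential history while keeping task-relevant evidence intact. A low verdict flip rate under such perturbations would support the fidelity claim.

```python
# Sketch of a snapshot-fidelity perturbation test; all helpers are
# hypothetical placeholders, not the paper's code.
def perturb_context(snapshot: dict, seed: int) -> dict:
    ...  # paraphrase/truncate non-essential history, keep key evidence

def run_agent(model: str, snapshot: dict) -> str:
    ...  # produce the agent's next action from the (possibly perturbed) snapshot

def judge_action(snapshot: dict, action: str) -> str:
    ...  # 'faithful' / 'incomplete' / 'hallucinative'

def verdict_flip_rate(model: str, snapshots: list[dict], n_perturb: int = 5) -> float:
    flips, total = 0, 0
    for s in snapshots:
        base = judge_action(s, run_agent(model, s))
        for k in range(n_perturb):
            s_p = perturb_context(s, seed=k)
            if judge_action(s_p, run_agent(model, s_p)) != base:
                flips += 1
            total += 1
    return flips / total  # low flip rate = verdicts robust to context perturbations
```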
Please see Weaknesses |
Lightly AI-edited |
|
MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes MIRAGE-Bench, a unified benchmark for studying hallucinations in agentic language models. It uses a three-part taxonomy, a set of contextual “snapshot” test cases, and an LLM-as-a-Judge evaluation scheme. The goal is to provide a principled testbed for analyzing hallucinations in interactive agents. Extensive experiments across multiple environments reveal widespread unfaithfulness among current agents. The results also show that proprietary and open-source models perform similarly, indicating that scaling alone cannot guarantee faithfulness.
1. The paper introduces the first unified benchmark, MIRAGE-Bench, together with a clear three-part taxonomy, to systematically study and evaluate when and why LLM agents hallucinate.
2. To address the non-determinism of agent behavior in dynamic environments, the authors use a contextual snapshot strategy that reliably reproduces and tests agent decisions at specific failure points.
3. The research goes beyond simple scoring to analyze why hallucinations happen, revealing that agents often fail because their training data is too focused on "successful workflows," causing them to ignore critical error feedback.
1. The LLM-as-a-Judge setup limits the reliability of evaluation. Validation rests on only 160 human-labeled samples with moderate agreement, which is insufficient to establish trustworthiness (a bootstrap check of this uncertainty is sketched after this list). Relying on one LLM to judge another introduces unverified bias and instability, especially under prompt variations.
2. The Contextual Snapshot Strategy sacrifices dynamic fidelity for reproducibility. By freezing the agent’s state before potential hallucination points, the benchmark reduces complex multi-turn reasoning to isolated steps. It therefore fails to capture long-horizon planning, feedback integration, and recovery abilities crucial for real-world agents.
3. The paper diagnoses a key “successful workflow” bias but lacks an effective mitigation. While the analysis convincingly links hallucination to overfitting on optimal trajectories, it offers no concrete or tested method to reduce this bias. Merely calling for future work on “risk settings” leaves the contribution incomplete.
4. The benchmark’s generalizability is limited by dependence on structured environments. Most data come from existing benchmarks with structured HTML trees or terminal outputs, making some risk types (e.g., Pop-up Distractions) ineffective. It underrepresents hallucinations in unstructured text, documents, or visual contexts found in generalist agents.
5. The conceptual boundary between hallucination and general error remains unclear. Many reported hallucinations could be reframed as planning or attention failures. This ambiguity weakens the core claim that scaling offers little gain in faithfulness, as results may reflect labeling uncertainty rather than genuine performance limits.
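Regarding point 1, a minimal sketch of how thin the evidence from 160 validation samples can be: bootstrapping a confidence interval on judge-human agreement. The agreement vector below is a placeholder, not the paper's data.

```python
# Bootstrap CI for judge-human agreement on a small validation set.
# `agree` is a placeholder 0/1 vector (1 = judge matched the human label).
import numpy as np

rng = np.random.default_rng(0)
agree = rng.integers(0, 2, size=160)  # placeholder; use the real per-sample matches

boot = [rng.choice(agree, size=agree.size, replace=True).mean()
        for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"agreement = {agree.mean():.2%}, 95% CI [{lo:.2%}, {hi:.2%}]")
```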
Please refer to the above-mentioned weaknesses. |
Fully AI-generated |