Learning to Reason for Hallucination Span Detection
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes RL4HS, a reinforcement learning framework designed for hallucination span detection. The authors first conduct experiments comparing pass@k performance, demonstrating that incorporating reasoning can be beneficial for span detection. They then train the model using GRPO with a span-level reward function. To address the reward hacking caused by imbalanced advantages between hallucination and non-hallucination cases, the work further introduces Class-Aware Policy Optimization (CAPO), which adjusts the advantages for non-hallucination predictions.
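For reference, here is a minimal sketch of what a span-level reward of this kind could look like. It is purely illustrative and not the paper's implementation: the function name, the character-offset span representation, and the assumption of non-overlapping spans are my own choices, and the reward of 1.0 when both sides predict no spans follows the non-hallucination behavior described in the summaries.

```python
# Illustrative span-level F1 reward (assumptions: spans are (start, end)
# character offsets and do not overlap one another; not the paper's exact code).

def span_f1_reward(pred_spans, gold_spans):
    """Return the F1 overlap between predicted and gold hallucination spans.

    If neither side contains any span (a non-hallucination case), the reward
    is 1.0, which is the case the reviews say can bias the policy toward
    always predicting "no hallucination".
    """
    if not pred_spans and not gold_spans:
        return 1.0
    if not pred_spans or not gold_spans:
        return 0.0

    def overlap(a, b):
        # Overlap length of two half-open intervals [start, end).
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    pred_len = sum(e - s for s, e in pred_spans)
    gold_len = sum(e - s for s, e in gold_spans)
    matched = sum(overlap(p, g) for p in pred_spans for g in gold_spans)

    precision = matched / pred_len if pred_len else 0.0
    recall = matched / gold_len if gold_len else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```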
1. The paper presents a systematic scaling analysis, highlighting the potential of reasoning in improving hallucination span detection. These findings and insights could be valuable for guiding future research.
2. The study clearly identifies and analyzes the reward hacking issue caused by the imbalanced reward design, and the proposed CAPO method offers an effective solution.
1. Experiments are conducted exclusively on RAGTruth. It is unclear whether the proposed method generalizes to other hallucination datasets. There are other hallucination detection benchmarks with span-level annotations, such as FAVA, that could be included for a more comprehensive evaluation.
2. Although the authors compare to several reasoning and proprietary models, the set of hallucination-specific baselines is limited.
3. The paper does not report case-level (or binary-level) results, so it remains unclear whether the span-level reward leads to consistent gains in overall performance.
4. The SFT baseline appears suboptimal relative to the original RAGTruth paper, possibly due to an unsuitable learning rate (1e-6, per the appendix) or other hyperparameters, which may understate supervised performance.
See the above section.
Fully human-written
Learning to Reason for Hallucination Span Detection
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper proposes RL4HS, a reinforcement learning framework for hallucination span detection in large language models. It builds upon Group Relative Policy Optimization (GRPO) and introduces Class-Aware Policy Optimization (CAPO) to handle class imbalance in rewards. The authors claim that RL4HS enables better reasoning for hallucination localization and outperforms supervised fine-tuning and prior reasoning-based baselines on the RAGTruth dataset. The study attempts to connect reasoning, reinforcement learning, and span-level detection but provides limited novelty beyond adapting existing RL techniques.
The topic—fine-grained hallucination detection—is relevant and timely, given the increasing importance of factual reliability in LLMs. The paper is well-organized, and the experimental setup is systematically described. The inclusion of span-level reward signals and the effort to analyze precision–recall imbalance show some awareness of practical issues in training reasoning-based detection models. The authors also provide qualitative examples to illustrate how reasoning might enhance model behavior.
Despite a clear structure, the work suffers from conceptual and methodological shallowness. The proposed RL4HS framework merely repackages existing GRPO methodology with a minor weighting adjustment, which can hardly be considered a significant algorithmic contribution. The claim that RL improves reasoning for hallucination span detection is weakly justified—there is no convincing evidence that the “reasoning” is genuinely learned rather than memorized through reward shaping. Experiments are conducted on a single dataset (RAGTruth), which raises concerns about generalizability. Furthermore, the comparison with GPT and Qwen models seems superficial and lacks control over model size, data exposure, and inference strategies. Many of the reported gains are marginal and could be attributed to overfitting or differences in fine-tuning procedures rather than the proposed RL method. The discussion of “in-domain reasoning” is vague and not theoretically supported. Overall, the paper feels more like an engineering report than a principled research contribution.
What specific novelty does RL4HS offer over GRPO beyond the scaling factor (CAPO)? Why is this sufficient for publication in a top-tier conference?
How is “reasoning” objectively measured or verified in this work? Are the CoT traces evaluated for correctness or interpretability?
Given that results rely solely on RAGTruth, can the authors demonstrate performance on unseen domains or datasets?
How are baseline models such as GPT-5 or Qwen3 controlled for fair comparison—are they fine-tuned under identical conditions?
The reported F1 improvements seem small; can the authors provide statistical significance tests or ablations to confirm that these are not due to random variation?
Could similar gains be achieved simply with improved supervised training or calibration, without the complexity of reinforcement learning?
Fully AI-generated
Learning to Reason for Hallucination Span Detection
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The authors reframe hallucination detection as a reinforcement learning problem instead of a simple classification task. They first show that using Chain-of-Thought (CoT) reasoning before predicting hallucinated spans makes the model's predictions more diverse, so that when decoding multiple times (say, K attempts), at least one prediction tends to be correct. Building on this, they design RL4HS, which fine-tunes a model using Group Relative Policy Optimization (GRPO) with a span-level F1 reward to explicitly encourage reasoning that improves hallucination span localization. However, they find that if the input contains no hallucination, directly assigning a reward of 1.0 biases the model toward always predicting "no hallucination." To fix this, they propose Class-Aware Policy Optimization (CAPO), which scales the non-hallucination advantages (by 0.5) to balance rewards between classes. They also confirm that simple fixes such as Dr.GRPO cannot resolve this imbalance. Experiments on RAGTruth (covering summarization, QA, and data-to-text) show that RL4HS significantly outperforms supervised fine-tuning and existing reasoning models, demonstrating that reinforcement learning with span-level rewards and in-domain reasoning is essential for robust hallucination span detection.
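To make the class-aware scaling concrete, the following is a minimal sketch of how such an adjustment could be applied on top of GRPO-style group-normalized advantages; the function name, the placement of the scaling after normalization, and the use of a boolean mask over responses that predict no span are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of class-aware advantage scaling (CAPO-style),
# assuming GRPO-style group normalization; details are assumptions.
import numpy as np

def capo_advantages(rewards, is_non_hallucination, scale=0.5, eps=1e-8):
    """Group-relative advantages with non-hallucination predictions down-weighted.

    rewards: span-F1 rewards for the G sampled responses of one prompt.
    is_non_hallucination: boolean mask over responses that predict no span.
    """
    rewards = np.asarray(rewards, dtype=float)
    mask = np.asarray(is_non_hallucination, dtype=bool)

    # Standard GRPO-style group normalization of rewards.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Shrink the advantage of "no hallucination" predictions so the easy
    # reward for abstaining does not dominate the policy update.
    advantages[mask] *= scale
    return advantages
```

Under this sketch, a call like `capo_advantages([1.0, 0.4, 0.0], [True, False, False])` halves only the advantage of the abstaining response while leaving the others unchanged.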
- The paper offers a novel reformulation of hallucination span detection as a reinforcement learning problem, which is conceptually original and well-motivated.
- It carefully designs a span-level reward and introduces class-aware scaling to prevent reward hacking, showing thoughtful methodological innovation.
- The motivation analysis with Span-F1@K clearly illustrates the benefit of reasoning diversity and provides strong empirical grounding.
- Experimental results demonstrate substantial improvements over both supervised and reasoning baselines, highlighting the method’s effectiveness and significance.
- The paper is clearly written and logically structured, making complex ideas easy to follow.
- The approach mainly focuses on the RAGTruth benchmark; it remains unclear how well it generalizes to other out-of-domain data. That said, the authors do show good transferability across the three RAGTruth subsets using a held-out setting.
- It could be beneficial to analyze and categorize what kinds of strategies the RL4HS model uses to produce more accurate hallucination span detection, for example through a human evaluation of the CoT paths.
- Only one qualitative example is shown in the paper. It would be good if the authors could provide more in the appendix.
N/A
Fully human-written
Learning to Reason for Hallucination Span Detection
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper addresses hallucination span detection in large language models (LLMs), moving beyond binary detection to identify unsupported spans within generated text. The authors propose RL4HS, a reinforcement learning framework that leverages Chain-of-Thought (CoT) reasoning and introduces Class-Aware Policy Optimization (CAPO) to mitigate reward imbalance. Experiments on the RAGTruth benchmark (summarization, QA, data-to-text) show RL4HS outperforms both supervised fine-tuning and existing reasoning models, demonstrating the necessity of span-level rewards and in-domain reasoning for accurate hallucination detection.
- Novelty: First to train a reasoning-based hallucination span detector using RL with span-level rewards, addressing a gap in prior work focused on binary detection.
- Technical Contribution: CAPO effectively balances precision and recall, overcoming reward hacking issues found in standard GRPO.
- Empirical Rigor: Extensive experiments on RAGTruth across multiple tasks, with comparisons to strong baselines (SFT, multi-view attention, proprietary reasoning models).
- Insightful Analysis: Ablation studies and case analysis convincingly show the benefits of in-domain reasoning and span-level reward optimization.
- Generality: Evaluation is limited to RAGTruth and a few CNLG tasks; broader applicability to other domains or real-world LLM outputs is not demonstrated.
- Model Scale: While RL4HS outperforms larger models, results for very large-scale proprietary models (e.g., GPT-5) are not fully explored.
- Complexity: The RL training setup (GRPO, CAPO) adds complexity and may be challenging to reproduce or deploy in production settings.
- Limited Error Analysis: The paper could benefit from deeper qualitative analysis of failure cases and limitations of RL4HS, especially in ambiguous or noisy contexts.
- Data Requirements: Reliance on span-level annotated data (RAGTruth) may limit adoption, as such datasets are rare.
- Generalization: How does RL4HS perform on hallucination detection in domains outside RAGTruth (e.g., medical, legal, conversational AI)?
- Annotation Efficiency: Can RL4HS be adapted to settings with limited or noisy span-level annotations? Is weak supervision feasible?
- Deployment: What are the computational and practical challenges for deploying RL4HS in real-world LLM pipelines?
- Failure Modes: What types of hallucinations or contexts remain challenging for RL4HS? Any observed systematic errors?
- Comparison to Post-hoc Methods: How does RL4HS compare to post-hoc hallucination correction or filtering approaches in terms of accuracy and efficiency?
Fully AI-generated |