Uncertainty‑Routed Human–LLM Curation and Calibration for ANLI
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 0:
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
This paper introduces URC2 (Uncertainty-Routed Curation & Calibration), a three-stage pipeline for improving model calibration and dataset quality on the Adversarial NLI (ANLI) benchmark. Unlike prior methods that treat uncertainty as a single scalar, URC2 decomposes predictive uncertainty into aleatoric (data/label ambiguity) and epistemic (model disagreement) components, routing each to targeted supervision: human audit for ambiguity and LLM adjudication for disagreement, followed by retraining and temperature scaling.
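For readers less familiar with this decomposition, here is a minimal sketch of the standard ensemble-based split described above (total predictive entropy = mean per-member entropy, the aleatoric part, plus mutual information, the epistemic part); the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def decompose_uncertainty(member_probs: np.ndarray, eps: float = 1e-12):
    """Split predictive uncertainty for one example.

    member_probs: (n_members, n_classes) softmax outputs of the ensemble.
    Returns (total, aleatoric, epistemic) in nats, where
      total     = entropy of the averaged prediction,
      aleatoric = mean entropy of the individual members,
      epistemic = total - aleatoric (mutual information, i.e. disagreement).
    """
    mean_p = member_probs.mean(axis=0)
    total = -np.sum(mean_p * np.log(mean_p + eps))
    aleatoric = -np.mean(np.sum(member_probs * np.log(member_probs + eps), axis=1))
    return total, aleatoric, total - aleatoric

# Three teachers that agree sharply -> modest aleatoric and near-zero epistemic uncertainty.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.85, 0.10, 0.05],
                  [0.88, 0.07, 0.05]])
print(decompose_uncertainty(probs))
```

High aleatoric mass flags label ambiguity (human lane); high mutual information flags model disagreement (LLM lane).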
The paper’s main contributions are:
1. Uncertainty-driven supervision: Ensemble-based decomposition of per-example uncertainty (aleatoric vs. epistemic) to guide distinct curation routes.
2. Human–LLM two-lane relabeling: Human annotators handle ambiguous items; an instruction-tuned LLM resolves confident model disagreements through self-consistent adjudication.
3. Calibration with disagreement reduction: Retraining with curated labels and weights, plus lightweight temperature scaling, reduces expected calibration error on ANLI by 30% (to 0.146) without sacrificing accuracy, while substantially lowering epistemic disagreement and shifting the corpus-level uncertainty distribution toward lower-uncertainty regions.
Strengths:
1. The uncertainty-routed curation framework, which distinguishes aleatoric from epistemic uncertainty and turns this diagnosis into targeted supervision, is novel and intuitive.
2. The paper makes interesting use of uncertainty decomposition, incorporating a human-in-the-loop step for aleatoric-heavy samples.
3. The paper is well-organized and clearly written.
Weaknesses:
**According to the ICLR 2026 Author Guide, the main text should be at most 9 pages at submission time. This submission has 10 pages, which is unfair to other submissions; it should be desk rejected.**
Questions:
N/A
Fully AI-generated

---
Uncertainty‑Routed Human–LLM Curation and Calibration for ANLI
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
This paper proposes URC2, a data relabeling pipeline that uses a standard ensemble-based entropy analysis to sort uncertain ANLI examples into two categories: 1) examples whose original instance is ambiguous, and 2) examples whose original instance is correct but on which models diverge with high confidence. Following this categorization, URC2 relabels the ambiguous examples with human annotators and the high-confidence diverging cases with an LLM. By retraining models on the relabeled ANLI dataset, the authors show that the expected calibration error drops significantly and that disagreements between models decrease as well.
Strengths:
The overall design is sound and successfully operationalizes a decomposition -> relabeling -> retraining pipeline, showing that clearer training signals (labels and weights) can significantly improve the model's confidence calibration and reduce disagreements. The relabeling pipeline proposes two lanes to handle the different categories of uncertain examples and strikes a balance between model relabeling and human effort.
Weaknesses:
It seems to me that by employing LLM relabeling in Lane L, the URC2 pipeline essentially trusts the LLM's relabels over the original ANLI labels, even though the authors explain that only epistemic-heavy items are routed to the model. The authors should evaluate whether this trust is actually sensible by comparing against human labels on selected samples. At the same time, this undercuts the purpose of having humans handle relabeling in Lane H: if LLMs are assumed to make successful judgments on ANLI, a capable-enough LLM should be able to handle Lane H as well. I understand that humans relabel the ambiguous cases, which is intuitively more challenging than the epistemic-heavy cases, but the paper does not provide actual evidence for this claim. Because I do not see why Lanes L and H must be handled separately, the proposed URC2 pipeline essentially reads like a "fixing ANLI annotation errors" effort, and I do not see how it would be valuable to future research. The authors should also compare against a very simple baseline that uses an LLM to relabel every instance in the ANLI dataset, as sketched below.
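To make that comparison concrete, here is a minimal sketch of the relabel-everything baseline I have in mind, with simple self-consistency voting; `llm` is a placeholder for whatever chat/completion call is available, and all names are mine rather than the authors':

```python
from collections import Counter
from typing import Callable

LABELS = ["entailment", "neutral", "contradiction"]

def relabel_with_llm(premise: str, hypothesis: str,
                     llm: Callable[[str], str], k: int = 5) -> str:
    """Query the LLM k times and majority-vote the parsed labels."""
    prompt = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Answer with exactly one word: entailment, neutral, or contradiction."
    )
    votes = []
    for _ in range(k):
        answer = llm(prompt).strip().lower()
        matched = [lab for lab in LABELS if lab in answer]
        if len(matched) == 1:          # skip unparsable or ambiguous responses
            votes.append(matched[0])
    return Counter(votes).most_common(1)[0][0] if votes else "abstain"

# Baseline: relabel *every* instance, not only the epistemic-heavy ones
# (anli_dev and llm are hypothetical objects the caller would supply):
# new_labels = [relabel_with_llm(ex["premise"], ex["hypothesis"], llm) for ex in anli_dev]
```

If such a baseline matched URC2's downstream calibration gains, the routing machinery would need a stronger justification.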
Questions:
Please see the weakness section.
Fully human-written

---
Uncertainty‑Routed Human–LLM Curation and Calibration for ANLI |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
The paper proposes URC2, a three-stage pipeline for adversarial NLI. It first decomposes per-example predictive uncertainty into aleatoric (data/label ambiguity) and epistemic (model disagreement via mutual information) components using a three-teacher ensemble (DeBERTa-v3-large, RoBERTa-large, XLM-R-large). It then routes examples by their dominant uncertainty to a two-lane relabeling workflow: a Human lane for aleatoric-heavy items (relabel, or keep-hard with down-weight/drop) and an LLM lane for epistemic-heavy cases (an instruction-tuned LLM with self-consistency). Finally, it refreshes and calibrates the teachers with the curated labels and per-example weights plus lightweight temperature scaling. On ANLI, URC2 cuts dev ECE by ~30% to 0.146 without sacrificing accuracy and collapses epistemic mutual information on curated subsets, shifting corpus-level uncertainty toward low-aleatoric/low-epistemic regions.
Contributions: (1) an operational, ensemble-based uncertainty decomposition that drives targeted supervision; (2) a human–LLM two-lane relabeling mechanism producing curated labels and instance weights; and (3) calibration with disagreement reduction, yielding better-calibrated, more reliable NLI under adversarial shift.
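As an aid to the discussion of thresholds in weakness 1 below, this is how I read the routing step; interpreting r as the aleatoric share of total uncertainty and w as the keep-hard down-weight is my own assumption, not something I verified against the paper's exact definitions:

```python
def route_example(aleatoric: float, epistemic: float, r_threshold: float = 0.35) -> str:
    """Assign one example to a curation lane by its dominant uncertainty (my reading)."""
    total = aleatoric + epistemic
    if total < 1e-8:
        return "keep"                  # confidently and consistently predicted: leave as-is
    r = aleatoric / total              # assumed meaning of the paper's r
    return "human_lane" if r >= r_threshold else "llm_lane"

# After human audit, items kept as "hard" are down-weighted rather than dropped;
# w = 0.3 below is the value quoted in the paper, whose selection weakness 1 questions.
KEEP_HARD_WEIGHT = 0.3
```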
Strengths:
1. This work introduces an uncertainty-routed curation paradigm that explicitly disentangles aleatoric uncertainty (data ambiguity) from epistemic uncertainty (model disagreement), enabling distinct treatments for each rather than collapsing them into a single undifferentiated scalar. The approach features a creative and pragmatic two-lane supervision framework: human annotators address cases of aleatoric uncertainty, while an instruction-tuned LLM with self-consistency handles instances of epistemic uncertainty. This design effectively leverages the complementary strengths of human judgment and LLM reasoning to enhance dataset quality and model reliability.
2. Comprehensive diagnostic analyses—including risk–coverage curves, per-round evaluations, corpus-level uncertainty shifts, and ∆MI collapse on curated subsets—support the claim that URC2 genuinely resolves disagreement rather than merely smoothing it. Careful evaluation on ANLI, using both calibration metrics (ECE) and accuracy before and after temperature scaling, demonstrates a ~30% reduction in ECE without any loss in accuracy.
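For concreteness on the calibration evaluation, a minimal sketch of equal-width-bin ECE and a simple grid-search temperature fit; this is illustrative only, and the authors' exact binning and optimizer may differ:

```python
import numpy as np

def ece(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """Expected calibration error with equal-width confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            total += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return total

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Pick the temperature minimizing NLL on a held-out split (coarse grid search)."""
    best_t, best_nll = 1.0, np.inf
    for t in np.linspace(0.5, 5.0, 91):
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)        # numerically stable softmax
        p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        nll = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```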
Weaknesses:
1. Multiple threshold values (e.g., r≥0.35 in line 221 and w=0.3 in line 237) are introduced without sufficient justification or explanation of their selection criteria.
2. While the focus on ANLI is reasonable, the claims would be stronger with evaluations beyond ANLI (e.g., [1]). In addition, several hyperparameters appear to be tuned on the ANLI dev split that is also used for reporting, which risks overfitting and makes the conclusions less definitive. Typically, adversarial benchmarks are held out solely for evaluation; we rarely assume access to an adversarial training set for hyperparameter selection. I recommend adding results on additional datasets (e.g., [1]) and using a strictly held-out test set or cross-dataset validation; if tuning on ANLI dev is unavoidable, include a sensitivity analysis and pre-freeze thresholds before final evaluation.
3. 1) The paper does not report the standalone performance of the LLM used in the adjudication lane (e.g., accuracy and ECE on ANLI), which is necessary to estimate the reliability of this component. 2) The workflow relies on a single LLM to adjudicate confident model disagreements; please include ablations with stronger or alternative LLMs (e.g., GPT-5, Qwen-3, Claude) and/or multi-LLM arbitration to assess whether the conclusions hold under different adjudicators. 3) The paper mentions a quantized LLM; quantization can alter calibration and increase vulnerability to adversarial attacks [2], so its effect on the adjudicator should be discussed or ablated.
4. The paper lacks a comparison with existing baselines, which is necessary to justify the effectiveness of the proposed method.
References:
[1] Liu et al., 2020. An empirical study on model-agnostic debiasing strategies for robust natural language inference.
[2] Lin et al., 2019. Defensive Quantization: When Efficiency Meets Robustness.
Suggestions:
1. In Figure 1, the label in Stage A is incorrect. It should be placed on the box labeled “Routing route by uncertainty” to accurately reflect the intended process.
2. Figures 2a and 2b should use the same color scheme for "After refresh" to maintain consistency; using different colors could confuse readers.
Moderately AI-edited |