ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 1 (25%)  | 4.00       | 4.00           | 5688               |
| Heavily AI-edited    | 2 (50%)  | 4.00       | 4.00           | 2960               |
| Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A                |
| Lightly AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Fully human-written  | 1 (25%)  | 2.00       | 4.00           | 7730               |
| Total                | 4 (100%) | 3.50       | 4.00           | 4834               |
**Characterising Overprecision in Black-Box LLMs: A Cognitive Science Inspired Framework**
Soundness: 3 (good) · Presentation: 2 (fair) · Contribution: 2 (fair) · Rating: 2 (reject) · Confidence: 4 (confident, but not absolutely certain)

**Summary.** This paper introduces a novel framework for evaluating overprecision in black-box large language models (LLMs), drawing inspiration from human studies in cognitive science. The authors define overprecision as excessive certainty in numerical interval judgments. Their method proposes a three-phase procedure: (1) generating numerical confidence intervals from LLMs with prompt engineering, (2) refining these intervals using aggregation and self-refinement strategies, and (3) evaluating them with calibration and correlation metrics. The experiments (conducted in general knowledge, medical, and financial domains) reveal that LLMs are systematically miscalibrated, showing large gaps between requested confidence and actual coverage. Furthermore, the paper finds that interval lengths do not scale with confidence levels, and refinement strategies (self-refinement and CoT) offer limited improvement.

**Strengths.** The paper's main strength lies in its well-structured framework for evaluating overprecision in black-box LLMs. The proposed three-phase method is clear, systematic, and easily reproducible. The experiments are comprehensive, covering multiple domains and providing a detailed analysis of the results. The findings that LLMs are systematically overprecise and that their confidence does not correlate with their predictions are significant and contribute to a better understanding of LLMs.

**Weaknesses.** There are a few areas that could be improved.
- The study is limited to two older models, both from OpenAI. It would be interesting to see whether the findings generalize to other model families (e.g., Llama, Claude, and Gemini), to closed- and open-source models, and to reasoning models.
- The refinement strategies explored are relatively simple, and both showed limited improvements. More sophisticated methods, such as those involving more complex reasoning or external feedback, could be explored. Although the problem of overprecision is important and novel, the proposed solutions seem underdeveloped, which limits the practical impact of the work.
- The paper focuses on numerical answers; it would be interesting to see how the framework could be adapted to other types of data, such as text or images.

**Questions.**
1. The main question I have is how the proposed framework would generalize to other LLMs beyond the two OpenAI models tested.
2. The paper shows that CoT prompting has mixed effects on calibration. Do the authors have any hypotheses about why this might be the case? CoT is known to be a simple and robust way to elicit zero-shot reasoning in LLMs and improve their reasoning capability, so it is counterintuitive that it did not improve uncertainty estimation, given that uncertainty estimation strongly involves a reasoning process. Could it be related to the complexity of the reasoning required for different tasks?
3. The self-refinement strategy did not yield improvements, and the authors suggest that this is due to a narrowing bias. Could this bias be mitigated by providing the model with more diverse examples or by explicitly instructing it to consider a wider range of possibilities?

EditLens prediction: Heavily AI-edited

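The coverage gap this review highlights (requested confidence versus actual coverage) is typically measured as the fraction of elicited intervals that contain the gold numeric answer at a given imposed confidence level. Below is a minimal sketch of that check, assuming a hypothetical `ElicitedInterval` record with already-parsed bounds; it is illustrative only, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class ElicitedInterval:
    lower: float    # lower bound returned by the model
    upper: float    # upper bound returned by the model
    truth: float    # gold numeric answer
    nominal: float  # confidence level imposed in the prompt, e.g. 0.90

def empirical_coverage(intervals: list[ElicitedInterval], nominal: float) -> float:
    """Fraction of intervals elicited at `nominal` confidence that contain the gold answer."""
    subset = [iv for iv in intervals if iv.nominal == nominal]
    if not subset:
        return float("nan")
    hits = sum(iv.lower <= iv.truth <= iv.upper for iv in subset)
    return hits / len(subset)
```

Overprecision then shows up as `empirical_coverage(data, 0.90)` landing well below 0.90.
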
**Characterising Overprecision in Black-Box LLMs: A Cognitive Science Inspired Framework**
Soundness: 3 (good) · Presentation: 3 (good) · Contribution: 3 (good) · Rating: 6 (marginally above the acceptance threshold) · Confidence: 4 (confident, but not absolutely certain)

**Summary.** This paper investigates whether LLMs can produce numerical confidence intervals that meaningfully correspond to a given confidence level (e.g., 90%). For example, when we ask an LLM to "Provide an interval that you are 90% sure contains the answer", does the interval actually contain the answer 90% of the time? To test this, the authors propose a three-phase framework (generation → refinement → evaluation) for eliciting and assessing such intervals under imposed confidence levels. This setup is repeated across confidence levels (60%, 70%, 80%, 90%, 95%) and applied to datasets spanning financial reasoning (FinQA), medical QA, and general knowledge tasks. The study evaluates two black-box models (GPT-3.5-turbo and GPT-4o-mini) and introduces two complementary metrics, Deviation Score (DS) and Interval Length Score (ILS), to characterize calibration behaviour beyond simple coverage rates.

Findings:
1. Across all models and domains, empirical coverage (hit rate) is consistently below the stated confidence, indicating strong overprecision.
2. Interval widths show little to no correlation with confidence levels, suggesting that LLMs fail to internalize the concept of confidence intervals.
3. Calibration quality varies with domain, numerical scale, and prompt formulation, with larger deviations observed in financial and medical tasks.
4. Limited benefit of refinement: simple union-based aggregation trivially improves coverage by widening intervals, whereas self-refinement tends to narrow intervals and further degrade calibration, revealing a systematic narrowing bias.

**Strengths.**
1. The problem itself is interesting, and the evaluation framework supports this motivation.
2. Metric improvements: DS and ILS meaningfully extend the analysis beyond raw coverage.
3. The consistent narrowing-bias finding is an interesting empirical observation.
4. The writing, tables, and figures are polished and easy to follow.

**Weaknesses.**
1. Beyond posing the problem (i.e., evaluating whether LLMs understand confidence intervals), the paper mostly refines and formalizes an existing evaluation setup rather than introducing new conceptual or methodological ideas.
2. Restricted scope: only GPT-series models are tested; there is no comparison to open-source or white-box methods.
3. The cognitive framing is more decorative than explanatory; there is little theoretical connection explaining why overprecision occurs or how to mitigate it.

**Questions.**
1. Do you see any connection between your framework and conformal prediction?
2. Is the "narrowing bias" prompt-dependent or model-dependent? Have you tried explicitly asking the model to avoid it?
3. What would be required to make this framework predictive (i.e., useful for detecting unreliable outputs rather than just describing behaviour)?

EditLens prediction: Heavily AI-edited

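On question 1, a natural point of comparison is split conformal prediction, which turns a point predictor plus a held-out calibration set into intervals with a marginal coverage guarantee, something prompted intervals lack. The sketch below is a generic illustration under that assumption, not a method from the paper:

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_truths, test_pred, confidence=0.9):
    """Interval around a point prediction whose half-width is a quantile of
    absolute residuals on a calibration set (split conformal prediction)."""
    residuals = np.abs(np.asarray(cal_truths, float) - np.asarray(cal_preds, float))
    n = len(residuals)
    # Finite-sample-corrected quantile level, capped at 1.0
    level = min(1.0, np.ceil((n + 1) * confidence) / n)
    q = np.quantile(residuals, level, method="higher")
    return test_pred - q, test_pred + q
```

The contrast is that here the interval width is set by calibration residuals rather than by the model's self-report, which is why coverage holds (marginally) at the stated level.
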
**Characterising Overprecision in Black-Box LLMs: A Cognitive Science Inspired Framework**
Soundness: 2 (fair) · Presentation: 2 (fair) · Contribution: 2 (fair) · Rating: 4 (marginally below the acceptance threshold) · Confidence: 4 (confident, but not absolutely certain)

**Summary.** The paper adapts the cognitive-science notion of overprecision (excessive certainty in interval estimates) to study LLMs. The authors propose a three-phase generation-refinement-evaluation framework:
1. Generation: models generate numeric confidence intervals at imposed confidence levels.
2. Refinement: intervals are aggregated or self-refined.
3. Evaluation: empirical coverage and the correlation between interval width and imposed confidence are assessed.

Experiments with GPT-3.5-turbo and GPT-4o-mini across FinQA, MedQA, and MMLU show that LLMs are "overprecise": intervals are too narrow, coverage is far below nominal confidence (e.g., only ~20% coverage for 95% intervals on FinQA), and self-refinement tends to make intervals narrower rather than better calibrated. The paper aims for descriptive analysis rather than optimization, claiming novelty in introducing a black-box, cognitive-science-inspired protocol for overprecision measurement.

**Strengths.**
- Conceptually interesting framing: connecting cognitive-science constructs (overprecision, interval judgments) with LLM calibration is a fresh angle, distinct from common probabilistic calibration or verbal confidence studies.
- Transparent methodology: the generation/refinement/evaluation pipeline is clear and easily reproducible from the prompts listed in the Appendix.
- Data variety: the inclusion of financial, medical, and general-knowledge datasets provides some cross-domain coverage.
- Negative results are valuable: showing that self-refinement and CoT do not necessarily improve calibration challenges prevailing assumptions in the confidence-elicitation literature.

**Weaknesses.**
1. Conceptual contribution is limited and largely descriptive. While the cognitive-science framing is novel, the work provides no theoretical development beyond restating the overprecision paradigm. The method is a direct adaptation of human interval-judgment tasks, not an original LLM methodology. The study yields descriptive statistics (hit@c, correlations, DS/ILS) but no actionable insights for modeling uncertainty or improving calibration. The authors repeatedly stress that this is "not a benchmark or optimization study", but this stance limits the scientific value: the paper ends up confirming an already known fact, namely that LLMs are miscalibrated, without explaining why or how to fix it.
2. Methodological limitations undermine interpretability. Lack of internal validation: imposing a nominal 90% confidence and measuring empirical coverage is straightforward, but the protocol mixes sources of variation (model stochasticity, numeric reasoning errors, parsing errors) that confound true overprecision effects. Small model/sample scope: only two OpenAI models, one temperature, and narrow numeric subsets of three datasets; this makes conclusions about "LLMs" overgeneralized. Data filtering biases: converting multiple-choice medical questions into single numeric answers strips semantics and may distort task difficulty. Metrics are redundant: "Deviation Score" and "Interval Length Score" are simple normalized distances and widths; they do not meaningfully add insight beyond coverage.
3. Refinement experiments are weakly justified. The aggregation and self-refinement procedures are ad hoc and disconnected from cognitive-science theory. "Union" trivially improves coverage by widening intervals, which is hardly evidence of cognitive correction. The "self-refinement" mechanism reveals that models overwhelmingly pick the narrowest interval, but the paper does not probe why this occurs or whether prompt wording causes it. The claim that self-reflection mimics peer-judgment correction (Haran et al. 2010) is superficial and empirically unsupported.
4. Statistical rigor and presentation issues. No statistical significance tests, confidence intervals, or effect sizes are reported; only tiny decimal differences (±0.2%) are given with three decimals in Tables 2–4 (pp. 6–7), which exaggerates precision. Figures 2 and 3 (pp. 8 and 16) are visually cluttered and fail to convey new insights. Some tables (e.g., Table 2) misinterpret correlation magnitudes < 0.01 as meaningful; these are essentially zero. The sample size after filtering is unclear; some datasets drop to only ~1–3k examples (Table 1, p. 5).
5. Limited originality relative to prior work. The paper positions itself as the first to study overprecision in LLMs, but prior studies on numerical calibration, interval estimation, and uncertainty (e.g., Xiong et al. 2024; Wen et al. 2024; Shrivastava et al. 2023) already evaluated similar ideas with probability or interval formats. The difference here is largely terminological (borrowing "overprecision" from psychology) rather than methodological.
6. Weak insight and discussion. Section 6 (p. 9) summarizes that miscalibration persists and that CoT/self-refinement give mixed results, conclusions that add little beyond the previous literature. No deeper analysis (e.g., linguistic factors, reasoning depth, or token-level uncertainty) is attempted. The discussion reads as observational rather than explanatory. Figures 5–6 (pp. 18–19) merely restate known dataset difficulty orderings.

**Questions.**
- How do you ensure parsing correctness of numeric intervals in model outputs? Could failures inflate apparent miscalibration?
- How many prompts per sample were used, and how sensitive are the results to temperature or phrasing?
- Could the "overprecision" pattern simply reflect model under-dispersion due to deterministic decoding, rather than a cognitive-style bias?
- What is the advantage of interval elicitation over directly sampling numeric uncertainty (e.g., via logits or surrogate ensembles)?
- Can the protocol scale to non-numeric tasks, or is it limited to trivial numeric Q&A?

EditLens prediction: Fully AI-generated

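Regarding the missing statistical rigor (weakness 4), one lightweight fix would be to report an exact binomial confidence interval around each empirical hit rate. A minimal sketch using SciPy's beta distribution (hypothetical usage, not from the paper):

```python
from scipy.stats import beta

def clopper_pearson(hits: int, n: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) 100*(1-alpha)% confidence interval for a
    binomial proportion such as an empirical coverage rate."""
    lower = beta.ppf(alpha / 2, hits, n - hits + 1) if hits > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, hits + 1, n - hits) if hits < n else 1.0
    return lower, upper
```

For example, 200 hits out of 1000 intervals requested at 95% confidence gives an interval of roughly (0.18, 0.23), making the gap to 0.95 unambiguous.
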
**Characterising Overprecision in Black-Box LLMs: A Cognitive Science Inspired Framework**
Soundness: 2 (fair) · Presentation: 2 (fair) · Contribution: 2 (fair) · Rating: 2 (reject) · Confidence: 4 (confident, but not absolutely certain)

**Summary**: This work proposes to investigate a model's calibration in terms of *overprecision*, i.e., whether the model conveys excessive certainty in its estimates. To this end, and focusing on numerical-output tasks, the paper investigates whether LLMs are able to adjust a numerical interval based on a fixed confidence level (e.g., "Provide an interval that you are $c$% sure contains the answer"). Experiments are conducted using gpt-3.5-turbo and gpt-4o-mini on four datasets (MMLU, FinQA, MedQA, and MedMCQA).

**Strengths**:
1. Novel perspective on examining uncertainty quantification, merging interesting concepts from cognitive-science frameworks;
2. Experiments cover different domains, including both general knowledge (i.e., MMLU) and more domain-expertise-focused datasets (i.e., MedQA, MedMCQA, FinQA).

**Weaknesses**:
- W1. **Motivation for the need to study overprecision in LLMs is insufficient**: the paper mentions that studying "overprecision in black-box LLMs is crucial" (lines 43-44) but does not say why or what the implications are for the field.
- W2. **Some statements (and claims) in the paper do not seem to be supported or well motivated**, raising questions about the soundness of this work (see the Questions section below).
- W3. While providing a novel dimension to miscalibration, this cognitive-science-inspired framework is limited to short numerical answers. It is unclear how such results would generalize to open-ended generation.
- W4. The evaluation focuses on a single model family, offering limited insight about generalization to other models. The evaluated models (GPT-3.5 and GPT-4) employ multi-digit tokenization (e.g., numbers in [0, 100] are represented using 101 different tokens). However, more recent models (e.g., Gemma 2, Llama 3, OLMo) adopt single-digit tokenization. These may exhibit different biases ([Singh et al 2024](https://arxiv.org/abs/2402.14903)), so it could be insightful to add experiments with different model families and across model sizes.

**Questions**:
1. Lines 100-101 refer to limitations of self-reported confidence, including variability with prompt wording, sampling randomness, and linguistic biases, because of which it may be an unreliable measure of true model uncertainty. Is there empirical evidence that this is the case? Experimental results or an adequate citation should be provided to ground such claims.
   a. Similarly, in lines 59-60, the authors mention limitations of verbalized self-reports, stating that existing methods do not ensure that stated probabilities correspond to empirical frequencies. It would be great if experimental results or relevant citations were added to back these arguments.
   b. Related to the previous comment, other peer-reviewed work ([Xu et al 2024](https://aclanthology.org/2024.emnlp-main.343), [Lyu et al 2025](https://ojs.aaai.org/index.php/AAAI/article/view/34120)) seems relevant to this discussion, as it proposes to calibrate verbalized confidence scores to empirical frequencies. I wonder how this affects the paper's argument, since these papers provide a way of generating self-reported scores that are aligned with empirical frequencies.
   c. Line 212 sheds light on how focusing on numerical outputs helps reduce the influence of linguistic biases such as positivity bias. But other biases may still be present, such as biases in generating numbers ([Shao et al 2025](https://openreview.net/forum?id=AOe1aUhEQQ)).
2. The paper mentions that "to mitigate these issues", the confidence specification is shifted to the prompt (lines 101-102) by imposing explicit confidence levels and evaluating whether intervals align. However, it is unclear to me how this addresses the previously mentioned limitations of verbalized confidence (sampling randomness, wording sensitivity, and linguistic biases). If I understand correctly, none of the experiments in the paper (or appendix) provides support for the claim that specifying confidence in the prompt leads to more robust results. Perhaps the authors can help clarify any misunderstanding I may have.
3. In Section 4.1 the lowest confidence value considered for the generation phase is 60%. Is there a reason why lower values (e.g., 20%, 40%, 50%) were not used for evaluation as well?
4. There is an assumption that during Phase 1 (Generation) the LLMs always generate an interval. Was this empirically validated? How often did the LLM generate an answer that was not an interval? How do you ensure a consistent output format?
   a. Can you specify the generation configurations for Phase 1 (Generation)? The configurations are mentioned for Phase 2 (Refinement) but I could not find them for Phase 1.
5. Interpretation of results and metric choice: results in Table 2 seem to be constant irrespective of the prompted confidence level. My understanding of the hit@$c$ metric is that it is a 0-1 metric considering only whether the true value lies in the generated interval. However, it does not indicate whether the intervals are systematically to the left or to the right of the desired confidence intervals. Can the authors share some insights about this?
   a. Such an analysis could help provide additional evidence to support the claim in Section 5.2 that "LLMs [...] remain insensitive to confidence cues" (lines 360-361), especially given that the direction of deviation is not currently accounted for by any metric and could potentially provide useful signal.
6. The claim that "results show a widespread miscalibration (overprecision) across datasets and models" in the caption of Table 2 (lines 294-295) does not seem to fully capture the observed patterns. If I understand correctly, models are actually miscalibrated (underprecise) for confidence values of 95% and 90% but overprecise for confidence values of 60%, 70%, and 80%.

**Clarity**:
- It was not clear to me what "refinement strategies" meant when reading through the introduction. It may be worth clarifying that.
- Can you clarify which aggregation (or refinement strategy) was used to report the values in Table 2?
- Are the standard deviations reported in Table 2 expressed in the same unit as the mean? They appear to be very small compared to the absolute value of the hit@c metric.
- How is performance reported in Table 4? Consider adding such information to the caption.

**Formatting**:
- The wrong citation format is used throughout the paper.
- Table 1 format: missing top and bottom rules.
- Figure 2 captions are difficult to read. Consider adding whitespace around the "|" character.
- Figure 2's legend font size is too small and difficult to read.

**Missing citations**:
- Section 2.2.1 is missing a citation to the work of [Lin et al 2022](https://openreview.net/forum?id=8s8K2UZGTZ), which is one of the first methods exploring the use of words to express uncertainty.

**Typos**:
- Line 26: "Refinement" → "refinement"
- Line 140: "$(q_i, a_i)_i$" → "$\{(q_i, a_i)\}_{i=1}^N$" is more commonly used as the notation for a set of question-answer pairs
- Line 140: "$qi$" → "$q_i$"
- Line 322: "pp" → "percentage points (pp)"
- Line 323: The expression "are much too narrow" sounds ungrammatical in this context.

**Suggestions**:
- As a subjective preference, it would be more appealing if lines 27-30 could motivate the importance of this study for the field, as opposed to describing it as a descriptive study, i.e., how can this analysis or the findings in this paper impact the field?
- Add citations (whenever possible) for the metrics used in the study. For instance, when mentioning how "both measures follow established practice in cognitive psychology studies of overprecision", it could be useful to add citations to said work in cognitive science.

EditLens prediction: Fully human-written

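On question 4 above (whether the model always returns a well-formed interval), one way to quantify this is to parse outputs with a permissive pattern and report the parse-failure rate alongside hit@c. The regex and helpers below are a hypothetical illustration, not the authors' implementation:

```python
import re

# Permissive pattern for forms like "[3.2, 5.8]", "3.2 - 5.8", or "3.2 to 5.8"
_INTERVAL_RE = re.compile(r"(-?\d+(?:\.\d+)?)\s*(?:,|-|to)\s*(-?\d+(?:\.\d+)?)")

def parse_interval(text: str):
    """Return (lower, upper) if an interval can be extracted from the output, else None."""
    match = _INTERVAL_RE.search(text)
    if match is None:
        return None
    lo, hi = sorted(float(g) for g in match.groups())
    return lo, hi

def parse_failure_rate(outputs: list[str]) -> float:
    """Share of model outputs from which no interval could be recovered."""
    if not outputs:
        return 0.0
    return sum(parse_interval(o) is None for o in outputs) / len(outputs)
```

Reporting this rate would show directly whether malformed outputs, rather than genuine overprecision, drive part of the measured miscalibration.
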