ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 0 (0%) N/A N/A N/A
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 0 (0%) N/A N/A N/A
Lightly AI-edited 0 (0%) N/A N/A N/A
Fully human-written 3 (100%) 6.00 2.67 3222
Total 3 (100%) 6.00 2.67 3222
Critical Confabulation: Can LLMs Hallucinate for Social Good?

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
EditLens Prediction: Fully human-written

This paper uses LLMs' predictions in a narrative cloze task to show that LLMs can plausibly speculate about missing information on historical figures. The authors relate this use of LLMs to "critical fabulation", a practice in African American studies in which scholars repair missing information in archives through storytelling.

This paper is quite unlike most of the ICLR papers I've seen in recent years, especially in its use of historical archival data and its engagement with African American Studies. I also appreciate the authors' thoughtfulness around experimental decisions.

Though this work is a rare interdisciplinary blend of technical and substantive work, I worry that the way the authors explain their methodology and results may be a barrier to this work actually being useful for "social good". That is, true "social good" should make research accessible and communicable to the communities associated with the datasets involved; otherwise, such work is at risk of being extractive. Even as someone with a technical background, I found parts of the paper that I had to read very, very carefully to understand. The authors do not necessarily have to decrease the technicality of their work, but they should reconsider how they communicate and illustrate their methods and key takeaways.

The authors use a lot of paper space to justify why OLMo-2 is the model they focus their analysis on, and why auditing for data contamination is important, only to find that memorization might not be an issue because OLMo-2's performance isn't that different from its unaudited peers. I wonder if the structure/narrative of this paper could be reorganized so that the amount of attention dedicated to certain concepts is more correlated with the potential impact of the findings. The main paper is very dense already; some things could be deprioritized into the appendix. In addition, the beginning of this paper introduces the interesting idea of "critical confabulation" and its implications, but the paper ends abruptly without reconsidering how some of these earlier themes relate to the model-evaluation-based findings. On a related note, I am curious what differentiates a "critical confabulation" from simply a "confabulation": how do you know that what models are predicting is in any way "critical"?

This paper, with its "social good" framing, repeatedly suggests that their use case of LLMs is desirable for humanities scholars. I totally understand why critical fabulation is useful for scholars. However, I remain unconvinced as to why humanities scholars would actually want this process to be LLM-ified, rather than relying on their own expertise and lived experiences. Is it because an LLM could be useful for proposing a range of plausible ideas? I worry that you are making a tool just to see if you can, rather than fulfilling a sincere need. A minor note: the title of this paper is too vague given the paper's scope, and may not help it reach relevant audiences.

Could you provide some concrete examples of known unknowns and unknown unknowns? How are models' confabulations considered "evidence-bound" (line 90, line 105, line 163) if we don't know what kind of evidence they are using for their predictions? Are models really evidence-bound if they are making up information that may not exist in their pretraining data?

The text in Figure 1 is too small to read unless one zooms in a lot in a digital PDF viewer; the same goes for the axis labels in Figure 2. In Figure 1, it's unclear what the text saying "[MASK]" is doing in the figure. The width of Table 2 exceeds the margins of the page, and the caption of this table could benefit from a one-sentence takeaway for readers who skim the paper. Line 97: the citation for Chambers & Jurafsky is in the incorrect format (\citep{} used instead of \citet{}, perhaps?). Line 253: how did you know whether a document was in English or not? Line 350: what made the three annotators qualified to do this annotation task, and who were these annotators? The same question applies to the annotator in line 416.

Do you have to use "hallucinations" repeatedly to describe LLM behavior in your use case? Though this word is now very common and normalized in AI research, it still anthropomorphizes LMs and diminishes the word's origin as a serious symptom of human mental illness. It is jarring to see this word upheld so prominently in a "social good"-related paper.
Critical Confabulation: Can LLMs Hallucinate for Social Good?

Soundness: 4: excellent
Presentation: 2: fair
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
EditLens Prediction: Fully human-written

The authors propose critical confabulation: using constrained LLM "confabulations" to fill archival gaps in ways informed by humanities practice (Hartman's critical fabulation). They operationalize this as a narrative cloze task over a Black-writing archival collection (BWTC): mask events in a character timeline and ask models to generate plausible, evidence-bound fill-ins. They evaluate a range of audited open models (the OLMo-2 family) and baselines (Gemma, GPT-4o, etc.) under prompting conditions intended to elicit controlled hallucination. Automatic metrics (cosine similarity, retrieval overlap, etc.) and some qualitative examples are used to argue that constrained confabulation can produce plausible, useful candidate narratives.

This paper raises a very interesting and socially meaningful problem, and I love to see how people connect LLM capabilities to social good in unexpected ways. In order to do that, the authors really have to make sure that the task they choose for hallucination is not contaminated; much of the heavy lifting in this paper goes into finding that task and data, which the authors have accomplished with a careful treatment of data contamination (Section 3.2). I also appreciate that the authors have put great effort into curating the dataset and into diverse ablations with different prompting strategies, and that the paper even tests and analyzes comparisons with unaudited models.

I think the paper would benefit from more qualitative examples and analyses, such as showing the failure modes of LLMs and analyzing the hard cases that most models are unable to get right. In addition, the presentation can be improved. For instance, many figures in the paper have very small fonts and are hard to read. Figure 1 could also be denser and displayed with a larger font, as I had to zoom in on a laptop to figure out what is being displayed.

Have you thought about changing the temperature of the LLMs to a nonzero value? Can you set seeds for the LLMs to still control reproducibility?
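To make the reproducibility question above concrete, here is a minimal sketch, assuming a Hugging Face OLMo-2 checkpoint and a hypothetical cloze prompt (not the authors' actual pipeline), of how nonzero-temperature sampling can still be made repeatable by fixing the random seed:

    # Minimal sketch (not the authors' code): seeded sampling at nonzero temperature
    # so that stochastic generations stay reproducible on the same hardware.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "allenai/OLMo-2-1124-7B"  # assumed OLMo-2 checkpoint, for illustration only
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "In 1867, the subject of this timeline [MASK]."  # hypothetical masked-event prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    torch.manual_seed(42)  # fixing the seed makes the sampled continuation repeatable
    output_ids = model.generate(
        **inputs,
        do_sample=True,      # sampling instead of greedy decoding (temperature > 0)
        temperature=0.7,
        top_p=0.95,
        max_new_tokens=64,
    )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Re-running with the same seed on the same hardware reproduces the sampled output, while varying the seed yields a distribution of confabulations rather than a single greedy completion.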
Critical Confabulation: Can LLMs Hallucinate for Social Good?

Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
EditLens Prediction: Fully human-written

The paper presents an LLM-based methodology for inferring historical narratives, given some historical grounding information. First, the paper establishes the epistemological roots of critical fabulation to situate the utility of the method. The paper then details the dataset used and how that dataset was cleaned to prevent contamination with the models' training data. Then, the paper tests how well LLMs can perform a confabulation task of reconstructing narratives, finding that while LLMs are performant at this task, there is still a lot of room for improvement for future work.

The work is very novel and its empirical grounding is of very good quality. For novelty, the paper applies LLMs to a novel domain, namely the study of history; I am not aware of many applications of AI or machine learning to this domain. Additionally, there are also few works that exploit the hallucination/confabulation ability for academic pursuits. Many academic pursuits are grounded in facts and substantiated claims, so the use of the creative part of an LLM in the academic domain is both counterintuitive and novel. For quality, the data-cleaning procedures are very thorough and detailed, and the paper does a good job of controlling for things like dataset contamination. Additionally, the paper considers several variations in prompts and in the masking of events to fully understand the performance contours of LLMs on the narrative cloze task.

The paper could benefit from some clarity enhancements. For example, I have some trouble picturing the differences between the validation cloze tasks; the paper would benefit from concrete examples of the full timeline → partial cloze setup → n-gram cloze setup (maybe in an appendix). Additionally, it's not clear why some of the thresholds are what they are. For example, when extracting name candidates, why do you take the top 10,000 persons, and why is a frequency of less than 51 the cutoff to keep?

Beyond these clarity enhancements, I think the method would be better if it came with some sense of certainty in the predicted events. As this research shows, this is a challenging task, and previous research has found that humans tend to overestimate AI's capabilities. When these two things combine in a historical task where the confabulations are attempting to stand in for historical truth, I worry the method could result in bad or even harmful confabulations that distort the historical picture for humans more than the current, partial view of history already does for unseen peoples and classes of people. Thus, I would surmise that estimates of confidence could help humans better judge whether the confabulations are good or not (one possible scoring approach is sketched after the questions below).

1. For curating the list of unseen names, how did you disambiguate between different personalities with the same name?
2. Where do the event types come from?
3. How did you deal with differences in time scales within a narrative trajectory (e.g., some events might cluster tightly together with long gaps of time between event clusters)?
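As a sketch of the confidence estimate suggested above, one option (not the paper's method; the encoder choice and example candidates are assumptions) is to sample several candidate fill-ins per masked event and use their mutual agreement, e.g. mean pairwise cosine similarity of sentence embeddings, as a rough confidence proxy:

    # Illustrative sketch: self-agreement among sampled fill-ins as a rough confidence proxy.
    # The encoder and the hypothetical candidates below are assumptions, not from the paper.
    from itertools import combinations
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    def agreement_confidence(candidates: list[str], encoder: SentenceTransformer) -> float:
        """Mean pairwise cosine similarity of candidate fill-ins (higher = more agreement)."""
        embeddings = encoder.encode(candidates, convert_to_tensor=True)
        pairs = list(combinations(range(len(candidates)), 2))
        sims = [float(cos_sim(embeddings[i], embeddings[j])) for i, j in pairs]
        return sum(sims) / len(sims)

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    # Hypothetical candidates sampled for one masked event in a character timeline:
    candidates = [
        "She left the household and traveled north to Philadelphia.",
        "She escaped northward, eventually reaching Philadelphia.",
        "She was hired out to a neighboring household for the season.",
    ]
    print(f"agreement-based confidence: {agreement_confidence(candidates, encoder):.2f}")

Low agreement would flag masked events where the model's confabulations diverge widely and deserve extra scrutiny from human experts; it is only a proxy and does not by itself certify historical plausibility.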