ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction     Count      Avg Rating   Avg Confidence   Avg Length (chars)
Fully AI-generated      0 (0%)     N/A          N/A              N/A
Heavily AI-edited       0 (0%)     N/A          N/A              N/A
Moderately AI-edited    0 (0%)     N/A          N/A              N/A
Lightly AI-edited       2 (67%)    6.00         3.50             2792
Fully human-written     1 (33%)    4.00         4.00             3343
Total                   3 (100%)   5.33         3.67             2975
Review 1: Interactive Agents to Overcome Underspecificity in Software Engineering

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper investigates how large language model (LLM) agents handle underspecified instructions, where bug reports or feature requests omit critical details. Building on the SWE-Bench Verified dataset, the authors study three settings: (1) Full, where complete issue descriptions are given; (2) Hidden, where GPT-4o-generated summaries remove key information; and (3) Interaction, where, under the Hidden setting, agents can query a simulated user (GPT-4o) that has the full task details. The study evaluates four models across three capabilities: detecting missing information, asking clarification questions, and leveraging answers to solve tasks. Results show that interaction boosts success rates by up to 15.4% over the non-interactive Hidden setting, though performance still lags behind fully specified inputs. Most models fail to detect underspecificity without explicit prompting, with only Claude Sonnet 3.5 achieving 84% accuracy under moderate encouragement; Claude and DeepSeek ask more specific, information-rich questions, while Llama tends to pose generic ones.

Strengths:
1. The paper explores a critical issue with current LLMs: they typically cannot recognize underspecificity in a user query.
2. The paper clearly defines underspecificity as "missing information that prevents an expert from producing a correct fix," grounding it in the SWE-Bench Verified rubric rather than in vague notions of ambiguity.
3. The study divides performance into three measurable capabilities, 1) detecting underspecificity, 2) asking targeted questions, and 3) leveraging responses, which enables more diagnostic evaluation than a single overall score.
4. Resolve-rate improvements are validated with Wilcoxon signed-rank tests to confirm significance across models.

Weaknesses:
1. The paper admits that naturally underspecified GitHub issues often still contain concrete technical cues (error messages, file references, conversational fragments), whereas the generated summaries mainly remove details. This may exaggerate the severity of underspecificity and bias the task toward "missing vital context" rather than "ambiguous intent."
2. The paper mentions using the OpenHands agent environment but gives minimal explanation of how the agent framework is structured or how its components (planning, editing, execution, and interaction) collaborate.
3. In Section 3.3, the discussion of navigational gains lacks clarity: what counts as navigational information, why it matters for task success, and whether such behavior mirrors how real developers seek information.
4. The simulated user based on GPT-4o is insufficiently validated; the paper provides no evidence that GPT-4o's responses align with real-user behaviors or communication patterns. There is also no analysis of the correctness of the responses to the clarification questions generated in the interaction setting.
5. In Section 5, information gain is defined as 1 - cosine similarity between the summarized task and the cumulative knowledge after interaction. This metric may overestimate improvement, as asking many irrelevant questions could still yield a high score. Would it be more accurate to compute the score between the fully specified knowledge and the cumulative knowledge after interaction? (A sketch of both variants follows this review.)

Questions:
1. How is the agent framework structured?
2. What exactly constitutes navigational information, why is it important for task success, and does this behavior reflect how real developers seek information?
3. How well does the simulated GPT-4o user align with real-user behaviors and response patterns, and how accurate are its answers to clarification questions during interaction?
4. Would it be more appropriate to compute information gain between the fully specified knowledge and the cumulative knowledge after interaction, rather than using the summarized task as the reference?

EditLens Prediction: Lightly AI-edited
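To make the two reference choices in Weakness 5 and Question 4 concrete, below is a minimal sketch of the information-gain computation under each; the embedding model, variable names, and example texts are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the paper's implementation) of information gain as
# 1 - cosine similarity, computed against two possible reference texts.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical encoder choice; the paper may use a different embedding model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def info_gain(reference: str, cumulative_knowledge: str) -> float:
    """Return 1 - cosine similarity between a reference text and the
    agent's cumulative knowledge after interaction."""
    ref_vec, cum_vec = encoder.encode([reference, cumulative_knowledge])
    return 1.0 - float(cosine_similarity([ref_vec], [cum_vec])[0, 0])

# Illustrative placeholder texts.
summarized_task = "Fix the date parsing bug in the utils module."
full_issue = ("Dates in ISO 8601 format with timezone offsets are parsed "
              "incorrectly in utils/dates.py; parse_date drops the offset.")
knowledge_after_interaction = (
    summarized_task +
    " Clarification: the bug is in parse_date in utils/dates.py; "
    "timezone offsets are dropped for ISO 8601 inputs.")

# Definition as the review describes it: reference = summarized (hidden) task.
# Any added text, relevant or not, raises this score.
print(info_gain(summarized_task, knowledge_after_interaction))

# Reviewer's proposed alternative: compare against the fully specified issue
# instead, so irrelevant questions do not inflate the measurement.
print(info_gain(full_issue, knowledge_after_interaction))
```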
Review 2: Interactive Agents to Overcome Underspecificity in Software Engineering

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper evaluates how well LLM-based agents handle underspecified instructions in software engineering tasks. Using SWE-Bench Verified as a foundation, the authors create synthetic underspecified versions of GitHub issues and test whether models can (1) detect missing information, (2) ask effective clarifying questions, and (3) leverage interactions to solve tasks. They evaluate four models (Claude Sonnet 3.5, Claude Haiku 3.5, Llama 3.1 70B, DeepSeek-v2) across three settings: fully specified issues, underspecified issues without interaction, and underspecified issues with a simulated user proxy.

Strengths:
1. The paper addresses a practically important problem. Real-world task descriptions are often incomplete, and understanding how agents handle this is valuable.
2. The experimental design is generally rigorous.

Weaknesses:
1. The most significant weakness is the lack of human validation for the synthetic underspecified issues. The authors use GPT-4o to generate summaries but provide no evidence that these summaries would actually prevent human experts from solving the tasks. Are the findings representative of real underspecification?
2. The classification of missing information into only "informational" and "navigational" details is overly simplistic. The authors mention "multiple, interdependent gaps" in real tasks but do not provide a formal taxonomy or analyze which types of underspecification are most challenging.

Questions:
1. You cite several papers on ambiguity resolution and clarification questions, but do not compare against them. Do you think those methods can be applied to the datasets for comparison?
2. Interactive approaches require multiple API calls, longer execution times, and user attention. Can you provide a cost analysis? (A rough per-instance cost sketch follows this review.)

EditLens Prediction: Lightly AI-edited
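As a concrete starting point for the cost question above, here is a back-of-the-envelope sketch; the turn counts, token counts, and per-token prices are placeholder values, not measurements from the paper or any provider's actual pricing.

```python
# Back-of-the-envelope estimate of the extra API cost of an interactive
# episode; all numbers below are placeholders for illustration only.

def interaction_cost(n_turns: int, prompt_tokens_per_turn: int,
                     completion_tokens_per_turn: int,
                     usd_per_1k_prompt: float,
                     usd_per_1k_completion: float) -> float:
    """Estimate the API cost (USD) added by clarification turns."""
    prompt_cost = n_turns * prompt_tokens_per_turn * usd_per_1k_prompt / 1000
    completion_cost = (n_turns * completion_tokens_per_turn
                       * usd_per_1k_completion / 1000)
    return prompt_cost + completion_cost

# Example: 5 clarification turns, 3k prompt and 500 completion tokens per turn,
# at placeholder prices of $0.003/1k prompt and $0.015/1k completion tokens.
print(f"${interaction_cost(5, 3000, 500, 0.003, 0.015):.3f} per instance")
```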
Review 3: Interactive Agents to Overcome Underspecificity in Software Engineering

Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper asks how well agents perform when faced with an underspecified or ambiguous question, where the ideal solution and user experience would require eliciting additional information from the user. The paper repurposes an existing benchmark, SWE-bench, and turns it into interactive / hidden-information benchmarks by rephrasing and processing the inputs with LLMs. The paper asks three well-formulated research questions, which are in turn answered.

Strengths:
1. In my view, RQ1 and RQ2 are really interesting and well-scoped. I appreciate the experimental design, which attempts to build comparable evaluations across the "Hidden", "Interaction", and "Full" settings. The primary results in Figure 3 are very interesting and potentially very influential in the field of software-engineering evaluations. However, I would really like to see a more comprehensive set of models here.
2. The construction of the dataset is well described, and I appreciate the careful analysis in Section 2.1 that elicits qualitative differences between human-written underspecified problem statements and the synthetically rewritten issues.

Weaknesses:
1. Outdated models: both the proprietary model (Claude 3.5) and the open-source models (DeepSeek-v2 and Llama 3.1 70B) are unfortunately rather behind the state of the art, given the rate of progress in recent months. For Claude models alone, we have since seen Sonnet 3.7, Sonnet 4, and Sonnet 4.5. For this paper to be relevant for a conference presentation, I fear it would require updated results from more up-to-date frontier models. This would also make the claims about open-weight vs. proprietary models more relevant. In fact, I am intrigued whether the kind of "interaction" gap reported in the paper still remains with the current generation of models.
2. I find the results and methodology for RQ3 ("Can the LLM generate meaningful and targeted clarification questions?") slightly underwhelming, since the main message seems to be that Llama 3.1 70B is not very good at asking clarification questions. The evaluation metrics do not seem able to reveal fine-grained differences in question-asking abilities. The qualitative analysis is nice, but in my opinion it lacks comprehensive documentation of full problem statements and agent traces in the appendix.
3. The paper relies quite heavily on SWE-bench for the original problem statements, a benchmark that is heavily targeted by model creators. An analysis of how the results would transfer to more niche software-engineering settings would be very interesting.

Questions:
> We did not evaluate on naturally underspecified SWE-Bench examples because they lack the paired ground truth (complete specifications) necessary for causal measurement of interaction impact.

Would it be possible to infer the complete specification from the known `gold_patch` and `test_patch`? How would this approach compare to the approach chosen in the paper? (A sketch of this idea follows the review.)

> However, there are naturally occurring underspecified issues that are similarly vague as well (django_django-13952, django_django-15744, pytest-dev_pytest-7283, sphinx-doc_sphinx-9467, sympy_sympy-12977 are some specific examples)

I would encourage the authors to feature and discuss these examples in the appendix, and simply refer to the appendix here. This is a rather distracting amount of information for page 3.

EditLens Prediction: Fully human-written
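To make the reviewer's suggestion concrete, below is a minimal sketch of reconstructing a complete specification from the gold and test patches. It assumes the HuggingFace release of SWE-bench Verified; the field names, prompt wording, and the abstract `generate` helper are assumptions for illustration, not the paper's method.

```python
# Sketch of the reviewer's suggestion: infer a fully specified issue from the
# gold patch and test patch rather than relying only on the original text.
# Assumes the HuggingFace SWE-bench Verified release; field names may differ.
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
example = swebench[0]

reconstruction_prompt = f"""You are given the original issue text, the gold
patch that fixed it, and the tests that verify the fix. Write a fully
specified issue description that an expert could resolve without further
clarification.

Original issue:
{example["problem_statement"]}

Gold patch:
{example["patch"]}

Test patch:
{example["test_patch"]}
"""

# `generate` stands in for whatever LLM client would be used; it is not a
# real API and is left abstract here.
# full_specification = generate(reconstruction_prompt)
```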