From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper builds a benchmark called EHR-ChatQA to test how well LLM agents can answer questions over EHR databases in a real conversation rather than from single direct questions, focusing on cases where the user starts vague, clarifies over time, or changes the question mid-way, much like how clinicians actually talk in the real world. The benchmark requires the model to interpret the question, ask clarifying questions, write SQL, and return an answer. The contribution is mainly a realistic test setup plus evidence that there is still a big gap in making LLM agents dependable for EHR querying.
The paper is well thought out and original in framing EHR QA as a conversational, iterative task rather than single-turn SQL generation. The benchmark construction is solid, though it could be better validated. The writing is clear, making the setup and results easy to follow. I consider the significance moderate: it highlights an important reliability gap, but the contribution is mainly diagnostic rather than solution-oriented.
The analysis focuses mostly on pass/fail metrics, without deeper breakdowns of failure modes or model behavior, which limits the insights it offers for improving performance. Finally, the work emphasizes problem framing rather than proposing strategies to address the reliability gap, so the contribution feels diagnostic rather than forward-looking.
Could the authors clarify how the proposed method handles edge cases or scenarios where the stated assumptions do not hold?
Fully AI-generated

From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper introduces EHR-ChatQA, a new interactive benchmark for evaluating Large Language Model (LLM) agents accessing Electronic Health Record (EHR) databases, specifically addressing the practical challenges of query ambiguity and value mismatch common in clinical settings. EHR-ChatQA assesses the full end-to-end agentic workflow, including conversational refinement using an LLM-based user simulator, active schema and value tool use, and accurate SQL generation, across two distinct interactive flows: Incremental Query Refinement (IncreQA) and the more adaptive AdaptQA. Evaluations conducted across state-of-the-art closed- and open-source models reveal a critical robustness gap (Pass@5 versus Pass^5) of 35-60% in the challenging AdaptQA flow, demonstrating that current agents, while occasionally capable, are fundamentally unreliable for safety-critical EHR tasks.
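For readers less familiar with these metrics, the following is a minimal sketch of how Pass@k, Pass^k, and the resulting Gap-k are typically computed from per-instance trial outcomes; the paper's exact estimator may differ (for example, it could use an unbiased combinatorial estimator for Pass@k).

```python
from statistics import mean

def pass_at_k(trials_per_instance):
    """Fraction of instances solved in at least one of the k trials."""
    return mean(any(trials) for trials in trials_per_instance)

def pass_hat_k(trials_per_instance):
    """Fraction of instances solved in all k trials (consistent success)."""
    return mean(all(trials) for trials in trials_per_instance)

# Hypothetical outcomes: 3 instances x k=5 independent trials (True = success).
outcomes = [
    [True, True, False, True, True],       # brittle: succeeds only sometimes
    [True, True, True, True, True],        # consistent success
    [False, False, False, False, False],   # consistent failure
]
print(pass_at_k(outcomes))                          # Pass@5 ~ 0.67
print(pass_hat_k(outcomes))                         # Pass^5 ~ 0.33
print(pass_at_k(outcomes) - pass_hat_k(outcomes))   # Gap-5  ~ 0.33
```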
- The benchmark is the first interactive evaluation specifically designed for EHR question answering, holistically assessing the full agentic workflow including tool use and conversational refinement.
- Construction of the task instances is rigorously grounded in real-world clinical QA scenarios, utilizing publicly available EHR databases with renamed schemas to enforce genuine exploration rather than memorization.
- A sophisticated simulation environment employs a stochastic LLM user and a dedicated LLM-as-a-judge validator to mitigate simulation noise and ensure evaluation fidelity.
- Evaluation metrics centered on consistent success (Pass^k) and the resulting robustness gap (Gap-k) provide vital diagnostic insights into agent reliability, critical for safety-sensitive applications.
- The Adaptive Query Refinement (AdaptQA) flow, which represents the more challenging and novel adaptation scenario, comprises only 64 instances, potentially limiting the statistical robustness of evaluations in this critical area.
- Although the user is simulated by an LLM, its conversational spontaneity is heavily constrained by strict behavioral rules (Table 5), raising questions about the fidelity of the simulation to real clinical dialogue complexity.
- Evaluation relies significantly on extensive, database-specific SQL generation rules (Tables 8 and 9) provided to the agent, suggesting the benchmark may test adherence to prompting constraints rather than intrinsic database reasoning ability.
- Performance of open-source models on AdaptQA is critically low (0.0% Pass^5), which diminishes the utility of the benchmark for diagnosing failures in current open LLM architectures.
- More quantitative diagnostic insights are needed, detailing which specific types of user utterances or tool outputs most frequently trigger the observed dramatic drop in consistent success (Gap-k).
- Lack of transparency regarding the hyperparameters and thresholds used for the `value_similarity_search` tool makes replication or external validation of agent interaction logic difficult.
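To make this concern concrete, the sketch below is a purely hypothetical illustration of the kind of knobs such a tool typically exposes; only the tool name comes from the paper, while the signature, defaults, and use of `difflib` are my own assumptions.

```python
import difflib

def value_similarity_search(query_value, column_values, top_k=3, cutoff=0.6):
    """Hypothetical stand-in for the paper's value-linking tool (signature and
    defaults are assumptions): return up to top_k stored values whose string
    similarity to query_value is at least cutoff."""
    return difflib.get_close_matches(query_value, column_values, n=top_k, cutoff=cutoff)

# Example: the user says "acetaminophen" but the database stores a longer label.
candidates = ["Acetaminophen 500mg", "Aspirin 81mg", "Ibuprofen 200mg"]
print(value_similarity_search("acetaminophen", candidates))  # ['Acetaminophen 500mg'] at cutoff 0.6
# Both top_k and the similarity cutoff materially change which values the agent
# links, which is why they need to be reported for replication.
```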
- Given the importance of AdaptQA in revealing the robustness gap, what efforts are planned to expand its size to improve the statistical reliability of results for this difficult and novel task type?
- Could authors provide a more detailed, quantitative breakdown of the error types (Value Linking vs. SQL Generation) specifically correlated with the Gap-k metric, differentiating brittle failures from consistent failures?
- Since simulation validity is crucial, please provide a few detailed examples of dialogues deemed "invalid" by the LLM-as-a-judge validator to better illustrate how the validator enforces the specified user behavior rules.
Fully AI-generated

From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The authors seek to improve the benchmarking of language model agents on EHR-relevant database tasks, such as asking clarifying questions, resolving value mismatches, and generating accurate SQL. They find that the current LLMs they test struggle to succeed consistently at these tasks. They describe their method of benchmark creation as well as their testing procedure, including the use of LLM-based validation.
- I commend the authors for engaging with this particular topic within the Health AI field - optimizing the nuts and bolts of back-end EHR agent interaction will be very important to ensuring these tools are practically useful. Their recognition of the inherent ambiguity and clinical context-specificity of clinician questions is also important.
- The authors' introduction includes a reasonable summary of the relevant literature in the field and the purpose of their benchmark.
- The figures are clear and add to the quality of the work. The overall workflow and structure are also easy to follow.
- The core QA tasks are reasonable, and the level of human annotation appears overall strong (although I would love further detail on the "38 graduate-level contributors" and what was actually done).
- I greatly appreciate the depth of analysis the authors offer regarding the nature and presumed causes of errors. This strongly supports the validity of this system as a useful benchmark.
- In general, this paper appears to make a strong contribution to the broader literature.
- It is always difficult, when presenting such a benchmark, to know the extent to which the failures reported by the authors (e.g., the lower Pass^5 relative to Pass@5) are inherent to the model versus attributable to flaws in the authors' implementation of that model. While this is inevitable in benchmarking, I do think it should be clearly acknowledged that these results represent a plausible floor, but not at all a ceiling, for what the authors seek to evaluate.
- Benchmarks that use LLMs as simulated users carry a related concern: the questioner LLM may inadvertently "tip off" the answerer LLM. This does not render such benchmarks useless, and the authors do a good job of highlighting simulator-induced errors in their section 6, but the converse risk should perhaps also be discussed. That is, some failures make the test too difficult, while others may make it too easy.
- Perhaps more a direction for future work, but as a clinician I feel that these systems could be further elevated by using the clinical context itself, rather than just general clinical knowledge, to improve the questioning process.
- I am quite concerned, however, about the authors' use of the LLM-as-judge paradigm without any human-annotated gold-standard set to check whether the validator actually achieves its goals. How are we to know, for example, whether these invalidations are correct, or whether this "validator" is itself discarding valid behaviors? Quis custodiet ipsos custodes? This use of an unvalidated LLM-as-judge is very concerning, and I recommend that either further validation of this approach be offered (even a small human-audited agreement check, as sketched below) or the section be removed.
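Concretely, even a small audit would address this. The following is a minimal sketch, with hypothetical labels, of the kind of judge-versus-human agreement check I have in mind:

```python
from collections import Counter

def cohen_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between the LLM judge and human gold labels."""
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    j_counts, h_counts = Counter(judge_labels), Counter(human_labels)
    labels = set(judge_labels) | set(human_labels)
    expected = sum(j_counts[c] * h_counts[c] for c in labels) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical audit: validator verdicts vs. human annotations on 10 sampled dialogues.
judge = ["valid", "valid", "invalid", "valid", "invalid",
         "valid", "valid", "invalid", "valid", "valid"]
human = ["valid", "valid", "invalid", "valid", "valid",
         "valid", "valid", "invalid", "valid", "invalid"]
print(round(cohen_kappa(judge, human), 2))  # 0.52 here; report alongside raw agreement (0.80)
```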
1. Who were these 38 graduate-level contributors? How were they trained? What is their background? What was their specific input?
2. How was the "LLM-as-judge" validator system itself validated? What were the impacts of implementing this system?
Fully human-written

From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper proposes a new benchmark to better capture the real-world complexities of text-to-SQL queries over EHR databases. The authors propose an environment that leverages LLMs to simulate real user interactions. By going beyond static benchmarks, the paper argues, the environment better captures query ambiguity and value mismatch, and their interactive resolution, all of which are common in practice.
Disclaimer: I am not well versed in text-to-SQL datasets, let alone EHR-specific ones. My assessment is based only on the paper's content; it is possible that I missed important related work.
The paper is well written and easy to follow. The motivation for EHR text-to-SQL is clear, and the need to go beyond static benchmarks is also conveyed well.
The paper carefully controls for contamination by renaming the tables and columns in the database. The authors also sourced the database and queries from real hospitals, which makes the dataset interesting.
W1) The details of how the two user-interaction workflows, IncreQA and AdaptQA (Sections 4.2.2 and 4.2.3), are simulated are not clear to me. The authors should consider elaborating or illustrating them with an example.
W2) I also could not follow how the final response of an interaction is validated. Since queries can be altered somewhat arbitrarily mid-way (in AdaptQA), how is the evaluation done?
W3) To establish the reliability of the dataset, the authors should report validation confirming that the numbers in Table 3 are not inflated. In other words, how many interactions are misjudged due to failures of the user simulator, the user validator, or the response validator?
W4) The proposed benchmark has a heavy systems component, since the authors are proposing an environment for validation, but I did not find system setup instructions in the supplementary material of the main paper.
Please comment on the questions raised in the Weaknesses section.
Fully human-written |