ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (33%) | 6.00 | 3.00 | 2618 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (33%) | 6.00 | 3.00 | 4639 |
| Fully human-written | 1 (33%) | 8.00 | 5.00 | 4533 |
| Total | 3 (100%) | 6.67 | 3.67 | 3930 |
Review 1

Title: CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces CounselBench, a clinician-grounded benchmark for open-ended mental-health QA composed of two complementary parts:
- CounselBench-Eval: 2,000 expert evaluations produced by 100 licensed or professionally trained mental-health practitioners, who rate and annotate responses (LLMs + human therapists) to 100 real patient questions across six evidence-based dimensions: overall quality, empathy, specificity, factual consistency, medical advice, and toxicity, with span-level evidence and written rationales.
- CounselBench-Adv: an expert-crafted adversarial set of 120 prompts designed to prospectively elicit clinically meaningful failure modes observed in CounselBench-Eval.
The pipeline samples real questions from CounselChat, obtains model and human responses, and collects blind expert ratings (five raters per item), enabling fair comparison and inter-rater reliability analysis. The authors find that state-of-the-art LLMs can score well on several quality dimensions yet still make recurring, clinically salient errors. Failure mode distributions exhibit model-family patterns and can shift across major releases.

Strengths:
- **High-Quality Evaluation and Scale**: The paper's most significant contribution is the rigor and scale of its evaluation. It did not rely on students or crowdworkers, but instead recruited 100 mental health professionals to provide 2,000 evaluations. These 100 professionals hold 32 distinct license/degree types and span 43 specialization areas, ensuring the professionalism and diversity of the results.
- **Clinically Grounded Evaluation Dimensions**: The study moves beyond simple "accuracy" to define six clinically crucial dimensions: overall quality, empathy, specificity, factual consistency, medical advice (safety), and toxicity. These dimensions were developed based on clinical literature and expert consultation, truly capturing the core of mental health QA.
- **Innovative Adversarial Benchmark**: The second component, COUNSELBENCH-ADV, is highly innovative. It moves beyond passive evaluation by analyzing failure modes identified in COUNSELBENCH-EVAL (e.g., GPT-4 suggesting specific therapy techniques, Llama-3.3 being judgmental, and Gemini-1.5-Pro being apathetic). Experts then authored 120 new questions specifically designed to trigger these identified failures, creating a systematic way to actively stress-test models.
- **Key Falsification of "LLM-as-Judge"**: As the "LLM-as-Judge" paradigm gains popularity, this paper provides a powerful demonstration of its limitations in high-stakes domains. The study found that LLM judges "systematically overrate model responses" and "overlook safety concerns identified by human experts". For instance, LLM judges failed to identify most toxic content or factual errors. This finding serves as a critical warning for the field.

Weaknesses:
- **Representativeness of the "Human Baseline"**: This is a limitation of the data source, not the study's methodology. The "human therapist" baseline was sourced from the "top-voted answer" for each question on CounselChat. The authors acknowledge that these "forum contributions are informal and vary in quality". Curiously, Llama-3.3 outperformed this human baseline on five of the six dimensions. This makes the results hard to interpret: have LLMs surpassed human performance, or have they simply surpassed a (potentially low-quality) standard of informal forum answers?
- **Limitation of a Single Data Source**: All patient questions were sourced from a single public forum, CounselChat. The style of questions and needs expressed in this asynchronous, public format may differ significantly from those in private therapeutic conversations, EHR messages, or peer-support forums.
- **Reliance on LLM-as-Judge in Adversarial Evaluation**: This presents a core methodological contradiction. In COUNSELBENCH-EVAL, the paper demonstrates that LLM-as-Judge is unreliable. However, to assess the 3,240 responses in COUNSELBENCH-ADV, the study used GPT-4.1 as a "scalable labeling tool". While the authors validated this as a "practical proxy" via a human spot-check (72.9% agreement), this reliance undoubtedly weakens the reliability of the adversarial results.

Questions:
In the first part of your paper, you convincingly demonstrated that LLM judges "systematically overrate" responses in the mental health domain. Why, then, did you choose to use GPT-4.1 as the primary evaluator for the 3,240 responses in COUNSELBENCH-ADV? Despite the 72.9% agreement, does this not imply that the failure rates in Table 3 might be underestimated, given that the evaluation paradigm itself is one you showed to be unreliable?

EditLens Prediction: Lightly AI-edited
Review 2

Title: CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces a new benchmark for evaluating LLMs on open-ended mental health question answering (QA). The questions are sourced from 100 real, patient-facing mental health questions posted on the CounselChat forum. 100 licensed mental health professionals rated responses from 3 LLMs and the original human therapist response across 6 dimensions: quality, empathy, specificity, medical advice, factual consistency, and toxicity. The expert annotations include not only numeric ratings but also span-level evidence extraction and written rationales. An additional dataset, created by 10 of the experts, is designed to trigger specific failure modes identified in the benchmark evaluation.

Strengths:
* The benchmark is sourced from a forum where responses were provided prior to 2022, most likely by trained human therapists, and is co-designed with mental health professionals. The six evaluation dimensions are grounded in the clinical psychology literature.
* The benchmark contains evaluations from 100 licensed mental health professionals encompassing 32 license types and 43 specialization areas. The annotations are rich; the dataset represents one of the largest expert-annotated resources for open-ended mental health QA and also has high inter-annotator agreement.
* The two benchmarks cover not only a systematic evaluation of the LLM and human responses, but also an expert-authored adversarial set designed to stress-test LLMs.
* Provides a critical evaluation of the LLM-as-Judge paradigm, highlighting its inability to flag problematic content and safety concerns identified by human experts.
* Multiple pilot studies were conducted to refine the evaluation protocol, with thorough analysis of annotator behavior and a clear methodology.

Weaknesses:
* The dataset has slight scale limitations: there are only 100 questions (5 per topic, 20 topics), chosen by popularity, which may not fully reflect the diversity and subtlety of mental health questions (as motivated in the introduction). Similarly, the adversarial benchmark contains only 120 questions, although there are 20 per failure mode.
* There is a heavy reliance on automated evaluation of the adversarial responses, which seems to go against the findings of the LLM-as-Judge critique. Models from the same family were used both to evaluate the adversarial responses and to identify the failure modes in the original responses of the first benchmark (GPT-4.1 and GPT-4, respectively).
* The adversarial benchmark, while interesting from a high-level perspective, is confusing. It seems to treat high trigger rates, as determined by GPT-4.1, as evidence of model problems. However, it is not clear what makes a question adversarial rather than merely difficult. Also, it was designed from 2 failure modes for each of the 3 original base models but then tested across new models (like Claude). Is the expectation that these are model-specific problems or truly universal vulnerabilities?
* Is it truly fair to benchmark against the human baselines, given that the forum answers were often informal, varied in quality, and could be written anonymously? Are these fair comparisons with one another?

Minor:
* The claim of 2,000 expert annotations is slightly misleading, as the dataset contains 100 total questions with 4 different responses and 5 annotators per response pair. This breakdown is never explicitly stated.
* The phrasing in lines 404-405 is slightly confusing.
* There are inconsistent periods throughout the text. For example, line 182 doesn't have an ending period, and the paragraph headings sometimes have periods and sometimes don't (e.g., lines 269, 313, etc.).

Questions:
1. With only 100 questions (5 per topic) selected primarily by popularity (upvotes), does this benchmark capture the diversity and subtlety of mental health questions?
2. What makes the questions in COUNSELBENCH-ADV adversarial in nature? Is it mostly difficulty, or does it lie along other dimensions?
3. The main result on COUNSELBENCH-EVAL is that LLM judges might miss safety concerns, yet the adversarial responses are then predominantly judged using GPT-4.1, with only ~3% manually validated (72.9% agreement with human experts). These choices seem to be at odds with one another, and it is a bit hard to determine the core takeaway.
4. Adversarial questions are designed from 3 specific models, but the results do not suggest that these are universal vulnerabilities. Should we expect a question designed to trigger specific vulnerabilities to also trigger issues across model families?

EditLens Prediction: Fully human-written
Review 3

Title: CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
Evaluating LLMs on authentic, open-ended mental-health questions is urgent, since even small errors can cause immediate harm, but current resources lean on MCQ proxies, small expert panels, or LLM-as-judge protocols that often miss clinically salient failures, motivating a large-scale, clinician-grounded benchmark with targeted stress tests. The paper introduces CounselBench, built around a clinically grounded six-dimension rubric spanning quality and safety (overall quality, empathy, specificity, factual consistency, medical advice, toxicity). It releases CounselBench-Eval (2,000 evaluations by 100 licensed professionals on real patient queries, with span-level annotations and rationales) and CounselBench-Adv (120 clinician-authored adversarial questions), shifting evaluation beyond factual correctness toward clinically meaningful behavior in mental-health QA.

Strengths:
1. Clear, practice-driven motivation with a clinician-grounded benchmark and stress tests. The paper targets authentic, open-ended mental-health QA, where small errors can harm users, and answers with a two-part benchmark co-designed with clinicians, including an adversarial set authored by 10 licensed professionals from failure modes observed in practice.
2. Comprehensive evaluation and careful curation. It defines a six-dimension rubric (overall quality, empathy, specificity, factual consistency, medical advice, toxicity) and scales to 2,000 expert evaluations by 100 licensed professionals with span-level annotations and written rationales, across 100 real patient questions spanning 20 topics, yielding both breadth and clinically meaningful granularity.
3. Thorough experimental coverage and transparent methodology. The work contrasts human and LLM judges under the same rubric, probes safety with a dedicated adversarial set, and documents the pipeline with an accompanying repo, enabling reproducibility.

Weaknesses:
1. The evaluation is static and single-turn: models answer a one-shot patient question with no opportunity to ask clarifying questions, adapt to feedback, or repair errors across turns, capabilities that are central to safe, supportive counseling dialogues.
2. The adversarial component is also single-round and non-iterative: prompts are issued once (with three stochastic samples) and failure triggers are labeled post hoc (largely by GPT-4 with limited human validation), rather than via an attacker–defender loop that escalates difficulty across rounds and model updates. This design surfaces important vulnerabilities but underestimates the challenges of sustained red-teaming and recovery dynamics.

Questions:
See weaknesses.

EditLens Prediction: Fully AI-generated