ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 3 (75%) | 4.67 | 3.67 | 4761 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 2.00 | 5.00 | 2015 |
| Fully human-written | 0 (0%) | N/A | N/A | N/A |
| Total | 4 (100%) | 4.00 | 4.00 | 4074 |
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data

Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces MedInsightBench, a new benchmark for evaluating medical analytics agents on the ability to discover multi-step, clinically meaningful insights from multi-modal medical data (particularly pathology images and reports). The benchmark contains 332 curated medical cases and 3,933 verified insights, each annotated with corresponding questions, evidence, and analysis goals. The authors further propose MedInsightAgent, a multi-agent framework comprising three components:
- Visual Root Finder (VRF): identifies salient visual features and generates initial analytical questions.
- Analytical Insight Agent (AIA): answers questions and synthesizes insights using a specialized pathology model (PathGen-LLaVA).
- Follow-up Question Composer (FQC): iteratively generates deeper and complementary questions to extend reasoning.

The paper introduces an automated evaluation protocol for insight discovery (Recall, Precision, F1, and Novelty) and provides comprehensive experiments comparing MedInsightAgent to state-of-the-art LMMs (GPT-4o, GPT-5, Qwen2.5-VL, Deepseek-VL2, InternVL3-38B) and ReAct-based frameworks. Results show that MedInsightAgent (Qwen2.5-VL backbone) achieves the best performance across all metrics, substantially improving insight novelty and interpretability.

Strengths:
- Novel task definition: Insight discovery in multimodal medical contexts is a new and important evaluation dimension.
- High-quality dataset: Expert-verified, diverse, and hierarchically structured (goals → questions → insights → evidence).
- Robust evaluation: Four complementary metrics (Recall, Precision, F1, Novelty) enable both quantitative and qualitative assessment.
- Strong baseline coverage: Includes several frontier LMMs and a ReAct comparison.
- Agent design clarity: The modular decomposition (VRF–AIA–FQC) provides transparency and potential for transfer to other domains.
- Comprehensive analysis: Includes ablation, human evaluation, redundancy statistics, and case studies illustrating improvement in reasoning depth.

Weaknesses:
- Domain limitation: All data originates from TCGA pathology, focusing heavily on cancer cases. The generalizability to other medical imaging types (radiology, histology, ophthalmology) remains untested.
- Limited human expert benchmarking: While correctness and rationality are validated, there is limited comparison of agent-generated insights to expert-generated analytical reports.
- Lack of longitudinal reasoning: The framework operates on single-image cases without temporal patient data, which would better test clinical reasoning.
- Scalability and cost: The multi-agent setup and web retrieval modules increase inference latency and resource cost, though this is not quantified.
- Novelty metric dependency: Novelty scoring relies on textual distance metrics (ROUGE/G-Eval) rather than domain novelty validation by clinicians.

Questions:
1. How does MedInsightAgent handle contradictory evidence across multi-modal sources (e.g., text vs. image)?
2. Could the Follow-up Question Composer benefit from reinforcement learning or self-critique loops to adaptively decide iteration depth instead of a fixed parameter p?
3. Have you evaluated zero-shot transfer of MedInsightAgent to non-pathology modalities (e.g., radiology, dermatology)?
4. How consistent are the insight novelty metrics with human perception of novelty? Were any inter-rater reliability studies conducted?
5. Would you consider releasing a smaller public subset of MedInsightBench with de-identified samples for community benchmarking?

EditLens Prediction: Fully AI-generated
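A minimal sketch of how the VRF → AIA → FQC pipeline with a fixed iteration depth p (see question 2) could be orchestrated is given below; the three module callables, their signatures, and the early-stop behavior are assumptions based on the description above, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    question: str
    answer: str
    evidence: str

def run_medinsight_loop(image, goal,
                        visual_root_finder,
                        analytical_insight_agent,
                        followup_question_composer,
                        p=3):
    """Hypothetical VRF -> AIA -> FQC orchestration with fixed depth p.

    The three callables stand in for the paper's modules; their exact
    signatures are assumptions, not the authors' API.
    """
    insights: list[Insight] = []
    # VRF: salient visual features -> initial (root) analytical questions
    questions = visual_root_finder(image, goal)
    for _ in range(p):                      # fixed iteration depth p
        next_questions = []
        for q in questions:
            # AIA: answer the question and synthesize an insight
            answer, evidence = analytical_insight_agent(image, q)
            insights.append(Insight(q, answer, evidence))
            # FQC: propose deeper / complementary follow-up questions
            next_questions.extend(followup_question_composer(q, answer, insights))
        if not next_questions:              # no follow-ups -> stop early
            break
        questions = next_questions
    return insights
```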
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data

Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper presents MedInsightBench, a benchmark framework for evaluating large multimodal models (LMMs) and agent-based systems in multi-step medical insight discovery. The benchmark includes 332 cancer pathology cases and 3,933 expert-validated insights, integrating medical images, reports, and analytical goals into a unified evaluation scheme. The authors further propose a three-stage multi-agent framework, MedInsightAgent, comprising a Visual Root Finder, an Analytical Insight Agent, and a Follow-up Question Composer. Through visual analysis, question generation, external knowledge retrieval, and multi-turn reasoning, the framework enhances interpretability and insight depth. Experiments demonstrate that MedInsightAgent outperforms mainstream LMMs (e.g., GPT-4o, GPT-5, Qwen2.5-VL) in both F1 and Novelty metrics, highlighting the limitations of current models and their potential for improvement in medical insight generation.

Strengths:
1. The dataset is well-designed, balancing image quality, analytical objectives, and question–insight pairing, which ensures strong systematicity and evaluation value.
2. MedInsightAgent adopts a multi-round chain structure (Root Question → Insight → Follow-up), effectively enhancing the depth and diversity of insights while improving interpretability.
3. The benchmark introduces four complementary metrics (Insight Recall, Precision, F1, and Novelty), offering a more rigorous and comprehensive evaluation than prior text-matching-based methods.

Weaknesses:
1. The insight generation and validation process depends heavily on manual proofreading, which may limit scalability, consistency, and efficiency when applied to larger or more diverse medical datasets.
2. Although the multi-agent framework (MedInsightAgent) is conceptually interesting, its algorithmic design remains largely engineering-driven, lacking explicit optimization objectives, convergence proofs, or theoretical analysis of complexity.
3. The mathematical formulations (Eq. 1–3) mainly describe procedural steps rather than formal optimization goals, reflecting limited theoretical depth and rigor.
4. The definition of “insight” is semantically ambiguous; while the authors propose F1 and Novelty metrics, they do not clearly distinguish between linguistic novelty and genuine medical discovery, and some examples resemble report paraphrasing.
5. The experiments, though extensive, lack statistical significance testing, detailed error analysis, and cross-domain generalization evaluation, which weakens the reliability of the reported improvements.
6. Comparisons with other medical agent frameworks (e.g., MedAgentsBench, AgentClinic) are insufficient, and the integration potential with large medical foundation models such as Med-PaLM M is not explored.
7. The inter-agent communication mechanism, including message passing and reasoning order, is under-specified, making it difficult to reproduce or verify the cooperative reasoning logic.
8. The benchmark focuses primarily on cancer pathology and single-round insight generation, without testing transferability to other modalities (e.g., radiology or endoscopy) or multi-stage clinical reasoning, which limits generalization and real-world clinical relevance.

Questions:
1. How do the authors ensure causal consistency across each reasoning step within the multi-agent framework? Are there potential issues of pseudo-insights or reasoning drift during multi-turn inference?
2. How is the baseline for measuring “Original vs. Innovation” defined? Could linguistic diversity be mistakenly identified as genuine insight innovation?
3. How is the correlation between images and reports quantitatively validated, and does error propagation across modalities affect the accuracy of the final insights?
4. Was any inter-rater agreement analysis conducted to assess annotation reliability? Could LLM-assisted insight generation introduce semantic noise or bias?
5. If MedInsightAgent were to be applied to other medical domains (e.g., radiology or multi-organ pathology), would the Root Question module need to be redesigned or re-trained for domain adaptation?

EditLens Prediction: Fully AI-generated
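For concreteness on how set-level metrics such as Insight Recall, Precision, and F1 might be computed, a minimal sketch under the assumption of greedy one-to-one matching between generated and reference insights is given below; the `similar` judge and the 0.5 threshold are illustrative placeholders, not the benchmark's actual scoring pipeline.

```python
def insight_scores(generated, reference, similar, threshold=0.5):
    """Greedy one-to-one matching of generated vs. reference insights.

    `similar(a, b)` is any pairwise scorer in [0, 1] (e.g., an LLM judge
    or embedding cosine); the 0.5 threshold is an illustrative assumption.
    """
    matched_refs = set()
    true_positives = 0
    for g in generated:
        # find the best still-unmatched reference insight for this output
        best_j, best_s = None, 0.0
        for j, r in enumerate(reference):
            if j in matched_refs:
                continue
            s = similar(g, r)
            if s > best_s:
                best_j, best_s = j, s
        if best_j is not None and best_s >= threshold:
            matched_refs.add(best_j)
            true_positives += 1
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```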
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces MedInsightBench, a new benchmark dataset for evaluating the ability of large multi-modal models to discover multi-step, deep insights from multi-modal pathology data. The authors also propose MedInsightAgent, a multi-agent framework designed to improve insight discovery, and show that it outperforms baseline LMMs on their new benchmark.

Strengths:
The paper's primary strength is addressing an important gap in existing evaluations.

Weaknesses:
The experimental comparisons are limited, the methodology for dataset creation and evaluation lacks transparency, and the true novelty of the agent framework's contribution is unclear:
1. The evaluation compares MedInsightAgent against LMM-only baselines and a single general-purpose agent framework, ReAct, while the paper's own "related works" section lists numerous domain-specific medical agent frameworks (e.g., MedAgentsBench, AgentClinic). Some of these works should be included as baselines.
2. The paper repeatedly states that human verification or human experts were used to curate the dataset. However, it provides no details on the verifiers' qualifications (e.g., were they board-certified pathologists?), the verification protocol, or the inter-annotator agreement.
3. The "Insight Novelty" metric, a key part of the proposed evaluation framework, is poorly explained and methodologically questionable. The appendix states that correct insights are those with a G-Eval score > 5, an arbitrary threshold that seems inconsistent with all reported G-Eval scores in Table 3. It then describes a process in which incorrect insights are re-evaluated for novelty. This process is opaque and makes it difficult to trust the reported novelty scores.
4. The ablation study shows that the IAT has the greatest impact on performance. It is therefore unclear how much of MedInsightAgent's performance gain comes from its novel agentic orchestration versus simply using a powerful, specialized tool that other baselines (like ReAct) were not given access to.

Questions:
Please see the weaknesses above.

EditLens Prediction: Lightly AI-edited
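To make the concern about the novelty pipeline concrete, a minimal sketch of the two-stage split described in the weaknesses above (a G-Eval cut-off, then novelty re-evaluation of the remainder) is given below; the `g_eval` and `novelty_judge` callables are hypothetical stand-ins, and only the > 5 cut-off comes from the review's reading of the appendix.

```python
def split_correct_and_novel(insights, g_eval, novelty_judge, cutoff=5):
    """Reconstruction of the described novelty pipeline, not the authors' code.

    Insights scoring above `cutoff` under G-Eval are treated as correct;
    the remainder are re-scored by a separate novelty judge.
    """
    correct, leftover = [], []
    for ins in insights:
        if g_eval(ins) > cutoff:        # "correct" per the stated threshold
            correct.append(ins)
        else:
            leftover.append(ins)
    # Incorrect insights are re-evaluated for novelty (original vs. innovation)
    novel = [ins for ins in leftover if novelty_judge(ins)]
    return correct, novel
```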
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces **MedInsightBench**, a new multimodal benchmark consisting of 332 cancer pathology cases (≈3.9k annotated insights) that pairs whole-slide images with structured analytical goals, question-insight pairs, and difficulty levels. It also proposes **MedInsightAgent**, a three-module multi-agent pipeline (Visual Root Finder, Analytical Insight Agent, Follow-up Question Composer) that leverages image summarisation, web retrieval, and a pathology-fine-tuned LMM (PathGen-LLaVA) to generate multi-step medical insights. Experiments compare several state-of-the-art large multimodal models (GPT-4o, GPT-5, DeepSeek-VL2, Qwen2.5-VL-32B-Instruct, InternVL3-38B) and two agent frameworks (ReAct and the proposed MedInsightAgent) on the benchmark using four automatically computed metrics: Insight Recall, Insight Precision, Insight F1, and Insight Novelty (original vs. innovation scores). Ablation studies remove individual MedInsightAgent modules to assess their impact.

Strengths:
1. **Novel Benchmark Idea** – The focus on *multi-step insight discovery* rather than single-turn QA addresses a genuine gap in current multimodal evaluation.
2. **Dataset Construction Pipeline** – The authors describe a fairly detailed pipeline (WSI down-sampling, OCR-based report extraction, LLM-assisted insight generation, human verification) and provide some quality analyses (correctness, rationality, coherence).
3. **Agent Architecture** – The three-module design is clearly motivated, and the paper includes a full algorithmic description, making the system reproducible in principle.
4. **Comprehensive Baselines and Ablations** – A wide range of recent LMMs are evaluated, and a ReAct-style agent baseline is included for comparison. The paper also quantifies the contribution of each MedInsightAgent component, showing measurable performance drops when modules are removed.

Weaknesses:
- There seems to be a mismatch between the ground truth (largely extracted from pathology reports) and the model inputs used during evaluation. Many “ground-truth insights” (e.g., HPV/p16 status, node counts, margins, R-status, IHC panels) cannot be inferred from an H&E image alone, especially after whole-slide downsampling to PNG. In Table 7 and the case studies, several insights are report-only facts. If the benchmark input at test time is Goal + Image (as Table 2 indicates), a substantial subset of the ground truth is fundamentally unanswerable from the provided modality, confounding recall/precision/F1 and making negative findings uninterpretable. Unless I missed something, the benchmark input appears to exclude the report text at evaluation time; this undermines the “multi-modal” positioning and obscures what is actually measurable.
- Baseline parity is not ensured. MedInsightAgent uses an additional domain-specific image-analysis tool (PathGen-LLaVA) and web retrieval; ReAct is restricted to a computation module and web search. If multi-tool access improves performance, ReAct should be given equivalent tools to isolate the effect of the orchestration strategy rather than tool availability.
- Limited human evaluation and unclear rigor: the paper mentions 10 human experts for 100 data points and a separate 100-sample data quality audit, but provides no inter-annotator agreement, precise scoring rubric, or confidence intervals. Claims like “strong concordance with human judgments” lack quantified evidence (e.g., Pearson/Spearman/Kendall correlations, bootstrap CIs). There is also no statistical significance testing, and no confidence intervals or variance are reported across methods. Reported gains (e.g., G-Eval F1 improvements of ~0.05–0.06) may be within evaluator noise.
- I also could not determine whether PathGen-LLaVA was exposed to similar TCGA distributions (potential leakage) or how cases were partitioned. There is also no compute/latency/cost reporting for the agent loops.
- Case studies show MedInsightAgent often produces general, textbook-like statements (e.g., “perineural invasion suggests aggressive tumor”), which can inflate “novelty” under the current metric but do not demonstrate image-grounded discovery. Without human verification that each output is supported by the image at the provided resolution, improvements may reflect better phrasing rather than better clinical insight. Consequently, the conclusion that “higher F1 corresponds to greater novelty” may be an artifact of the novelty scoring pipeline rather than a real causal relation.
- The claims “first comprehensive benchmark for medical insight discovery” and “strong concordance with human judgments” are not sufficiently supported. Prior work targets medical agents and multimodal evaluation; the novelty claim, too, should be carefully scoped to “pathology image insight discovery with goal/question/insight structure.”

Questions:
1. Inputs: At evaluation time, do models see only Goal + Image, or also the report text? If only Goal + Image, how do you justify including report-only insights (e.g., HPV status) in the ground truth and metrics?
2. Image fidelity: What downsampling ratios, target resolutions, and magnifications are used? Are multi-scale tiles or high-power patches provided? How do you ensure image sufficiency for cellular-scale findings?
3. Ground-truth filtering: Did you filter insights to those inferable from an H&E image at the provided resolution? If not, what fraction of insights are inherently non-inferable from the image alone?
4. Human validation: Who were the human experts (qualifications, number per sample)? Please report inter-annotator agreement (e.g., Cohen’s/Fleiss’ kappa) and confidence intervals for human scores.
5. Novelty metric: How many “novel” insights were spot-checked by pathologists? What fraction were truly image-grounded? Please report a blinded human audit on a random subset.
6. Baseline parity: Why not equip ReAct with PathGen-LLaVA and identical tools to isolate the orchestration benefits? Conversely, evaluate MedInsightAgent without PathGen-LLaVA to separate tool effects from agent effects.
7. Statistical rigor: Please report variance, CIs, and tests of significance (e.g., bootstrap) for Table 3 and Table 4. Also provide human-vs-automatic evaluator correlations with CIs.
8. Data release and splits: Will the dataset, code, and prompts be released, including a clear train/val/test split and case IDs? Is there any overlap between PathGen-LLaVA's training data and the TCGA cases used here?
9. Safety: Will you add explicit guidance that outputs are research-only and not for clinical decision making?
10. Claims: Please precisely scope the “first benchmark” claim and provide a systematic comparison distinguishing MedInsightBench from MedRepBench, 3MDBench, and other agent-based medical datasets.

EditLens Prediction: Fully AI-generated
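For the statistical-rigor request in question 7, a minimal percentile-bootstrap sketch over per-case metric values is shown below; the per-case score array is an assumed input, since the paper reports only aggregate numbers.

```python
import random

def bootstrap_ci(per_case_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case metric values
    (e.g., per-case insight F1). Purely illustrative."""
    rng = random.Random(seed)
    n = len(per_case_scores)
    means = []
    for _ in range(n_resamples):
        # resample cases with replacement and record the resampled mean
        resample = [per_case_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(per_case_scores) / n, (lo, hi)
```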