ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction   | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated    | 1 (25%)  | 4.00       | 4.00           | 4119
Heavily AI-edited     | 0 (0%)   | N/A        | N/A            | N/A
Moderately AI-edited  | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited     | 1 (25%)  | 4.00       | 4.00           | 3239
Fully human-written   | 2 (50%)  | 5.00       | 4.50           | 1840
Total                 | 4 (100%) | 4.50       | 4.25           | 2760
Individual Reviews
Title: Dolphin: A multimodal large language model for Ultrasound Understanding
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes DOLPHIN, a multimodal large model for ultrasound understanding. The authors construct a large-scale, multi-source training corpus and define the Dolphin Ultrasound Data Protocol (DUDP) to unify data formats across various ultrasound tasks. The model is trained in three stages: post-training, instruction tuning, and reinforcement learning via UARPO. The paper reports a SOTA U2-score of 0.5835 on U2-Bench and observes that incorporating cross-domain deep reasoning data improves ultrasound reasoning capabilities. A Bayesian formulation with latent variable \(c\) is proposed to explain how deep reasoning prompts act as a prior guiding transfer. Additionally, the reasoning mode significantly outperforms the standard mode in diagnosis, detection, and measurement tasks.

Strengths:
1. DUDP provides a unified structured JSON format for multi-task/multimodal ultrasound data, facilitating training, evaluation, and reproducibility.
2. Cross-domain reasoning transfer is explored via two system prompts (vanilla vs. deep reasoning), revealing a U-shaped transition curve backed by a Bayesian framework.
3. Comprehensive evaluation on 8 U2-Bench tasks and general/medical VQA benchmarks. The reasoning mode shows gains in diagnosis, detection, and measurement.
4. Deployment on real ultrasound devices is claimed, suggesting promising application potential.

Weaknesses:
1. Data size and source inconsistency: the abstract claims ">2,000,000" instruction-response pairs, while the body and figures refer to both 2B (billion) and 2M, which conflict. A clear explanation of the counting method and the usable data ratio is needed.
2. Key data curation details are missing: What are the exact sources of teaching materials? Were licenses obtained? What are the annotators' qualifications? Was annotation agreement assessed? Source ratios, the deduplication strategy, and bias in synthetic/distilled data are not discussed.
3. Lack of statistical robustness: no significance testing, no variance across multiple seeds, and it is unclear whether the reported gains exceed random variation.
4. Limited comparability and reproducibility: comparisons with closed-source models (GPT-4/5, Gemini) lack disclosure of settings (prompt format, context length, decoding parameters, image resolution, CoT/multi-turn settings). This affects the fairness of the evaluation.
5. Too few comparisons with other medical multimodal models: only two are compared. Classical medical VLMs such as LLaVA-Med, Qilin-Med-VL, LLaVA-Ultra, and EchoCLIP should be included.
6. Ethics and compliance concerns are insufficiently addressed: textbook licensing, public data usage rights, and the legal status of synthetic/distilled content are not clarified. Real-device deployment lacks information about clinical trials, safety measures, and human-in-the-loop design.

Questions:
1. Please clarify the number of effective samples used in each stage (post-training, IFT, RLHF-like) and the ratio of public vs. private vs. synthetic/distilled data. How were duplicates handled?
2. Did you consider continuous rewards (e.g., IoU for detection, L2 for keypoints, MAE for measurements)? Does the binary reward cause policy collapse or overfitting to format? (See the reward sketch after this review.)
3. Please report all main results with mean ± std over 3–5 random seeds. Are the U2-Bench improvements statistically significant?
4. Were decoding parameters (temperature, top-p, max tokens), image preprocessing (resolution, cropping), system prompts, and CoT/multi-turn settings consistent across models? If not, please provide detailed comparisons.
5. Are the textbooks and guidelines licensed for use? How were clinical/public datasets de-identified?
6. Please provide qualitative failure cases (image/video + reasoning steps) and analyze failure causes (non-standard views, artifacts, occlusion, parameter mismatch).
7. What are the latency, memory, and token usage differences between the reasoning and standard modes? Are they acceptable for deployment?
8. Section 2 Data Curation: Which versions of GPT and DeepSeek were used for distillation? How were "overly short responses" or "excessive images" defined? What is the data volume before and after filtering?

EditLens Prediction: Fully AI-generated
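[Editor's note] For context on question 2 above, here is a minimal sketch of the kind of continuous rewards the reviewer suggests (IoU for detection, L2 decay for keypoints, relative-error for measurements). All function names and scaling constants are illustrative assumptions, not the paper's actual UARPO reward design.

```python
# Hypothetical continuous rewards for a GRPO/UARPO-style RL stage.
# These replace a binary correct/incorrect signal with graded feedback;
# names and scales are illustrative assumptions, not the paper's design.
import math

def iou_reward(pred_box, gt_box):
    """Detection reward: intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0

def keypoint_reward(pred_xy, gt_xy, scale=0.1):
    """Keypoint reward: exponential decay in the L2 distance (normalized coords)."""
    dist = math.hypot(pred_xy[0] - gt_xy[0], pred_xy[1] - gt_xy[1])
    return math.exp(-dist / scale)

def measurement_reward(pred_value, gt_value, tolerance=0.2):
    """Measurement reward: 1 minus scaled relative absolute error, clipped to [0, 1]."""
    rel_err = abs(pred_value - gt_value) / (abs(gt_value) + 1e-8)
    return max(0.0, 1.0 - rel_err / tolerance)
```

Graded signals like these keep a learning gradient even when an answer is close but not exact, which is one reason the reviewer asks whether a purely binary reward risks policy collapse or overfitting to output format.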
Title: Dolphin: A multimodal large language model for Ultrasound Understanding
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper presents Dolphin, an MLLM for ultrasound understanding with chat and reasoning modes. The model uses a three-stage pipeline: post-training on ultrasound and general data, instruction tuning for dialogue and step-by-step analysis, and reinforcement learning with Ultrasound Answer Reward Preference Optimization (UARPO) to shape reasoning. The training corpus is built from textbooks, public ultrasound sets, distilled knowledge, and general sources, and is standardized with the Dolphin Ultrasound Data Protocol (DUDP). The model is evaluated on U2-Bench and various general and medical imaging tasks, outperforming existing models, as shown by state-of-the-art U2-Bench results. The study also analyzes the impact of mixed-domain reasoning data and the theoretical grounding of cross-domain reasoning transfer, highlighting the emergence of transferable reasoning capabilities in ultrasound tasks.

Strengths:
1. The analysis of cross-domain reasoning transfer is clear and well thought through. The paper gives a Bayesian account of how a prompt that asks for deep reasoning shifts the prior over chains of thought and can help even without in-domain reasoning data. The text explains the interference-to-transfer transition and ties it to a sufficient share of reasoning data, which matches the reported trend. (A sketch of this Bayesian reading follows this review.)
2. The paper reports top U2-Bench performance with a U2-score of 0.5835 and shows consistent benefits from the deep reasoning mode on diagnosis accuracy, detection accuracy, and measurement error. It also states that the system has been integrated on real ultrasound devices, which speaks to practical relevance.

Weaknesses:
1. The pipeline uses GPT-4 semantic screening, another LLM for response checks, and a brief mention of medical expert validation, yet the paper does not report who the experts are, how many experts participated, what specialties they represent, what rubric or thresholds were used, or any inter-rater agreement. This makes it hard for peers to reproduce the pipeline, audit bias, or estimate error rates. Given ultrasound's dependence on subtle imaging cues like artifacts and view selection, reliance on general LLM judges without detailed safeguards raises the risk that flawed pairs enter the data at scale.
2. The set of medical baselines in Table 1 is narrow. The medical-specific block lists MiniGPT-Med and MedDr, while several commonly cited medical models are absent. As a result, the state-of-the-art claim on U2-Bench rests on a comparison that omits strong medical references, which weakens the claim for the medical VLM landscape.
3. Table 1 lists only MiniGPT-Med and MedDr in the Medical-Specific Open-Source Models block, while widely used medical baselines such as LLaVA-Med and Huatuo-O1 are missing (e.g., I think LLaVA-Med can achieve more than 60% accuracy on VQA-RAD). It instead places LLaVA-1.5-13B under general open-source models, not the medical variant. As a result, the comparison set is narrow and likely weak, which makes it difficult to judge whether the reported gains truly represent state-of-the-art performance in the medical VLM landscape.
4. Some of the images in Figure 4 have been compressed, affecting readability.

Questions:
Please see the weaknesses above.

EditLens Prediction: Lightly AI-edited
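[Editor's note] To make the Bayesian reading in strength 1 concrete, here is a minimal sketch. This is an assumed formulation consistent with the reviewers' description of a latent chain-of-thought variable \(c\) and a system prompt acting as a prior; it is not necessarily the paper's exact notation.

```latex
% Assumed formulation: the answer y for input x is obtained by marginalizing
% over a latent chain of thought c, whose prior depends on the system prompt s.
\[
  p(y \mid x, s) \;=\; \sum_{c} p(y \mid x, c)\, p(c \mid x, s)
\]
% A "deep reasoning" prompt s_r reweights the prior p(c | x, s_r) toward longer,
% structured chains learned from general-domain reasoning data, so transfer can
% help ultrasound tasks even without in-domain reasoning chains, provided the
% share of reasoning data in training is large enough for the prior to
% concentrate on useful chains rather than interfere with task-specific ones.
```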
Title: Dolphin: A multimodal large language model for Ultrasound Understanding
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper presents Dolphin, a multimodal large language model (MLLM) for ultrasound understanding with cross-domain reasoning transfer. It introduces the DUDP protocol to standardize multimodal ultrasound data and adopts a three-stage training process (post-training, instruction tuning, reinforcement learning). Experiments show that Dolphin significantly outperforms existing state-of-the-art models on the specialized ultrasound domain tasks.

Strengths:
- The paper is well motivated and explores an important clinical application area, with a clear and well-structured presentation.
- It proposes a practical and comprehensive domain adaptation framework that can be readily applied to heterogeneous ultrasound data in real-world settings.
- The experimental evaluation is comprehensive, covering multiple datasets and metrics, and includes medical expert validation to verify its reliability.

Weaknesses:
- The reported 0.8:0.2 ratio for mixed reasoning data is empirical and lacks a systematic ablation.
- Despite achieving strong results on accuracy-driven, annotated benchmarks, the paper lacks a systematic evaluation of Dolphin's text generation quality, specifically its clinical relevance and expert human assessment of report and chat outputs.
- While the paper states that Dolphin is capable of understanding videos, it lacks a detailed evaluation of Dolphin's performance on time-series ultrasound video data.

Questions:
- Is general reasoning data mixed in during post-training or added as a separate fine-tuning phase? (See the mixing sketch after this review.)
- Does the 0.8:0.2 ratio remain optimal across different models or modalities such as CT or MRI?
- How does Dolphin perform on other tasks in the general biomedical domain, such as text-only benchmarks like MedQA? Is there a degradation or improvement in performance compared to the base model?

EditLens Prediction: Fully human-written
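[Editor's note] Regarding the first two questions above, here is a minimal sketch of ratio-controlled mixing of task data with general reasoning data within a single training phase. The 0.8:0.2 split and the placeholder datasets come from the review's description, not from the paper's actual pipeline.

```python
# Minimal sketch of ratio-controlled mixing of two data sources in one phase.
# The 0.8:0.2 split mirrors the ratio the reviewer questions; an ablation would
# simply sweep `reasoning_ratio` over e.g. {0.0, 0.1, 0.2, 0.3, 0.5}.
import random

def mixed_batches(task_data, reasoning_data, reasoning_ratio=0.2, batch_size=32, seed=0):
    """Yield batches where each example is drawn from reasoning_data with
    probability `reasoning_ratio`, otherwise from task_data."""
    rng = random.Random(seed)
    while True:
        yield [
            rng.choice(reasoning_data) if rng.random() < reasoning_ratio
            else rng.choice(task_data)
            for _ in range(batch_size)
        ]

# Usage: draw one batch at the 0.8:0.2 mix the review refers to (placeholder data).
ultrasound = [{"task": "diagnosis", "id": i} for i in range(1000)]
general_cot = [{"task": "reasoning", "id": i} for i in range(250)]
first_batch = next(mixed_batches(ultrasound, general_cot, reasoning_ratio=0.2))
```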
Title: Dolphin: A multimodal large language model for Ultrasound Understanding
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper presents Dolphin, a VLM for ultrasound image understanding. The authors compile and standardize a vast ultrasound dataset for reinforcement learning purposes. The models, Dolphin-7B and Dolphin-72B, achieve state-of-the-art performance on the ultrasound benchmark U2-Bench.

Strengths:
- I recognize the engineering effort involved in compiling and processing vast ultrasound datasets. It would be very helpful for the community if the authors could release the standardized data publicly.
- Dolphin achieves very promising performance on the ultrasound benchmark while preserving good performance on general medical benchmarks.

Weaknesses:
- One of the main findings in the paper, "cross-domain reasoning universality", has already been systematically explored by previous work [1].
- Lack of methodological novelty: the training paradigm of SFT on distilled Chain-of-Thought data followed by GRPO-based RL is widely used. The proposed UARPO makes no difference to me from general GRPO (see the GRPO sketch after this review).
- The core contribution of this paper would be the processing and standardization of vast public ultrasound data, but key data processing details, such as source data and task distribution, are missing from the current paper.

[1] Liu, Qianchu, et al. "X-reasoner: Towards generalizable reasoning across modalities and domains." arXiv preprint arXiv:2505.03981 (2025).

Questions:
- Can you provide more data processing details? Where does your source data come from, and what is the distribution across each ultrasound task? Have you checked the quality of this public data? Will the processed data be made publicly accessible?
- Given that "Dolphin has been successfully deployed on real-world ultrasound devices," can you provide more details about how Dolphin is involved in the clinical process and the feedback from ultrasound experts when using it?

EditLens Prediction: Fully human-written
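[Editor's note] For readers unfamiliar with the comparison drawn in the second weakness, here is a minimal sketch of the group-relative advantage computation at the core of GRPO. It illustrates the baseline the reviewer refers to, not the paper's UARPO objective; the binary reward values are placeholders.

```python
# Minimal sketch of GRPO-style group-relative advantages: sample a group of
# responses per prompt, score them, and normalize rewards within the group.
# The rewards here are placeholders, not the paper's UARPO reward design.
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Advantage of each response = (reward - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: a group of 4 sampled answers to one ultrasound question,
# scored 1.0 if the diagnosis matches the label and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
# Correct answers receive positive advantages, incorrect ones negative; the
# policy is then updated with a clipped PPO-style objective weighted by these.
```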