ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction     Count     Avg Rating  Avg Confidence  Avg Length (chars)
Fully AI-generated      2 (50%)   4.00        3.50            2948
Heavily AI-edited       0 (0%)    N/A         N/A             N/A
Moderately AI-edited    0 (0%)    N/A         N/A             N/A
Lightly AI-edited       0 (0%)    N/A         N/A             N/A
Fully human-written     2 (50%)   4.00        3.50            3252
Total                   4 (100%)  4.00        3.50            3100
Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces Fleming-R1, a model post-trained for medical reasoning. The authors propose three strategies to boost performance: (1) a reasoning-oriented data strategy that synthesizes medical training data from knowledge graphs mined from Wikipedia and mixes the synthetic data with high-quality medical QA data; (2) a CoT cold start that distills and refines CoTs from powerful proprietary models for the synthetic data and applies SFT to Fleming-R1; (3) a two-stage RLVR + GRPO framework that trains Fleming-R1 in an iterative curriculum learning setting. The evaluation results show that Fleming-R1-7B outperforms much larger baselines, and Fleming-R1-32B achieves performance comparable to GPT-4o.

Strengths:
S1: The evaluation results seem strong, and Fleming-R1 could be a valuable contribution to the open-source community.
S2: The motivation that medical reasoning needs not only correct results but also faithful reasoning processes is strong.
S3: The training recipe of Fleming-R1 is interesting and comprehensive.

Weaknesses:
W1: How did you ensure the quality of the medical knowledge graph discovered by the autonomous agent in Section 3.1? Was there any manual verification by medical experts/practitioners?
W2: Can you give a real example from your synthesized questions showing that it actually asks for a diagnosis given symptoms and lab results (the example in Lines 223 to 226)?
W3: Generating medical QA based on masked subgraphs is more an enhanced way to memorize medical knowledge than a way to boost reasoning capability as described in Line 222 [see results in Physics of Language Models: Part 3.1, Knowledge Storage and Extraction].
W4: In Line 233, the filtering is based on whether GPT-4 can answer the question across five trials. However, even if an SOTA model answers the question correctly, it does not necessarily mean that the label is correct; the SOTA model could be wrong.
W5: The first half of Section 3.3 is merely an application of the existing GRPO algorithm. It would be more precise to frame it as a preliminary.
W6: In the second phase of the RL curriculum learning in Section 3.3, if the model repeatedly fails on some questions during RL training, it usually means that the model lacks the relevant knowledge from its pretraining. I doubt that repeated RL training on these failed questions, even with more rollouts, can actually help. Can you provide some examples to show that this setup actually works?
W7: There are not enough details on the training setup, e.g., learning rate, optimizer, size of training data at each stage, compute budget, etc.
W8: Since one of the key motivations described in the introduction is that medical tasks need not only a correct answer but also correct reasoning paths, can you provide any evaluation (including manual verification by medical practitioners/experts) of the correctness of the reasoning paths generated by Fleming-R1? It seems that most of the results reported in the experiment section are still based on the accuracy of the final answer.

Questions:
Q1: Can you provide more details on the de-identification process mentioned in Line 236?
Q2: Is there any data overlap between the training and testing phases of Fleming-R1 for MedQA, MedMCQA, and PubMedQA? Please specify this in the paper.
Q3: Can you show the scaling effect of your training process? For example, does including more of the synthetic data during SFT or running more iterations of RL training improve performance?
Q4: Can you show other ablation results, for example, applying RL stage 1 to the 7B model without the CoT cold start, and applying RL stage 2 to the 32B model without stage 1, to demonstrate whether this ordering of steps is truly necessary?

EditLens Prediction: Fully human-written
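[Editor's note] To make the five-trial filtering that W4 questions concrete, here is a minimal sketch of a verifier-consistency filter; the `query_verifier` call and the 3-of-5 agreement threshold are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter

def query_verifier(question: str, options: list[str]) -> str:
    """Hypothetical stand-in for an API call to a strong verifier model (e.g., GPT-4)."""
    raise NotImplementedError

def keep_item(question: str, options: list[str], reference: str,
              n_trials: int = 5, min_agree: int = 3) -> bool:
    # Query the verifier n_trials times and count how often it reproduces
    # the reference label; keep the item only on sufficient agreement.
    votes = Counter(query_verifier(question, options) for _ in range(n_trials))
    # As W4 points out, this conflates "the verifier agrees" with
    # "the label is correct" -- the verifier itself may be wrong.
    return votes[reference] >= min_agree
```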
Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
Fleming-R1 combines a reasoning-oriented data strategy (curated QA + Wikipedia graph–guided synthetic multi-hop items), a CoT cold start distilled from teacher trajectories, and two-stage GRPO-based RL from verifiable rewards (answer + format) with adaptive hard-sample mining. The 7B model outperforms larger open baselines; the 32B model approaches GPT-4o on MedXpertQA and leads most open-source peers across nine benchmarks, indicating parameter-efficient gains.

Strengths:
- Cohesive training pipeline: data design + CoT initialization + GRPO RL.
- Strong parameter efficiency (7B beats larger baselines; 32B near GPT-4o on MedXpertQA).
- Synthetic graph sampling increases long-tail and compositional coverage.
- Clear ablations showing stage-wise improvements.

Weaknesses:
- “Verifiable rewards” focus on final answer/format; no process-level reasoning rewards.
- Limited transparency on teacher CoT quality and synthetic graph correctness; potential bias/noise.
- No contamination/overlap audit across public benchmarks.
- Sparse sensitivity analyses (GRPO settings, curriculum, seeds) and no process-level evaluation.

Questions:
- Can you add process-level rewards (step consistency, guideline alignment) and report their impact?
- How are synthetic QA and graph relations validated; was there any clinician audit? What are the effects of reducing synthetic data?
- What deduplication/time-slicing controls prevent train–test leakage?
- Which teacher(s) and filtering criteria were used for CoT; are there ablations with/without iterative refinement?
- How sensitive are results to GRPO group size, rollout count, and curriculum settings?
- Can you provide a process-level reasoning assessment (LLM/human) on MedXpertQA and an error taxonomy?

EditLens Prediction: Fully AI-generated
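[Editor's note] To illustrate the outcome-only “verifiable rewards” this review contrasts with process-level rewards, here is a minimal sketch of an answer-plus-format reward and GRPO-style group-normalized advantages; the tag names, weights, and normalization constant are assumptions rather than the paper's settings.

```python
import re
import numpy as np

def verifiable_reward(completion: str, gold: str,
                      w_answer: float = 1.0, w_format: float = 0.1) -> float:
    # Format reward: the completion wraps reasoning and answer in tags.
    has_format = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                                completion, flags=re.S))
    # Answer reward: exact match between the extracted answer and the gold label.
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, flags=re.S)
    correct = m is not None and m.group(1).strip() == gold.strip()
    return w_answer * float(correct) + w_format * float(has_format)

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    # GRPO-style advantage: normalize each rollout's reward within its group.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```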
Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This manuscript presents Fleming-R1, a large language model (LLM) designed to achieve expert-level medical reasoning through reinforcement learning (RL), with a strong emphasis on verifiable and transparent inference chains rather than accuracy alone. The key innovations include: (1) the Reasoning-Oriented Data Strategy (RODS), which integrates curated question-answering datasets with knowledge-graph-guided synthesis from a Wikipedia-derived graph to enhance coverage of multi-hop reasoning and long-tail entities (e.g., rare diseases); (2) a Chain-of-Thought (CoT) Cold Start, distilling high-quality reasoning trajectories from teacher models (e.g., GPT-OSS-120B) with iterative refinement; and (3) Two-Stage Reinforcement Learning from Verifiable Rewards (RLVR) using Group Relative Policy Optimization (GRPO) for skill consolidation and adaptive hard-sample mining. The model is trained on Qwen2.5-7B and Qwen3-32B bases, with the 7B variant outperforming larger baselines on benchmarks such as MedXpertQA and the 32B variant approaching GPT-4o performance. The authors release the model openly to advance auditable medical AI.

Strengths:
- Parameter Efficiency and Scalability: Impressive results at small scales (e.g., the 7B model beating 72B baselines on multiple tasks) demonstrate efficiency.
- Strong Empirical Evaluation: Comprehensive benchmarks (e.g., MedXpertQA, MedQA, PubMedQA) with ablation studies clearly show gains from each component. The focus on expert-level reasoning (e.g., multi-hop inference on rare cases) and the open release promote reproducibility and community progress.
- Targeted Data Strategy: RODS effectively addresses gaps in multi-hop and rare-case coverage through knowledge-graph synthesis, a nuance less emphasized in comparable works.
- Comprehensive Evaluation: Ablation studies and comparisons across nine benchmarks robustly demonstrate component contributions, despite overlaps with prior work.

Weaknesses:
- Limited Novelty: The technical approach is highly similar to recent papers, such as Med-U1 (multi-objective RL for verifiable CoT) and AlphaMed (rule-based rewards for medical reasoning enhancement), as well as EHRMIND and related works. These have already established RL's effectiveness in clinical tasks (HuatuoGPT-o1). The contributions here seem confined to minor refinements, such as long-tail data augmentation via Wikipedia graphs, and lack transformative innovation.
- Reward Design Risks: While GRPO is effective, the rewards focus solely on correctness and format, potentially leading to superficial reasoning or reward hacking, which is underexplored compared to the multi-objective approaches in similar papers. Considering that the medical domain has a much lower tolerance for errors in chain-of-thought reasoning, this limitation becomes even more critical.
- Lack of Fine-Grained Ablations for Individual Components: Although the paper highlights the importance of long-tail knowledge in the medical domain and introduces a knowledge-graph-based approach (RODS) for synthesizing such data, it lacks detailed ablation studies demonstrating the specific contribution of this component. For instance, experiments on benchmarks that emphasize long-tail characteristics, accompanied by in-depth data analysis, or well-designed ablations isolating RODS’s synthesis strategies, would more convincingly validate its effectiveness. The absence of such targeted evaluations makes it difficult to quantify the true value of RODS and weakens the overall evidence for its modular contribution.

Questions:
- Given the similarities to works like Med-U1 and AlphaMed, how does the core innovation (e.g., RODS for long-tail synthesis) differentiate this paper? Have direct comparative experiments been conducted?
- Is the two-stage RLVR merely a fine-tuning of existing GRPO methods? Please provide additional evidence of its unique advantages in long-tail scenarios beyond general enhancements.
- How was the accuracy of the knowledge-graph synthetic data validated? What error rates were observed, particularly for rare disease entities, in light of Wikipedia's potential inaccuracies?
- If the technical pipeline is largely identical to prior art, how do the authors plan to strengthen novelty in revisions, such as through new benchmarks or clinician-involved studies?

EditLens Prediction: Fully AI-generated
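[Editor's note] As background for the questions about validating RODS's graph-derived data, here is a hypothetical sketch of masked-subgraph QA synthesis as the reviews describe it; the triples, templating, and single-entity masking are illustrative assumptions, not the paper's pipeline.

```python
import random

# Toy subgraph of (head, relation, tail) triples; the entities are illustrative only.
triples = [
    ("pheochromocytoma", "causes", "episodic hypertension"),
    ("pheochromocytoma", "diagnosed_by", "plasma metanephrines"),
    ("pheochromocytoma", "located_in", "adrenal medulla"),
]

def mask_subgraph_to_qa(subgraph):
    # Mask exactly one triple's tail entity and ask the model to recover it;
    # whether the paper uses a fixed masking ratio is one of the reviewers' open questions.
    h, r, t = random.choice(subgraph)
    context = [f"{a} {b.replace('_', ' ')} {c}"
               for a, b, c in subgraph if (a, b, c) != (h, r, t)]
    question = (f"Given: {'; '.join(context)}. "
                f"What does {h} relate to via '{r.replace('_', ' ')}'?")
    return question, t  # synthetic question and its masked answer

question, answer = mask_subgraph_to_qa(triples)
```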
Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper presents Fleming-R1, a large language model (LLM) fine-tuned for medical use. The authors curated a dataset using a reasoning-oriented data strategy (RODS), which includes both public medical QA datasets and synthetic data generated from a Wikipedia knowledge graph. This dataset is intended to be reasoning-intensive. They then use another, larger model to generate reasoning trajectories and filter them. These trajectories are distilled into their own model with supervised fine-tuning. Together with two-stage reinforcement learning with verifiable rewards (RLVR), Fleming-R1 achieves state-of-the-art results on many benchmarks, often surpassing larger-scale models.

Strengths:
1. A curated dataset with synthetic data is generated and released, which is shown to be very effective on evaluation benchmarks.
2. Fleming-R1 is trained with both a cold start and two-stage RLVR. The design of their recipe is not identical to a standard pipeline: e.g., it filters out low-quality reasoning trajectories during the cold-start stage. These designs and findings are valuable and could potentially be extended to non-medical domains.
3. The performance of Fleming-R1 is strong: it can match proprietary models such as GPT-4o and surpass larger-scale models.
4. The analyses are comprehensive, showing the effectiveness of all the components of the model.

Weaknesses:
1. *Limitation on Modality*: Fleming-R1 is a text-only model, which makes it unsuitable for many medical tasks. It might not be very difficult to migrate the majority of this pipeline to vision-language models such as the Qwen-2.5/3-VL series, as the tooling developed by the ML community (Verl/EasyR1, for instance) has matured on this front.
2. *Missing Details*: The paper misses some key details on the data construction pipeline and methodology. For example, how are the graphs masked for synthetic data generation? Is there an algorithm to decide what to mask, and is there a constant masking ratio? How do you use an LLM to classify the questions into Easy, Moderate, and Difficult? For the training, how did you refine the training distribution iteratively?
3. *Limitation on Problem Setup*: The model was trained intensively on question answering, which may make it specialized for QA but cause it to lose general capabilities. It would be interesting if the authors could demonstrate that the model can still be used as a general-purpose model for non-QA medical problems. As a reference, MedGemma mixed a lot of non-medical data and diverse medical datasets during training, and thus can still be used as a chatbot.

Questions:
Given that there is still page budget available, can you add more details on the pipeline, as raised in the second point of the weaknesses section? It would help the reproducibility of this paper.

EditLens Prediction: Fully human-written
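[Editor's note] Regarding this review's question about iteratively refining the training distribution, here is a minimal sketch of adaptive hard-sample mining under stated assumptions: the pass-rate threshold and oversampling factor are chosen for illustration and are not values from the paper.

```python
def mine_hard_samples(pass_rates: dict[str, float], max_pass_rate: float = 0.5) -> set[str]:
    # Keep question ids the current policy still fails often, estimated from
    # rollout pass rates observed in the previous RL stage.
    return {qid for qid, p in pass_rates.items() if p <= max_pass_rate}

def reweight_dataset(dataset: list[dict], hard_ids: set[str], boost: int = 3) -> list[dict]:
    # Oversample hard items for the next RL stage; other items appear once.
    out: list[dict] = []
    for item in dataset:
        copies = boost if item["id"] in hard_ids else 1
        out.extend([item] * copies)
    return out

# Example: questions with a pass rate of 0.5 or below are boosted 3x.
hard = mine_hard_samples({"q1": 0.9, "q2": 0.2, "q3": 0.0})
stage2_data = reweight_dataset([{"id": "q1"}, {"id": "q2"}, {"id": "q3"}], hard)
```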