ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 1 (25%)  | 6.00       | 5.00           | 4735               |
| Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A                |
| Lightly AI-edited    | 1 (25%)  | 6.00       | 3.00           | 1793               |
| Fully human-written  | 2 (50%)  | 3.00       | 3.50           | 3844               |
| Total                | 4 (100%) | 4.50       | 3.75           | 3554               |
Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes using a multi-modal large language model as a backbone for training a joint image-text discrete diffusion model in the medical domain. The authors train their model on an X-ray and a pathology dataset and perform image, text, and joint image-text generation experiments to validate the performance of the trained model.

Strengths:
- Using a multi-modal LLM (MLLM) as a backbone for training a joint image-text discrete diffusion model for the medical domain is a novel idea and could have interesting implications for future development of foundation models in the medical domain.
- The text generation results show significant improvements over the MLLM baseline.
- The joint generation results indicate that the model can serve as a strong synthetic data generator for image and text in the medical domain.

Weaknesses:
- The idea of training a single discrete diffusion model for text and image generation is not new, as it has already been discussed in [1] and [2]. The authors' two contributions to make the model work with an MLLM backbone are (1) causal attention removal and (2) injecting continuous timestep embeddings, both straightforward modifications of the transformer network. Therefore, the overall contribution of the paper is limited to showing that a pre-trained MLLM can serve as the backbone for the generative model.
- In line 124, the authors say that "*Our experiments demonstrate that MeDiM can function as a versatile foundation model*". However, the model is trained on ~1M samples from just two sources, which does not support the claim of having trained a foundation model. There are unanswered questions regarding the scaling ability of the proposed model and how well it could cover many different cancer/organ types. I would suggest scaling back the foundation model claim.
- The models used for measuring image quality on the pathology image generation task are significantly worse than the state-of-the-art. All baselines reported in the paper have FID scores >50, with the proposed model achieving a score of 24. However, looking at a recent generative model trained exclusively on pathology images [3], the reported FID on similar datasets is <10. This raises the question of whether the baselines used for image generation are valid for comparing the proposed model to the state-of-the-art.
- The paper would greatly benefit from comparisons on out-of-distribution datasets, at least for some of the tasks. It is unclear whether the improvements shown stem from fine-tuning the model on the two specific datasets on which the model is also tested, or from learning to jointly generate images and text reports.

[1] Hu, Minghui, et al. "Unified Discrete Diffusion for Simultaneous Vision-Language Generation." ICLR 2023.
[2] Swerdlow, Alexander, et al. "Unified Multimodal Discrete Diffusion." arXiv preprint, 2025.
[3] Yellapragada, Srikar, et al. "PixCell: A Generative Foundation Model for Digital Histopathology Images." arXiv preprint, 2025.

Questions:
- Would a pathology-specific generative image model achieve a lower FID than the proposed model? The paper requires a strong baseline to show how the unified training improves or hurts the performance of the model on image generation.
- Are the improvements in text generation over MED baselines because of the joint image-text generation training or because of fine-tuning the proposed model on the MIMIC-CXR and PathGen datasets? What do the results look like if you fine-tune the baselines on these datasets for a similar number of iterations?

EditLens Prediction: Fully human-written
Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes MeDiM, the first medical discrete diffusion model that integrates multimodal large language models (MLLMs) for unified medical generation. The framework supports three distinct tasks in a single model. To adapt MLLMs, which are trained with causal masking, to the bidirectional nature of diffusion, the authors introduce two key modifications: (a) removing causal attention masks for cross-modal alignment; (b) injecting continuous timestep embeddings for diffusion awareness (an illustrative sketch of these adaptations follows this review). Empirical results on MIMIC-CXR and PathGen demonstrate SOTA performance.

Strengths:
(1) First demonstration of integrating an MLLM within discrete diffusion for multimodal medical tasks.
(2) Promising results suggesting potential as a foundation framework for medical multimodal learning.

Weaknesses:
(1) Lack of training efficiency and scalability analysis.
(2) Evaluation limited to two datasets; generalization to other medical modalities is untested.
(3) No uncertainty or robustness analysis (e.g., multiple seeds, out-of-domain data).
(4) It would be better to analyze the relationship to prior unified multimodal models (e.g., MMaDA, UniDisc) more deeply on a conceptual level.
(5) The work feels slightly incremental relative to recent unified multimodal diffusion efforts outside the medical domain. It would be better to show quantitative evidence of consistently outperforming non-diffusion multimodal models.

Questions:
(1) What is the computational cost and model scale (parameters, GPU hours) required to train MeDiM?
(2) How does MeDiM perform on out-of-domain data (e.g., unseen imaging modalities)?
(3) Are there any failure modes observed in paired generation (semantic drift, hallucination, etc.)? How could MeDiM alleviate them compared to previous SOTA baselines?

EditLens Prediction: Lightly AI-edited
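The two backbone adaptations this review refers to (dropping the causal attention mask and injecting a continuous timestep signal, e.g., via AdaLN-style modulation) can be made concrete with a minimal sketch. This is an illustrative assumption of how such a block might look, not the authors' code; the module names, dimensions, and the specific modulation scheme are hypothetical.

```python
# Minimal sketch (not the authors' code): an autoregressive transformer block
# adapted for bidirectional discrete diffusion, i.e. (a) no causal mask and
# (b) a continuous timestep signal injected via AdaLN-style modulation.
import math
import torch
import torch.nn as nn


def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of a continuous diffusion time t in [0, 1]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    angles = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


class BidirectionalDiffusionBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        # AdaLN: the timestep embedding predicts per-layer scales and shifts.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(d_model, 4 * d_model))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings; t: (batch,) timesteps.
        emb = timestep_embedding(t, x.size(-1))
        scale1, shift1, scale2, shift2 = self.ada(emb).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + scale1[:, None]) + shift1[:, None]
        # attn_mask=None: the causal mask an AR LLM would normally pass is
        # dropped, so every token attends to every other token (bidirectional).
        h, _ = self.attn(h, h, h, attn_mask=None, need_weights=False)
        x = x + h
        h = self.norm2(x) * (1 + scale2[:, None]) + shift2[:, None]
        return x + self.mlp(h)
```

Under this reading, blocks like this would replace the MLLM's causally masked layers while reusing the pre-trained weights for initialization; the sketch only illustrates where the two modifications enter the computation.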
Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes MeDiM, a discrete diffusion framework that uses an MLLM backbone to jointly model medical images and clinical reports, supporting (i) report-to-image, (ii) image-to-report, and (iii) paired image-report generation. Key adaptations include removing the causal mask, injecting timestep embeddings, and AdaLN modulation to make an autoregressive MLLM usable as a bidirectional denoiser. On MIMIC-CXR and PathGen, the paper reports strong FID/METEOR and claims downstream gains when augmenting data with generated pairs.

Strengths:
1. **Unified formulation** for three medical generation tasks in one model; paired generation is compelling for clinical AI agents.
2. **Clear architectural adaptations** (mask removal + time embeddings + AdaLN) that make practical sense for discrete diffusion over token sequences.
3. **Ablations** showing that MLLM backbones are advantageous in this setting.
4. **Downstream evaluation** attempts (training R2Gen on real + synthetic data) are a good step toward utility, not just perceptual scores.

Weaknesses:
1. The evaluation protocol is not aligned with medical best practices.
   i) FID/IS with the natural-image Inception network are weak fidelity surrogates for chest X-rays and histopathology; domain encoders (e.g., pathology CLIP-like encoders [4]) or clinical labelers (CheXbert [1], RadGraph [2], GREEN scores [3]) are standard (a minimal sketch of FID with a domain-specific encoder is given after this review). The paper relies on FID/IS and generic NLG metrics (BLEU/METEOR/ROUGE), which miss clinical correctness.
   ii) LLM-as-judge (Qwen2-VL) for alignment is risky in clinical domains. The setup, prompts, and scales are not specified in enough detail to assess validity, and Qwen2-VL is not clinically validated.
   iii) The human study is small (n=100 pairs) and lacks reporting on annotator expertise (domain experts vs. general), disagreement handling, and significance tests.
2. Concerns with competing methods: radiology report generation should be compared against recent MLLM radiology methods (e.g., LLaVA-Rad [5]) beyond classic R2Gen variants. Similarly, (non-unified) T2I baselines for histopathology image synthesis (e.g., [6][7][8]) warrant inclusion or discussion.
3. Some of the modifications are incremental and anticipated when porting diffusion to AR MLLMs. The novelty is moderate and hinges on the medical instantiation and paired synthesis rather than fundamentally new learning objectives or theory.
4. The work highlights SoTA in places, yet pathology report generation is not SoTA on most metrics (only 1/5 wins), and the discrepancy isn't analyzed; this weakens the "unified foundation" claim.
5. Minor issues: in Fig. 4a, the last column (Med-Art for radiology) is incorrect; in Fig. 4c, the first row does not include the colors in the prompt.

[1] Smit et al., "CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling," Findings of EMNLP 2020.
[2] Jain et al., "RadGraph: Extracting Clinical Entities and Relations from Radiology Reports," arXiv 2021.
[3] Ostmeier et al., "GREEN: Generative Radiology Report Evaluation and Error Notation," Findings of EMNLP 2024.
[4] Huang et al., "PLIP: A Visual-Language Foundation Model for Pathology Image Analysis," Nature Medicine 2023.
[5] Zambrano Chaves et al., "LLaVA-Rad: A Clinically Accessible Small Multimodal Radiology Model," Nature Communications 2025.
[6] Yellapragada et al., "PathLDM: Text-Conditioned Latent Diffusion Model for Histopathology," WACV 2024.
[7] Aversa et al., "DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch Diffusion in Histopathology," NeurIPS 2023 D&B.
[8] Graikos et al., "Learned Representation-Guided Diffusion Models for Large-Image Generation," CVPR 2024.

Questions:
Q.1. Can the authors adopt a justifiable in-domain evaluation protocol to better establish the reliability of the reported results when in-domain backbones are not used? Please see Weakness 1 for details.
Q.2. Can non-unified report and image generation methods be included in the comparison? This would help distinguish the real gains of the proposed unified framework.
Q.3. Can you clarify the failure cases in the downstream tasks and in pathology report generation?

EditLens Prediction: Fully human-written
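To make the in-domain evaluation suggestion (Weakness 1.i) concrete, the following is a minimal sketch of a Fréchet distance computed on features from a domain-specific encoder rather than the natural-image Inception network. It is not the paper's evaluation code; `encode` is a hypothetical stand-in for whatever feature extractor is chosen (e.g., the vision tower of a pathology CLIP-like model), mapping a batch of images to an (N, D) feature array.

```python
# Minimal sketch (illustrative, not the paper's protocol): Frechet distance
# between real and generated images using features from a domain encoder.
import numpy as np
from scipy import linalg


def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet distance between two Gaussians fitted to (N, D) feature arrays."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f) - 2.0 * np.trace(covmean))


def domain_fid(real_images, fake_images, encode) -> float:
    # `encode` is a hypothetical callable: images -> (N, D) numpy features.
    # In practice N should be large relative to D (or a small ridge added to the
    # covariances) so the fitted Gaussians are well conditioned.
    return frechet_distance(encode(real_images), encode(fake_images))
```

The same score computed with different encoders is not comparable across papers, which is part of the reviewer's point: reporting which feature space an FID was computed in matters for medical images.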
Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces MeDiM, the first unified medical discrete diffusion model designed to overcome the fragmentation of current medical AI by jointly modeling images and text. The core innovation is using a pre-trained Multimodal Large Language Model (MLLM) as the diffusion backbone. To adapt the MLLM from its autoregressive (causal) nature to the bidirectional needs of diffusion, the authors remove the causal attention mask and inject timestep embeddings. This single framework can perform text-to-image generation, image-to-report generation, and joint image-report pair generation, achieving state-of-the-art results by a wide margin, such as a 16.60 FID on MIMIC-CXR for image generation.

Strengths:
1. The model achieves SOTA results by a very large margin, not an incremental one. In Table 1, MeDiM's 16.60 FID on MIMIC-CXR is dramatically better than the next-best baseline (Med-Art's 78.97). This represents a step-change in quality for this task.
2. The central idea of adapting an off-the-shelf autoregressive MLLM (Liquid) into a bidirectional diffusion denoiser is highly novel. The solution is simple (remove mask, add time embeddings), but its effectiveness is the key scientific finding, proving the power of MLLM priors for diffusion.
3. Table 4 in the appendix is a model of a good ablation study. It provides definitive proof that the model's success is not just from using a big model, but from the specific combination of all three proposed components. The "w/ causal mask" ablation, which sees performance completely collapse (mFID 20.40 $\rightarrow$ 143.72), is a critical result that validates the entire premise.
4. The model is not just multi-task; it is a truly unified framework. The ability to jointly generate image-report pairs (Task 3) and then use that synthetic data to improve downstream models (Fig. 5c) is a powerful demonstration of a generative foundation model for medicine.
5. Strong baselines: the paper compares against a very strong and recent set of baselines, including models from 2024 and 2025 (e.g., MMaDA, Liquid, UniDisc, Med-Art, U-KAN), making its SOTA claims credible.

Weaknesses:
1. The paper's backbone is a Transformer-based MLLM (Liquid). While this is shown to be superior to a DiT backbone, the paper does not engage with the newest class of sequence models, State-Space Models (SSMs) like Mamba. It remains an open question whether an SSM-based MLLM would be an even better or more efficient backbone for this diffusion task.
2. The entire method operates in a discrete token space, which fully depends on a high-quality VQ-VAE tokenizer (a minimal sketch of this token-space setup is given after this review). The paper uses one from "Chameleon". The quality of this tokenizer is a critical "hidden variable" that is not ablated. A poor VQ-VAE would likely cripple the model, and it is unclear how much of the visual fidelity is owed to this powerful VQGAN versus the diffusion model itself.
3. The "downstream" validation (Fig. 5b/c) is limited to one task (report generation) and one baseline (R2Gen). While promising, the claim of "improving downstream performance" would be more convincing if the synthetic data were shown to improve a wider range of tasks (e.g., VQA, segmentation) or more SOTA medical VLMs.

Questions:
1. The performance jump from all other baselines is massive (e.g., 16.60 FID vs. 78.97 in Table 1). Is this gain solely from the model architecture, or is there a significant difference in training data or compute? Specifically, were the MLLM baselines (Liquid, MMaDA) fine-tuned on the medical datasets for a comparable number of steps (1M) as MeDiM?
2. Given the extraordinary FID score of 16.60, which far surpasses all baselines, could the authors provide a much larger, uncurated set of generated images in their supplementary material? This would help reviewers validate that these strong quantitative results correspond to consistent, high-fidelity, and semantically aligned image generation, which standard metrics cannot fully capture.
3. The choice of a Transformer-based MLLM backbone (Liquid) is well justified against DiT. However, have you considered alternative sequence-model architectures, such as State-Space Models (Mamba), which are showing great promise for long-sequence modeling and efficiency, as potential backbones?
4. The framework's success depends on a high-fidelity image tokenizer. How sensitive is MeDiM's performance to the quality of the VQGAN? Have you experimented with other tokenizers, and how much of the SOTA image quality is attributable to the VQGAN from Chameleon versus the diffusion process itself?
5. For the downstream task evaluation, the generated pairs significantly boost R2Gen (a 2020/2021-era model). Have you tested whether this synthetic data can also boost the performance of a more recent, SOTA medical VLM?

EditLens Prediction: Fully AI-generated
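The discrete token space this review refers to (Weakness 2) can be illustrated with a short sketch of an absorbing-state ("mask") forward corruption over a joint image-report token sequence. This is not MeDiM's implementation; the vocabulary layout, codebook size, sequence lengths, and mask id are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not MeDiM's code): images are VQ-encoded into
# code indices, concatenated with report token ids, and the forward diffusion
# step replaces a time-dependent fraction of tokens with a [MASK] id.
# A denoiser is then trained to predict the original tokens from the corrupted ones.
import torch

MASK_ID = 0          # assumed reserved id for the absorbing [MASK] state
VOCAB_SIZE = 73_728  # assumed joint vocabulary: text tokens + VQ codebook + specials


def forward_mask(tokens: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Corrupt a (batch, seq_len) token sequence at continuous time t in (0, 1].

    Each token is independently replaced by MASK_ID with probability t, so
    t = 1 yields a fully masked sequence and t -> 0 leaves it nearly intact.
    """
    replace = torch.rand_like(tokens, dtype=torch.float) < t[:, None]
    return torch.where(replace, torch.full_like(tokens, MASK_ID), tokens)


# Joint image-report sequence: VQ image codes followed by report token ids.
image_codes = torch.randint(1, 8192, (2, 256))       # e.g., a 16x16 VQ grid per image
report_ids = torch.randint(8192, VOCAB_SIZE, (2, 128))
x0 = torch.cat([image_codes, report_ids], dim=1)

t = torch.rand(2)                 # one timestep per sample
xt = forward_mask(x0, t)          # denoiser input; the training target is x0
```

Because everything the model sees and predicts lives in this index space, reconstruction quality is bounded by how faithfully the VQ tokenizer decodes those indices back to pixels, which is exactly why the reviewer asks for the tokenizer to be ablated.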