|
Modular Multimodal Alignment using Time-Series EHR Data for Enhancing Medical Image Classification |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 0:
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes MedCAM, a method that integrates time-series EHR data to improve medical image classification. A simple batch-wise alignment scheme is used to align EHR and CXR features.
- The problem of integrating multimodal data that this paper tries to tackle is significant.
- I suspect this paper was generated by an LLM. The smoking gun: I can easily identify multiple non-existent references cited:
- Qingsong Yao, Xiaoguang Ye, Shuo Wang, Yunfan Xue, Le Hu, Hanqing Wang, and Lin Shen. Drfuse: Multimodal fusion of electronic health records and chest x-rays for covid-19 outcome prediction. IEEE Journal of Biomedical and Health Informatics, 28(3):1327–1338, 2024b.
- Yuhao Zhang, Ruibo Fu, Nishal Shah, Maya Varma, Chao Xiao, Corey W Arnold, Christopher D Manning, and Curtis P Langlotz. Refers: Radiology report findings extraction and representation system. Nature Communications, 14(1):3067, 2023a.
- Ziling Zhang, Ruizhi Wu, Yongzhi Wang, Yinghao Wang, Chao Xiao, and David Sontag. Lmfusion: Efficient multimodal adaptation with language models for medical applications. arXiv preprint arXiv:2412.15188, 2023b.
- Jingfeng Yao, Yifan Wang, Yuxuan Li, Yifan Zhang, Yizhou Wang, Yutong Zhang, Xiaoqian Zhang, Xinyu Li, Yuxuan Chen, Jing Zhang, et al. Eva-x: A foundation model for general chest x-ray analysis with self-supervised learning. arXiv preprint arXiv:2405.05237, 2024a. URL https://arxiv.org/abs/2405.05237. (wrong author list)
- Clinical multimodal data fusion has been studied intensively over the past three years, yet this paper does not compare with any of the existing approaches.
- The proposed feature alignment based on softmax and cosine similarity oversimplifies the complex relationship between EHR and CXR data, making the method technically unsound (a sketch of such a batch-wise objective is given after the questions).
- How does the proposed method compare against existing multimodal fusion methods, such as MedFuse and DrFuse? |
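For reference, the batch-wise softmax/cosine-similarity alignment criticized above is presumably of the standard CLIP/InfoNCE form. The sketch below is an assumption about that general objective, not the authors' actual code; all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def batchwise_alignment_loss(img_emb, ehr_emb, temperature=0.07):
    """CLIP-style batch-wise alignment: cosine similarity + softmax over
    in-batch negatives. img_emb, ehr_emb: (B, D) outputs of the image and
    EHR encoders (illustrative names, not the paper's implementation)."""
    img = F.normalize(img_emb, dim=-1)    # unit norm -> dot product = cosine similarity
    ehr = F.normalize(ehr_emb, dim=-1)
    logits = img @ ehr.t() / temperature  # (B, B) pairwise similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy: each image should match its paired EHR and vice versa
    loss_i2e = F.cross_entropy(logits, targets)
    loss_e2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2e + loss_e2i)
```

An objective of this form treats exactly one EHR sequence per image as the positive within a batch, which is precisely where the one-to-many, temporally diffuse relationship between EHR and CXR gets flattened.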
Fully human-written |
|
Modular Multimodal Alignment using Time-Series EHR Data for Enhancing Medical Image Classification |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 0:
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
In my assessment, this submission appears to be substantially generated by LLMs. The direct evidence comes from **fabricated references**, for example:
1. Chaoyi Wu, Xiaoman Lin, Peixin Cao, Qiushi Wang, Weixiong Yu, Yan Xiang, Chunping Qu, Xiao Wang, Zhiqiang Liu, Xiangbo Meng, et al. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. arXiv preprint arXiv:2301.02228, 2023.
should be:
Wu, C., Zhang, X., Zhang, Y., Wang, Y. and Xie, W., 2023. MedKLIP: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. In *Proceedings of the IEEE/CVF international conference on computer vision* (pp. 21372-21383).
2. Hong-Yu Zhou, Chenyu Chen, Cheng Qian, Yongwei Gao, Leilei Li, Ping Fan, Hao Fan, Kang Song, Xiaoguang Ye, Shuo Wang, et al. Advancing radiograph representation learning with masked record modeling. Nature Machine Intelligence, 5(9):957–969, 2023.
should be:
Zhou, H.Y., Lian, C., Wang, L. and Yu, Y., Advancing Radiograph Representation Learning with Masked Record Modeling. In *The Eleventh International Conference on Learning Representations*.
3. Ziling Zhang, Ruizhi Wu, Yongzhi Wang, Yinghao Wang, Chao Xiao, and David Sontag. Lmfusion: Efficient multimodal adaptation with language models for medical applications. arXiv preprint arXiv:2412.15188, 2023b.
is fabricated.
Given these clear inaccuracies and indications of automated content generation, I believe **the manuscript does not meet the integrity standards for scholarly submission**. With due respect to the review process, I recommend desk rejection to conserve reviewer time.
In light of the apparent misconduct, I will not provide further comments on the technical quality of the paper.
Please refer to 'Summary'.
Please refer to 'Summary'.
Please refer to 'Summary'. |
Fully AI-generated |
|
Modular Multimodal Alignment using Time-Series EHR Data for Enhancing Medical Image Classification |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a framework to align X-ray data with EHR. Each modality has its own encoder, a global + local alignment method is applied, and the model is then fine-tuned on target tasks. Datasets without EHR are used for out-of-distribution evaluation.
- Work on EHR alignment for medical image pretraining is scarce, so the motivation is strong.
- Beyond accuracy, the authors pay attention to computational cost, practicality for clinical deployment, and effective use of existing models.
- The paper never rigorously enforces causal ordering between EHR and CXR timestamps. The temporal scope of EHR inputs is vaguely defined (“resampled at 2-hour intervals”) without specifying whether observations recorded after the imaging timestamp were excluded. In MIMIC-IV, many physiological measurements (e.g., labs, vitals) are backfilled retrospectively, meaning the EHR embedding could inadvertently encode post-image outcomes such as interventions or ICU status changes. This introduces temporal leakage and undermines the causal interpretability of the learned representations.
- It is still somewhat unclear how the alignment actually helps; an ablation study (e.g., cross-modal retrieval accuracy, embedding-similarity metrics, or attention interpretability) would make this concrete. Further, the authors claim the adaptation captures “fine-grained interactions,” yet no visualization or ablation supports this. There is also no control experiment with random EHR pairs or shuffled time windows, so it is unclear whether the model learns true cross-modal structure or merely benefits from regularization (see the sketch after the questions below).
- Why the method works is not studied systematically; the paper offers only empirical results, and the dual (global/local) contrastive loss is not theoretically innovative. A negative-control study or failure-case analysis, e.g., shuffling EHR–CXR pairs or temporally perturbing EHR sequences, is missing and would help in understanding the method.
- For the EHR modality, how do you ensure that the sequences used for alignment include only data recorded prior to or concurrent with the CXR acquisition time?
- Is there quantitative evidence of what the global and local alignment components each contribute individually? |
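To make the temporal-leakage and negative-control concerns concrete, here is a minimal sketch assuming pandas dataframes with hypothetical column names (`charttime`, `ehr_stay_id`), not the authors' actual pipeline:

```python
import numpy as np
import pandas as pd

def restrict_to_pre_image(ehr_events: pd.DataFrame, cxr_time: pd.Timestamp) -> pd.DataFrame:
    """Keep only EHR events charted at or before the CXR acquisition time,
    so the EHR embedding cannot encode post-image interventions or outcomes.
    Assumes a 'charttime' datetime column; column names are illustrative."""
    return ehr_events[ehr_events["charttime"] <= cxr_time]

def shuffled_pair_control(pairs: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Negative control: break the EHR-CXR correspondence by permuting the
    EHR side across patients. If alignment gains persist on shuffled pairs,
    the benefit is likely regularization rather than cross-modal structure."""
    rng = np.random.default_rng(seed)
    shuffled = pairs.copy()
    shuffled["ehr_stay_id"] = rng.permutation(shuffled["ehr_stay_id"].to_numpy())
    return shuffled
```

Reporting downstream performance after applying `shuffled_pair_control` would directly test whether the alignment exploits genuine EHR–CXR correspondence.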
Fully human-written |
|
Modular Multimodal Alignment using Time-Series EHR Data for Enhancing Medical Image Classification |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper extends existing medical CLIP approaches—originally trained on image–text pairs—to image–EHR pairs, investigating how replacing textual reports with EHR data as a semi-supervised signal affects the performance of chest X-ray image encoders. However, the methodology lacks significant novelty beyond the dataset modification, the experimental evaluation is incomplete, and the writing and formatting are unclear.
1. The paper provides a thoughtful analysis of prior medical CLIP methods that rely on image–text pairs and articulates the potential advantages of incorporating EHR information during training.
2. It integrates the MIMIC-CXR and MIMIC-IV datasets to construct a new multimodal benchmark.
**I. Writing and Presentation**
1. The manuscript includes no figure illustrating the model architecture or overall framework.
2. The experimental section is poorly organized. Section 5.1 is titled “Evaluation on the MIMIC-CXR Dataset,” yet Table 2 does not include results for image–text baselines on MIMIC-CXR; those results are instead presented in Table 4. This disjointed presentation hinders direct comparison and impairs readability.
3. Figure 1 is the only figure in the paper, which is of low quality, difficult to interpret, and conveys minimal information.
**II. Methodology and Experiments**
1. Aside from constructing a new dataset and substituting EHR for textual reports, the paper introduces no clear methodological innovation. The combination of global and local alignment strategies has already been thoroughly explored in prior works such as MGCA[1], GLORIA[2], and Prior[3].
2. The comparison with image–text models is methodologically inconsistent. The visual encoders of the baseline image–text models are trained from scratch, whereas MedCAM employs a pre-trained encoder. This discrepancy likely explains the observed performance gap: MedCAM significantly outperforms baselines on MIMIC-CXR and NIH but underperforms on CheXpert and VinDr.
3. The selected image–text baselines are outdated. The most recent cited work is MedCLIP (EMNLP 2022). At a minimum, the paper should include 2–3 stronger, more recent baselines from 2024–2025.
4. Recent studies such as MEDBind [4] and M3Bind [5] have already explored multimodal alignment involving more than two modalities (e.g., image, text, and EHR simultaneously). The choice to replace text with EHR—rather than integrating all available modalities—requires justification.
5. The paper lacks any form of qualitative or representational visualization (e.g., t-SNE plots), which would help assess the quality of learned embeddings.
1. Reorganize the manuscript’s writing and layout, particularly the design and placement of figures and tables, to enhance clarity and facilitate comparison.
2. Clearly articulate the novelty of the proposed approach beyond dataset construction.
3. Address the concern raised in Weakness II.2 regarding the unfair comparison due to differing training protocols (from-scratch vs. pre-trained encoders).
4. Include more recent and competitive baselines (e.g., BiomedCLIP, M3Bind, or other 2024–2025 medical multimodal models) in the experimental comparison.
5. Justify the design choice of replacing textual reports with EHR instead of jointly leveraging image, text, and EHR in a unified framework.
6. Add visualization experiments (e.g., t-SNE, attention maps, or embedding space analyses) to support the claims and improve interpretability.
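For suggestion 6, a minimal t-SNE sketch over the learned image embeddings (scikit-learn and matplotlib; file paths and array names are placeholders) would already be informative:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# embeddings: (N, D) array of image features from the trained encoder
# labels:     (N,) integer disease labels, used only to color the plot
embeddings = np.load("image_embeddings.npy")  # placeholder path
labels = np.load("labels.npy")                # placeholder path

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(embeddings)

plt.figure(figsize=(6, 6))
scatter = plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
plt.legend(*scatter.legend_elements(), title="class", loc="best")
plt.title("t-SNE of learned image embeddings")
plt.savefig("tsne_embeddings.png", dpi=200)
```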
[1] Wang, F., Zhou, Y., Wang, S., Vardhanabhuti, V., & Yu, L. (2022). Multi-granularity cross-modal alignment for generalized medical visual representation learning. *Advances in neural information processing systems*, *35*, 33536-33549.
[2] Huang, S. C., Shen, L., Lungren, M. P., & Yeung, S. (2021). Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In *Proceedings of the IEEE/CVF international conference on computer vision* (pp. 3942-3951).
[3] Cheng, P., Lin, L., Lyu, J., Huang, Y., Luo, W., & Tang, X. (2023). Prior: Prototype representation joint learning from medical images and reports. In *Proceedings of the IEEE/CVF international conference on computer vision* (pp. 21361-21371).
[4] Gao, Y., Kim, S., Austin, D. E., & McIntosh, C. (2024, October). MEDBind: Unifying Language and Multimodal Medical Data Embeddings. In *International Conference on Medical Image Computing and Computer-Assisted Intervention* (pp. 218-228). Cham: Springer Nature Switzerland.
[5] Liu, Y., Xi, S., Liu, S., Ding, H., Jin, C., Zhong, C., ... & Shen, Y. (2025). Multimodal Medical Image Binding via Shared Text Embeddings. *arXiv preprint arXiv:2506.18072*. |
Lightly AI-edited |