TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper introduces TEMMED-BENCH, a benchmark for assessing temporal medical image reasoning in LVLMs. Each test item pairs a historical and a current chest radiograph from the same patient and evaluates models on three tasks: binary VQA about condition change, change-focused report generation, and an image-pair selection task where the model must pick the pair that matches a specified change statement. The paper also proposes pairwise image retrieval for multi-modal RAG, which retrieves image pairs (and their reports) whose historical/current images are jointly similar to the query pair. The results show that several LVLMs benefit more from multi-modal RAG than from text-only RAG. Overall, most LVLMs perform near chance on several tasks; reasoning-oriented proprietary models fare best, but none are strong yet at temporal change analysis.
- The motivation is crisp and clinically grounded: real radiology practice is often longitudinal, while most benchmarks are single-visit.
- The proposed image-pair selection task is novel to me. It is also more vision-centric than typical multiple-choice VQA and stresses multi-image reasoning.
- The paper is clear, with concrete task definitions, corpus statistics, and straightforward evaluation protocols.
**1. Some claims need clarification and discussion**
- Please check the claim that this is “the first benchmark that focuses on evaluating the temporal reasoning ability of LVLMs”: existing works (e.g., MedFrameQA [1], MMXU [2], MedMIM [3], ICG-CXR [4]) also target temporal evaluation by gathering longitudinal studies from a pool of imaging studies. The authors should discuss how their work differs from those.
[1] MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning (arXiv, May 2025)
[2] MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression (ACL 2025)
[3] Medical Large Vision Language Models with Multi-Image Visual Ability (MICCAI 2025)
[4] Towards interpretable counterfactual generation via multimodal autoregression (MICCAI 2025)
- In Line 201, the authors state, “We randomly selected 1,000 instances as the test set and used the remaining instances as the knowledge corpus.” It is unclear to me whether the split is based on patients or on instances (images from the same patient can appear in multiple instances). If patient overlap exists between the test set and the knowledge corpus, the retrieved instances may include near-duplicate visits from the same patient, and the RAG gains may be inflated. I suggest the authors check the split and re-evaluate the results if necessary.
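To make the suggestion concrete, here is a minimal sketch of the overlap check I have in mind (file and column names are hypothetical and would need to be adapted to the benchmark's actual metadata):

```python
import pandas as pd

# Hypothetical metadata files; adapt paths and column names to the released benchmark.
test = pd.read_csv("temmed_test.csv")      # 1,000 test instances
corpus = pd.read_csv("temmed_corpus.csv")  # ~17k knowledge-corpus instances

test_patients = set(test["patient_id"])
corpus_patients = set(corpus["patient_id"])

overlap = test_patients & corpus_patients
print(f"Patients shared between test set and corpus: {len(overlap)}")
print(f"Fraction of test patients also in the corpus: {len(overlap) / len(test_patients):.2%}")
```

If the overlap is non-trivial, re-running the RAG experiments with a patient-level split (or excluding same-patient instances at retrieval time) would show how much of the reported gain survives.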
**2. Some important details are missing**
- During dataset curation, when a patient has multiple longitudinal images, how is each image pair chosen? Are the pairs drawn from consecutive examinations (e.g., study1-study2, study3-study4), from non-consecutive examinations (e.g., study1-study3, study2-study4), or from both?
- What is the distribution of time gaps between historical/current images, and how do performance and error types vary by short vs. long intervals?
- While the image-pair selection task is novel to me, I am curious how existing LVLMs handle it, since they may not natively support this task. Specifically, how do the authors adapt the LVLMs so they can perceive the order of the input image pairs (“A”, “B”, “C”)? Are the tokens of the different image pairs simply concatenated? Does that require an extra level of positional encoding for pair order?
- In Lines 223-226, could the authors provide the statistics that lead to this observation (e.g., how many reports match the regular expression)? This would benefit follow-up research on CXR report analysis.
- Which image encoders are used in the pairwise retrieval score?
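For reference, my reading of the pairwise retrieval score is roughly the sketch below; the choice of encoder and the equal weighting of the two similarities are my assumptions, hence the question:

```python
import torch.nn.functional as F

def pairwise_retrieval_score(query_hist, query_curr, cand_hist, cand_curr, encoder):
    """Score a candidate image pair against the query pair by combining the
    cosine similarity of the historical embeddings with that of the current
    embeddings. `encoder` is any image encoder that maps an image to a feature
    vector (e.g., a CLIP-style vision tower); which encoder the paper actually
    uses is what I am asking about."""
    q_h, q_c = encoder(query_hist), encoder(query_curr)
    c_h, c_c = encoder(cand_hist), encoder(cand_curr)
    return 0.5 * (F.cosine_similarity(q_h, c_h, dim=-1)
                  + F.cosine_similarity(q_c, c_c, dim=-1))
```

Stating the exact formulation and encoder in the paper would make the retrieval results easier to reproduce.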
Please see "Weaknesses".
Fully human-written
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This study proposes TemMed-Bench, the first benchmark specifically designed for evaluating Large Vision-Language Models' (LVLMs) ability to reason over temporal medical images by tracking changes in patients' conditions between different clinical visits. The benchmark consists of a test set across three distinct tasks:
1. Visual Question Answering (VQA): Challenges models to answer binary "yes" or "no" questions about condition changes between a historical and current image pair.
2. Report Generation: Requires models to generate a report analyzing condition changes observed between a historical and current image.
3. Image-pair Selection: A vision-centric task where models must select the image pair (from three options) that best matches a given medical statement describing a condition change.
The benchmark further includes a knowledge corpus of over 17,000 instances to support RAG. To enhance retrieval quality for temporal reasoning, the study introduces a novel “pairwise image retrieval” method, which computes similarity based on both historical and current images to ensure that retrieved instances reflect similar condition changes, thereby enabling multi-modal retrieval augmentation. The evaluation reveals significant limitations in current LVLMs' temporal reasoning abilities and highlights the effectiveness of multi-modal RAG.
1. Large-scale Benchmark Construction: A total of 18,144 instances were constructed using an automated data collection pipeline applied to the CheXpert Plus dataset.
2. Multimodal RAG: The study introduces a multimodal RAG approach, incorporating a pairwise image retrieval method that considers similarity between historical and current images.
3. Temporal Reasoning Evaluation: Beyond single-visit image interpretation benchmarks, this work evaluates models’ ability to infer condition changes over time by comparing historical and current medical images.
This study makes a valuable contribution to evaluating temporal medical image reasoning in LVLMs. However, it exhibits several limitations:
1. Restricted Scope of "Temporal" Reasoning: The benchmark limits temporal analysis to a binary comparison between the current image and only the most recent prior visit. This simplified setting does not evaluate long-term progression, intermittent conditions, or multi-visit trends typical in clinical practice, reducing realism and the complexity of the reasoning task.
2. Lack of Robustness and Transparency in Data Collection: The dataset relies on handcrafted regular expressions to extract “condition change” sentences from medical reports, without any quantitative validation (precision, recall, F1) to assess extraction accuracy. Considering the variability of clinical language, this approach risks misinterpretation, extraction errors, and bias, ultimately harming dataset quality and representativeness. Clear documentation of the dataset extraction process is needed, along with transparent analysis of coverage and missed cases. Quantitative evidence demonstrating extraction performance would strengthen confidence in the dataset.
3. Insufficient Transparency and Expertise in Human Review: While the paper notes that human reviewers validated AI-generated VQA question-answer pairs, additional details about the review process would help support the reliability of this validation. Information such as the number of reviewers, their clinical expertise, review guidelines, and inter-rater agreement metrics (e.g., Cohen’s Kappa) would strengthen the methodological transparency.
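For instance, a minimal sketch of the kind of agreement reporting I have in mind, assuming two independent reviewers labeled a shared subset of items (the labels below are invented):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-item judgments ("keep" / "fix") from two independent reviewers
# on the same subset of AI-generated VQA pairs.
reviewer_a = ["keep", "keep", "fix", "keep", "fix", "keep"]
reviewer_b = ["keep", "fix",  "fix", "keep", "fix", "keep"]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa between reviewers: {kappa:.2f}")
```

Reporting this alongside the number of reviewers and their clinical background would substantially strengthen the validation claim.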
Were data ethics maintained in the use of proprietary LVLMs (e.g., when sending patient imaging data to external APIs)?
Moderately AI-edited
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper presents TemMed-Bench, a new benchmark for evaluating temporal reasoning in medical vision-language models. Unlike previous datasets that test single-image understanding, TemMed-Bench requires models to compare image pairs from different patient visits to assess disease progression or improvement. It includes three task types: VQA, report generation, and image-pair selection. The authors evaluate open and proprietary LVLMs and find that most fail to perform reliable temporal reasoning. Incorporating multi-modal retrieval (image + text) improves performance over text-only retrieval but only modestly.
- **High practical relevance**: The proposed benchmark reflects real clinical workflows where patient history and temporal comparison are essential for diagnosis.
- **Comprehensive benchmark design**: It combines multiple task types (VQA, report generation, image-pair selection) to test LVLMs in medical reasoning from different perspectives.
- **Clear writing**: The paper is well written and clearly structured. The figures and tables are visually clear and effectively support the main points.
- **Benchmark difficulty**: The best-performing model, GPT-4o mini, already achieves around 80% accuracy (Tables 3 and 4), suggesting that parts of the benchmark may not be very challenging for advanced models. In my opinion, this raises concerns about the benchmark’s long-term difficulty and its ability to differentiate performance among next-generation LVLMs.
- **No human expert validation**: The benchmark lacks human radiologist agreement or interpretability checks to confirm tasks and annotation correctness. Although the paper mentions a manual quality control for the VQA dataset, it focuses only on consistency checking rather than clinical verification by radiologists.
- **Limited task diversity beyond radiology**: The benchmark focuses mainly on chest X-ray data (CheXpert Plus), limiting the generalization to other modalities such as CT, MRI, or ultrasound.
1. The paper shows that multi-modal RAG (combining image and text retrieval) improves performance over text-only retrieval. Could the authors clarify how the retrieved visual and textual evidence are integrated into the LVLM’s reasoning process? (A sketch of the interleaving I have in mind appears after these questions.)
2. How do the authors ensure that the retrieval method does not introduce data leakage or spurious correlations from similar patient cases within the same dataset?
3. Given that GPT-4o mini already reaches around 80% accuracy, how likely is the benchmark to remain challenging and discriminative as future models improve?
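Regarding question 1, the kind of integration I would expect is sketched below; the message schema and field names are assumptions on my part, not the paper's implementation:

```python
def build_multimodal_rag_prompt(query_pair, retrieved, question):
    """Assemble an interleaved multi-image prompt for an LVLM.
    `query_pair` is (historical_image, current_image); `retrieved` is a list of
    (historical_image, current_image, report) tuples from the knowledge corpus.
    The exact chat/message schema depends on the target model's API."""
    content = []
    for i, (hist, curr, report) in enumerate(retrieved, start=1):
        content += [
            {"type": "text", "text": f"Reference case {i}: historical then current image."},
            {"type": "image", "image": hist},
            {"type": "image", "image": curr},
            {"type": "text", "text": f"Reference report {i}: {report}"},
        ]
    content += [
        {"type": "text", "text": "Query case: historical then current image."},
        {"type": "image", "image": query_pair[0]},
        {"type": "image", "image": query_pair[1]},
        {"type": "text", "text": question},
    ]
    return [{"role": "user", "content": content}]
```

Clarifying whether the paper uses this kind of interleaving, or instead condenses the retrieved evidence into text only, would help readers interpret the multi-modal RAG gains.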
Moderately AI-edited
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces TEMMED-BENCH, a benchmark that assesses temporal reasoning by comparing historical and current medical images. It demonstrates that multi-modal RAG substantially boosts performance and includes manual QC for VQA, yet its single-modality, single-source design, limited failure-case analysis, and lack of broader SOTA baselines leave it vulnerable to score gaming and constrain its generalizability.
1. Presents a benchmark that explicitly compares historical vs. current medical images to judge disease progression.
2. Systematically explores the impact of multi-modal RAG and demonstrates that it can markedly enhance task performance, providing a practical pathway for future agents/medical LLMs to handle such problems.
3. Includes manual quality control for the VQA subset, correcting approximately 10% of items and thereby improving the benchmark’s credibility.
4. Compared with datasets like U-Xray and MIMIC-CXR, this benchmark emphasizes reasoning grounded in retrieved evidence, making it less vulnerable to simple pattern-matching hacks.
1. The image modality is overly narrow, and the data come from a single, easily accessible source; as a benchmark, it is therefore easy to reverse engineer and reward-hack, which may make it obsolete quickly.
2. The paper mainly offers macro-level discussion of “why this is hard” and “which settings cause models to fail or become ineffective,” but lacks a systematic, category-wise failure analysis with concrete error cases. Since a benchmark is meant to drive LLM progress in a specific domain, the work should provide insights explaining why certain models fail.
3. Many evaluated LVLMs are not SoTA in 2025. For example, only Gemini 2.5 Flash was tested (not Gemini 2.5 Pro), and Claude 3.5 Sonnet (not 3.7 or later). Newer models should be included. Several open-source baselines are also relatively old, e.g., LLaVA-Med.
Minor:
- The benchmark’s “temporal reasoning” setup is essentially **two-timepoint differencing**, rather than true longitudinal reasoning about disease progression over time.
Please answer the following questions:
1. Because the dataset comes from a single, easily accessible source, the benchmark appears highly vulnerable to reverse engineering and score gaming. Do you have any safeguards to prevent this?
2. Please provide a failure-case analysis with concrete examples, and consider adding this section in future revisions.
3. If possible, include results from more advanced models to strengthen the evaluation.
Lightly AI-edited |