|
Learning from What the Model Forgets: Prototype-Guided Patch-wise Replay for Medical Image Segmentation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a prototype-guided patch-wise replay strategy for medical image segmentation: (1) CLIP-based text–image fusion to incorporate semantic priors, (2) prototype-based scoring to identify moderately forgettable samples, and (3) an online replay buffer to revisit them during training. The method is simple to implement and is evaluated on five datasets (≤5 classes). Ablations and sensitivity analyses are clear; improvements are consistent but generally small.
1. Clear, readable paper with a straightforward method.
2. Well-designed ablations and sensitivity studies that isolate replay frequency, prototype size, and CLIP fusion.
3. Consistent (though small) gains across datasets without heavy architectural changes.
1. The absolute Dice improvements are marginal. The paper needs multi-seed runs with statistical tests to establish significance.
2. Evaluation scope is narrow (five small-class datasets), missing large multi-organ benchmarks (BTCV, AMOS, TotalSegmentator v2) to test scalability and class-wise robustness.
3. Baselines are incomplete: missing strong or hybrid models (TransUNet [1], MedNeXt [2], EMCAD [3], etc.).
4. No discussion of, or comparison with, established prototype-based or memory-replay methods for segmentation or continual learning under a shared protocol.
5. CLIP text encoder is frozen and general-domain; fine-tuning or using BioMedCLIP may improve alignment.
6. Only 2D UNet-style backbones are tested; impact on pretrained hybrids/transformers (TransUNet [1], EMCAD [3]) is unknown.
7. Impact on interactive foundation models (e.g., Med-SAM) is untested but promising.
[1] Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L. and Zhou, Y., 2021. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306.
[2] Roy, S., Koehler, G., Ulrich, C., Baumgartner, M., Petersen, J., Isensee, F., Jaeger, P.F. and Maier-Hein, K.H., 2023. MedNeXt: Transformer-driven scaling of ConvNets for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 405-415). Cham: Springer Nature Switzerland.
[3] Rahman, M.M., Munir, M. and Marculescu, R., 2024. EMCAD: Efficient multi-scale convolutional attention decoding for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11769-11779).
1. Are the gains statistically significant over multiple seeds? Please report mean±std and appropriate significance tests per dataset (a minimal analysis sketch is given after this question list).
2. How does the method perform on large multi-organ datasets, such as BTCV, AMOS, TotalSegmentator v2, including per-class results under strong imbalance?
3. What is the effect of integrating the replay mechanism into pretrained hybrid models (TransUNet [1], EMCAD [3])?
4. Does fine-tuning CLIP's text encoder or swapping to BioMedCLIP improve results?
5. How does this approach compare to established prototype and memory-replay baselines under an identical training pipeline?
6. Could replay be combined with Med-SAM to guide interactive segmentation? |
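To make Question 1 concrete, here is a minimal sketch of the intended analysis, assuming per-seed test Dice scores are available for the proposed method and the strongest baseline (the numbers below are placeholders, not results from the paper):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-seed mean Dice on one dataset (5 seeds each);
# replace with the actual multi-seed results.
dice_proposed = np.array([0.861, 0.858, 0.864, 0.859, 0.862])
dice_baseline = np.array([0.855, 0.853, 0.857, 0.854, 0.856])

print(f"proposed: {dice_proposed.mean():.4f} ± {dice_proposed.std(ddof=1):.4f}")
print(f"baseline: {dice_baseline.mean():.4f} ± {dice_baseline.std(ddof=1):.4f}")

# Paired non-parametric test over seeds (a paired t-test is an alternative if
# normality holds); repeat per dataset and correct for multiple comparisons.
stat, p = wilcoxon(dice_proposed, dice_baseline)
print(f"Wilcoxon signed-rank: statistic={stat:.3f}, p={p:.4f}")
```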
Fully AI-generated |
|
Learning from What the Model Forgets: Prototype-Guided Patch-wise Replay for Medical Image Segmentation |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a prototype-guided, CLIP-informed framework for medical image segmentation that identifies and replays moderately forgettable samples (patches that lie near decision boundaries and are prone to being forgotten during training). The approach combines three modules:
- Text-Guided Fusion (TGF), which incorporates CLIP text embeddings to guide visual prototype formation.
- Prototype-Based Scoring (PBS), which measures sample difficulty via intra-/inter-class distances and confidence-based metrics.
- Forgettable Sample Bank (FSB), which maintains and replays informative samples to reinforce learning.
Experiments on five public datasets (KiTS2023, BraTS2020, ACDC, FLARE2021, PROMISE2012) show consistent gains in Dice and sensitivity, and lower Hausdorff distances than baselines like nnU-Net, Attention U-Net, and MambaUNet.
- The paper addresses an important task in medical image segmentation: how to make models robust in low-contrast regions.
- The proposed method is innovative and effective: particularly, using CLIP text embeddings for *training-time* guidance and using PBS and FSB to keep the training focus on hard cases.
- Comprehensive evaluation across diverse datasets, with an informative Results section and strong performance compared to baselines.
- The writing is clear and easy to follow.
- The explanation of how CLIP contributes during training is unclear. The statement “CLIP semantic guidance provides discriminative information beyond visual appearance” is overly general and does not specify the mechanism by which CLIP influences feature learning. It would strengthen the paper to include feature-space visualizations (e.g., t-SNE or UMAP plots) comparing models trained with and without text-guided fusion, to demonstrate the effect of CLIP guidance on representation structure. In addition, the discussion of prompt design is limited. It would be useful to analyze how different prompt formulations affect training and whether the observed performance gain is robust to prompt variation.
- Although CLIP is frozen, its bias toward natural image semantics may limit robustness in rare or pathology-heavy datasets. Some comparison with medical-domain text encoders (e.g., MedCLIP, BioCLIP) would clarify sensitivity to text priors.
- Table 3 lumps PBS and FSB together in some configurations. Independent ablations would better clarify each module’s role. For this, the authors may consider comparing PBS with existing sample-scoring methods by substituting one of them for PBS in the framework.
- Regarding $Score^b$ and forgettable samples: although the intuition behind the formulation of $Score^b$ is clear, its relationship to the true forgetting frequency (as per Toneva et al., 2019) is not quantitatively demonstrated. A correlation plot or ablation on actual forgetting events would strengthen the claim.
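To make the point above concrete, here is a minimal sketch of the suggested correlation check, assuming per-patch correctness can be logged each epoch and the paper's $Score^b$ values are available (the arrays below are random placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

def count_forgetting_events(correct_per_epoch: np.ndarray) -> np.ndarray:
    """Count forgetting events per sample (Toneva et al., 2019): transitions
    from correctly predicted at epoch t to incorrect at epoch t+1.

    correct_per_epoch: bool array of shape (num_epochs, num_patches), e.g.,
    whether a patch's Dice exceeded a fixed threshold at that epoch.
    """
    forgets = correct_per_epoch[:-1] & ~correct_per_epoch[1:]
    return forgets.sum(axis=0)

# Hypothetical logs: 50 epochs x 200 patches, plus the paper's Score^b values.
rng = np.random.default_rng(0)
correct_log = rng.random((50, 200)) > 0.3
score_b = rng.random(200)

forgetting_counts = count_forgetting_events(correct_log)
rho, p = spearmanr(score_b, forgetting_counts)
print(f"Spearman rho between Score^b and forgetting events: {rho:.3f} (p={p:.3g})")
```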
- Were there any experiments done on prompt design? How sensitive is the method to different prompts?
- Would domain-specific encoders (e.g., MedCLIP, BioLinkBERT) provide similar or better benefits than CLIP?
- About $Score^b$, were there experiments exploring or tuning the weights assigned to its components?
- Have the authors compared PBS against standard hardness metrics such as loss magnitude, gradient norm, or prediction entropy to show unique benefit? |
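As a concrete example for Question 4, here is a minimal sketch of one such hardness baseline (prediction entropy); the function name and tensor shapes are our own, not from the paper:

```python
import torch

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-patch prediction entropy as a simple hardness baseline.
    logits: (B, C, H, W) raw network outputs for a batch of patches."""
    probs = torch.softmax(logits, dim=1)
    ent = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1)  # (B, H, W)
    return ent.mean(dim=(1, 2))  # mean entropy per patch

# Patches whose entropy falls in a middle band would be a simple
# "moderately hard" selection rule to compare against PBS.
```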
Heavily AI-edited |
|
Learning from What the Model Forgets: Prototype-Guided Patch-wise Replay for Medical Image Segmentation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes an end-to-end framework for medical image segmentation that mines moderately forgettable (hard-positive) samples to reduce false negatives and improve boundary accuracy. Specifically, it introduces CLIP-based text embeddings to guide prototype learning for semantically richer features, defines a multi-metric difficulty measure to score prototypes, and uses an online forgettable sample bank to dynamically store and replay difficult samples for curriculum-like retraining. Experiments on five public datasets show improvements over baselines. Ablations show each module's contribution.
1. This paper focuses on a previously underexplored problem and introduces moderately forgettable sample mining guided by CLIP semantics.
2. This paper proposes a multi-metric prototype-based score that balances geometric and probabilistic cues.
3. Extensive experiments and ablations.
1. All experiments rely on 2D patch training, even for 3D datasets.
2. The online bank and prototype updates likely introduce overhead.
1. Motivation & Novelty
1.1 Clarity of “Moderately Forgettable Samples”
The concept of “moderately forgettable samples” is central to this paper, but its definition remains informal. The authors should provide a clearer, quantitative criterion to distinguish these samples from easy or noisy ones. Moreover, it remains unclear how the proposed method guarantees that the identified samples correspond to clinically meaningful hard positives rather than mislabeled or ambiguous regions.
1.2 Technical Novelty and Contribution.
The proposed components (text-guided fusion, prototype-based scoring, and a replay memory bank) individually build upon well-established ideas. The authors should better highlight what is fundamentally new in their formulation or analysis compared with prior works on hard-sample mining, prototype learning, or CLIP-based semantic guidance, especially to appeal to a broader ICLR audience beyond medical imaging.
2. Method
2.1 Section 3.2 (Prototype-Based Scoring)
(1) Line 235, “Our approach addresses these limitations by leveraging semantically-enhanced prototypes to provide both computational efficiency and semantic-aware patch-level scoring.”
This claim requires justification. How does semantic enhancement improve efficiency rather than add overhead? A brief complexity analysis or runtime comparison would clarify this point.
(2) Line 278, “The four terms are normalized by the number of pixels to bring them to a comparable scale.”
Please analyze how this normalization affects the relative weighting among metrics, particularly for organs of different sizes. Could this bias the difficulty estimation toward small or large structures?
2.2 Sect. 3.3 Forgettable Sample Bank
How does performance vary with different bank sizes, and what trade-offs exist between memory cost, sample diversity, and replay stability? A more systematic guideline or sensitivity curve would strengthen this part.
3. Experiments
3.1 The paper shows strong overall results but does not clearly isolate the effect of CLIP-based semantic fusion on prototype learning. Visualization or quantitative analysis would help demonstrate how text guidance improves the representation quality.
3.2 How sensitive is the framework to the choice of frozen CLIP backbone (ViT-B/32 vs ViT-L/14)?
3.3 The method introduces additional modules. What is the computational overhead (memory and runtime) compared with the nnU-Net baseline? This would clarify the practicality of deploying the framework in real clinical workflows. |
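To make 3.3 concrete, here is a minimal sketch of the kind of profiling we have in mind, assuming a training-step callable is available for each variant (the function and variable names are hypothetical):

```python
import time
import torch

def profile_step(step_fn, batch, n_warmup: int = 5, n_iters: int = 50):
    """Measure mean wall-clock time and peak GPU memory of one training step."""
    for _ in range(n_warmup):          # warm-up to exclude one-off setup costs
        step_fn(batch)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn(batch)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / n_iters
    peak_mem = torch.cuda.max_memory_allocated() / 1024 ** 2
    return elapsed, peak_mem

# Hypothetical usage: step functions for the plain nnU-Net baseline and for the
# full framework (TGF + PBS + FSB), called on the same batch.
# t_base, m_base = profile_step(nnunet_step, batch)
# t_full, m_full = profile_step(full_framework_step, batch)
# print(f"runtime overhead: {100 * (t_full / t_base - 1):.1f}%, "
#       f"memory overhead: {m_full - m_base:.0f} MiB")
```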
Fully AI-generated |
|
Learning from What the Model Forgets: Prototype-Guided Patch-wise Replay for Medical Image Segmentation |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a medical image segmentation framework that uses a prototype-guided, hard-sample patch-wise replay method and incorporates CLIP text embeddings for encoder-decoder feature fusion in the U-Net architecture. While the motivation of addressing hard samples near the decision boundary is relevant, the paper suffers from limited methodological originality and lacks convincing justification for its core components.
1. Addressing the issue of hard-positive samples / hard samples that are near the decision boundary is an important direction in the field.
2. The authors conduct experiments across five datasets, covering different anatomical structures and modalities.
Lack of novelty in the core method design: the idea of using CLIP text embeddings to facilitate medical image segmentation has been heavily explored in recent years (e.g., [1-3]), and there are no significant differences suggesting that this approach is innovative. Moreover, the TGF module can be regarded as a type of attention-gated mechanism.
[1] CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection (ICCV 2023)
[2] PCNet: Prior Category Network for CT Universal Segmentation Model (TMI 2024)
[3] Text-driven Multiplanar Visual Interaction for Semi-supervised Medical Image Segmentation (MICCAI 2025)
1. My biggest concerns are set out in the weaknesses section.
2. Although the authors state that "The core question is not architectural but strategic: how to define sample difficulty" and adopt the same 2D patch-based framework, testing the proposed strategy only on the nnU-Net backbone, I think it is important to validate the generalization of the strategy across different architectures (such as those compared in Table 1) and to rule out possible over-optimization to a single backbone.
3. There may be a large domain gap between CLIP's natural-image text descriptions and medical images, and the performance gain may simply result from adding a powerful, high-dimensional, pre-trained feature vector that acts as a strong form of regularization or feature enrichment, rather than from true *semantic guidance* derived from the text input. Without deeper analysis demonstrating that the text features align with medical concepts (e.g., via visualization or linear probing; a minimal probing sketch is given after this list), the claim of semantic guidance is unconvincing (as stated around lines 198-200).
4. If evidence is provided for question (3), will medically tuned/oriented CLIP variants provide better performance under the framework?
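To make point 3 concrete, here is a minimal linear-probing sketch, assuming frozen patch-level features can be extracted from models trained with and without text-guided fusion (the features below are random placeholders, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical frozen patch-level features extracted from models trained with
# and without text-guided fusion, plus per-patch class labels.
rng = np.random.default_rng(0)
feats_with_tgf = rng.normal(size=(1000, 256))
feats_without_tgf = rng.normal(size=(1000, 256))
labels = rng.integers(0, 3, size=1000)

# A linear probe on frozen features: if CLIP guidance provides genuine semantic
# structure (not just regularization), class separability of the with-TGF
# features should be measurably higher.
for name, feats in [("with TGF", feats_with_tgf), ("without TGF", feats_without_tgf)]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), feats, labels, cv=5).mean()
    print(f"{name}: linear-probe accuracy = {acc:.3f}")
```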
Minor:
1. Provide baseline results for the ablation study (Table 3).
2. Missing bold highlighting of the best DSC for PROMISE2012 in Table 1 (lines 351-352). |
Fully human-written |