Real-time Echocardiography Video Segmentation via Slot Propagation, Spatiotemporal Feature Fusion, and Frequency-phase Enhancement
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
While the paper addresses an important and clinically relevant problem, the proposed method lacks sufficient novelty. The core components of the architecture (slot-based propagation, spatiotemporal fusion, and frequency-domain enhancement) are either direct adaptations or incremental combinations of techniques that have already appeared in the literature, particularly in the medical image/video segmentation and foundation model adaptation domains. Below, I detail specific concerns regarding the methodology.
1. The paper is well organized and has a clear motivation.
### 1. Limited novelty
The paper claims that its “context-guided slot propagation (CGSP)” mechanism is a key innovation for separating foreground and background regions in noisy echocardiographic videos. However, slot-based representations for object-centric learning and video segmentation have already been extensively studied in previous works [1–8]. The authors have not clearly articulated how this manuscript differs from or advances beyond these prior studies.
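For reference, the generic slot-attention update from [1] reduces to a few lines of iterative, normalized cross-attention. The sketch below is a PyTorch-style simplification with my own shapes and names (not the authors' code); it illustrates the baseline recipe that CGSP should be contrasted against explicitly.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal slot-attention update in the spirit of Locatello et al. [1]."""

    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.iters = iters
        # Learned slot initialization (the original paper samples from a learned Gaussian).
        self.slots_init = nn.Parameter(torch.randn(num_slots, dim) * dim ** -0.5)
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) flattened per-frame features; returns slots: (B, K, D)
        B, N, D = feats.shape
        feats = self.norm_inputs(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_init.unsqueeze(0).expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis makes slots compete for input locations.
            attn = torch.softmax(q @ k.transpose(1, 2) * D ** -0.5, dim=1)   # (B, K, N)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)            # weighted mean over inputs
            updates = attn @ v                                               # (B, K, D)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
        return slots
```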
The SFF module aggregates features from reference and query frames using query-key-value attention, a pattern that is now very common in video segmentation, e.g., [6, 9–14]. For example, XMem [9] and its successor XMem++ [6] already employ cross-frame attention with memory banks to fuse spatiotemporal context efficiently. The prototype-based matching in Eqs. (12)–(16) closely resembles the feature correlation and readout mechanisms of STM [11], which is widely cited in video object segmentation, and similar designs can also be found in the medical imaging domain, e.g., [12], [13], [14]. Thus, the SFF module offers no novel architectural or theoretical departure from established paradigms.
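To make the overlap concrete, the space-time memory readout of STM [11], which the XMem line [6, 9] builds on, is essentially a key-affinity plus value-readout step. The following is an illustrative sketch; tensor names and shapes are my own assumptions, not the paper's notation.

```python
import torch
import torch.nn.functional as F

def memory_readout(query_key: torch.Tensor,
                   memory_key: torch.Tensor,
                   memory_value: torch.Tensor) -> torch.Tensor:
    """STM-style [11] space-time memory read: affinity between the query frame's keys
    and the keys of T stored reference frames, followed by a softmax-weighted readout
    of the stored values.

    query_key:    (B, Ck, H, W)
    memory_key:   (B, Ck, T, H, W)
    memory_value: (B, Cv, T, H, W)
    """
    B, Ck, H, W = query_key.shape
    qk = query_key.flatten(2)                                     # (B, Ck, HW)
    mk = memory_key.flatten(2)                                    # (B, Ck, T*HW)
    mv = memory_value.flatten(2)                                  # (B, Cv, T*HW)
    affinity = torch.einsum('bcm,bcq->bmq', mk, qk) / Ck ** 0.5   # (B, T*HW, HW)
    weights = F.softmax(affinity, dim=1)                          # normalize over memory locations
    readout = torch.einsum('bcm,bmq->bcq', mv, weights)           # (B, Cv, HW)
    return readout.view(B, -1, H, W)
```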
The FPE module applies an FFT, modulates amplitude/phase with learnable masks, and uses an IFFT to reconstruct features, a strategy that has seen multiple recent instantiations: [15] and [16] both exploit frequency-domain filtering or noise-robust tuning for image segmentation, explicitly addressing the generalization problem in ultrasound. Frequency-aware SAM variants, such as [17], already integrate frequency priors into SAM backbones for enhanced boundary delineation under noise, which directly overlaps with the motivation of FPE.
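The generic form of this FFT -> learnable amplitude/phase masks -> IFFT block, as instantiated in several of the works above, looks roughly as follows. This is an illustrative sketch under my own naming assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrequencyEnhancement(nn.Module):
    """Generic FFT -> learnable amplitude/phase modulation -> IFFT block, the pattern
    shared by recent frequency-aware segmentation methods such as [15-17]."""

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # rfft2 keeps width // 2 + 1 bins along the last (half-spectrum) axis.
        self.amp_mask = nn.Parameter(torch.ones(channels, height, width // 2 + 1))
        self.phase_mask = nn.Parameter(torch.zeros(channels, height, width // 2 + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) real-valued feature map
        spec = torch.fft.rfft2(x, norm='ortho')                   # complex half-spectrum
        amp, phase = spec.abs(), spec.angle()
        amp = amp * self.amp_mask                                 # learnable amplitude re-weighting
        phase = phase + self.phase_mask                           # learnable phase shift
        spec = torch.polar(amp, phase)                            # recombine into a complex spectrum
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm='ortho')
```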
### 2. Inconsistent comparison with SOTA methods
Several SOTA methods are compared under inconsistent experimental conditions:
The paper uses MiT-b2 (SegFormer backbone), which is significantly more powerful than the backbones used in many cited baselines (e.g., U-Net in early SAMUS variants, ResNet in XMem). Yet, the authors do not re-implement or re-benchmark these methods with the same backbone for a fair comparison.
### 3. Minor Weaknesses
- Missing Statistical Significance and Variance Reporting: results should be accompanied by variance estimates (e.g., standard deviations over repeated runs) and significance tests.
- High FLOPs and Parameter Count Undermine the "Real-Time" Claim. As shown in Table 2, FESPNet has 370 GFLOPs and 34.3M parameters: over 100× the FLOPs of SimLVSeg (3G) and more than 2× those of PKEchoNet (158G), yet it achieves only marginal mDice gains. Its cost is even higher than that of Cutie (218G) and Swin-UMamba (340G), both of which are already considered heavy for real-time medical applications; the ratios are spelled out in the quick check below.
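For reference, a quick check of the ratios implied by the GFLOPs quoted in Table 2, using the numbers exactly as reported in the paper:

```python
# Ratios implied by the GFLOPs reported in Table 2 of the paper.
gflops = {"FESPNet": 370, "SimLVSeg": 3, "PKEchoNet": 158, "Cutie": 218, "Swin-UMamba": 340}
for name, g in gflops.items():
    if name != "FESPNet":
        print(f"FESPNet vs {name}: {gflops['FESPNet'] / g:.1f}x FLOPs")
# -> ~123x vs SimLVSeg, ~2.3x vs PKEchoNet, ~1.7x vs Cutie, ~1.1x vs Swin-UMamba
```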
### 4. Typos
- 053: "acorss frames" -> "across frames"
- 228: "a new feature map F_S ∈ R^{K×H×W}, where N is the number of slots" — N does not appear in the stated dimensions; where is N defined?
- 240: ", its featurer presentation H_{S_i} ∈ R^L" — should read "feature representation"; also, where is L defined?
[1] Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A. and Kipf, T., 2020. Object-centric learning with slot attention. Advances in neural information processing systems, 33, pp.11525-11538.
[2] Lee, M., Cho, S., Lee, D., Park, C., Lee, J. and Lee, S., 2024. Guided slot attention for unsupervised video object segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3807-3816).
[3] Liao, G., Jogan, M., Hussing, M., Zhang, E., Eaton, E. and Hashimoto, D.A., 2025, September. Future slot prediction for unsupervised object discovery in surgical video. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 219-229). Cham: Springer Nature Switzerland.
[4] Madan, S., Chaudhury, S. and Gandhi, T.K., 2024, November. Pneumonia Classification in Chest X-Ray Images Using Explainable Slot-Attention Mechanism. In International Conference on Pattern Recognition (pp. 271-286). Cham: Springer Nature Switzerland.
[5] Deng, X., Wu, H., Zeng, R. and Qin, J., 2024. MemSAM: Taming Segment Anything Model for echocardiography video segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9622-9631).
[6] Bekuzarov, M., Bermudez, A., Lee, J.Y. and Li, H., 2023. XMem++: Production-level video segmentation from few annotated frames. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 635-644).
[7] Jaegle, A., Borgeaud, S., Alayrac, J.B., Doersch, C., Ionescu, C., Ding, D., Koppula, S., Zoran, D., Brock, A., Shelhamer, E. and Hénaff, O., 2021. Perceiver IO: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795.
[8] Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A. and Carreira, J., 2021, July. Perceiver: General perception with iterative attention. In International conference on machine learning (pp. 4651-4664). PMLR.
[9] Cheng, H.K. and Schwing, A.G., 2022, October. XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model. In European conference on computer vision (pp. 640-658). Cham: Springer Nature Switzerland.
[10] Maani, F., Ukaye, A., Saadi, N., Saeed, N. and Yaqub, M., 2024. SimLVSeg: simplifying left ventricular segmentation in 2-D+ time echocardiograms with self-and weakly supervised learning. Ultrasound in Medicine & Biology, 50(12), pp.1945-1954.
[11] Oh, S.W., Lee, J.Y., Xu, N. and Kim, S.J., 2019. Video object segmentation using space-time memory networks. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9226-9235).
[12] Wang, R. and Zheng, G., 2024. PFMNet: Prototype-based feature mapping network for few-shot domain adaptation in medical image segmentation. Computerized Medical Imaging and Graphics, 116, p.102406.
[13] Yuan, Y., Wang, X., Yang, X. and Heng, P.A., 2024. Effective Semi-Supervised Medical Image Segmentation With Probabilistic Representations and Prototype Learning. IEEE Transactions on Medical Imaging.
[14] Kim, H., Hansen, S. and Kampffmeyer, M., 2025, September. Tied Prototype Model for Few-Shot Medical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 651-661). Cham: Springer Nature Switzerland.
[15] Chen, L., Fu, Y., Gu, L., Zheng, D. and Dai, J., 2025. Spatial frequency modulation for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[16] Wei, Z., Wu, C., Du, H., Yu, R., Du, B. and Xu, Y., 2025, September. Noise-Robust Tuning of SAM for Domain Generalized Ultrasound Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 476-486). Cham: Springer Nature Switzerland.
[17] Kim, S., Jin, P., Chen, C., Kim, K., Lyu, Z., Ren, H., Kim, S., Liu, Z., Zhong, A., Liu, T. and Li, X., 2025. MediViSTA: Medical Video Segmentation via Temporal Fusion SAM Adaptation for Echocardiography. IEEE Journal of Biomedical and Health Informatics.
Please find my comments above.