ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 2.00 | 4.00 | 4028 |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (75%) | 6.67 | 4.33 | 3893 |
| Total | 4 (100%) | 5.50 | 4.25 | 3927 |
Individual Reviews
Exploiting Low-Dimensional Manifold of Features for Few-shot Whole Slide Image Classification

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper identifies a novel cause of overfitting in few-shot WSI classification: the distortion of low-dimensional feature manifolds by standard linear layers. The authors propose a plug-and-play "Manifold Residual (MR) block" that replaces these layers, using a fixed random matrix as a "geometric anchor" to preserve manifold structure and a separate low-rank pathway for task adaptation (see the illustrative sketch after this review).

Strengths:
- The paper is built on a strong, clear, and insightful hypothesis. The diagnosis of overfitting as a geometric problem (i.e., manifold distortion by geometry-agnostic layers) rather than a purely data-scarcity problem is a novel and compelling contribution to the field.
- The core hypothesis is well supported by quantitative analysis before the method is introduced. The use of spectral analysis to show low effective rank (Fig. 1) and tangent space analysis to demonstrate both the manifold's curvature and its distortion by standard linear layers (Fig. 1) provides a solid and convincing foundation for the proposed solution.

Weaknesses:
- Limited evaluation datasets. While many MIL methods are tested, the number of task types (classification only) and the number of organs (limited to 3) are quite low for demonstrating a robust method improvement.
- The tasks are also artificial few-shot tasks. These tasks (e.g., NSCLC subtyping) have thousands of data points, but the few-shot splits are artificially sampled. I recommend trying some real few-shot tasks, such as treatment response prediction. This type of task will always be few-shot in nature, and improving performance in this domain would carry tremendous benefit for the field, which is not true for the rather solved tasks of RCC and NSCLC subtyping. 10+ treatment response prediction tasks can be found at: https://huggingface.co/datasets/MahmoodLab/Patho-Bench

Questions:
- Why was $r=64$ used for the main comparison tables, while it is shown that performance saturates at $r=32$ (Fig. 3)? Does this choice, which doubles the parameters of the LRP, potentially understate the true parameter efficiency and performance of the MR block at its optimal rank? It may be useful to report the $k=16$ results for MR-CATE with $r=32$ in the text to show that the SOTA performance holds at this more theoretically motivated rank.
- Is a random matrix the optimal choice for preserving the specific, learned structure of a foundation model's feature manifold? For instance, would a fixed projection based on the principal components (PCA) of the training set features serve as a more "informed" (but still fixed) geometric anchor?

EditLens Prediction: Fully human-written
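The reviews describe the MR block only at a high level: a fixed random matrix acting as a geometric anchor plus a separate trainable low-rank pathway. The minimal PyTorch sketch below reflects that reading; the class name, the additive combination of the two pathways, the 1/sqrt(d_in) scaling, and the zero initialization of the up-projection are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn


class MRBlockSketch(nn.Module):
    """Illustrative stand-in for the Manifold Residual block described above.

    Assumes the output sums (i) a frozen random projection (the "geometric
    anchor") and (ii) a trainable rank-r residual pathway. The paper's exact
    formulation may differ.
    """

    def __init__(self, d_in: int, d_out: int, rank: int = 64):
        super().__init__()
        # Fixed random anchor: sampled once, never updated during training.
        B = torch.randn(d_in, d_out) / d_in ** 0.5
        self.register_buffer("B", B)
        # Trainable low-rank residual pathway (LRP): d_in -> rank -> d_out.
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a pure random projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_patches, d_in) bag of instance features
        return x @ self.B + self.up(self.down(x))
```

Under these assumptions the block is a drop-in replacement for an `nn.Linear(d_in, d_out)` inside a MIL head, and only the roughly `rank * (d_in + d_out)` parameters of the residual pathway are trained.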
Exploiting Low-Dimensional Manifold of Features for Few-shot Whole Slide Image Classification

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 10: strong accept, should be highlighted at the conference
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
The study proposes the **Manifold Residual block** to address overfitting in few-shot whole slide image (WSI) classification. It argues that overfitting stems not just from data scarcity but from a fundamentally geometric problem: features from pathology foundation models lie on a low-dimensional, nonlinear manifold that the linear layers in MIL models distort.

Strengths:
1) The study provides quantitative proof that CONCH features exhibit a low-dimensional manifold with nonlinear curvature, which linear layers disrupt.
2) The study proposes the **MR block**, with a fixed random geometric anchor and a trainable low-rank residual pathway, reducing overfitting and parameter count.
3) The study provides extensive validation to demonstrate the generalization of the proposed method.
4) The **MR block** achieves SOTA performance on three datasets across 4-, 8-, and 16-shot settings.

Weaknesses:
1) The study does not provide a comparison with SOTA methods for few-shot whole slide image classification such as MGPATH [1], MSCPT [2], and FOCUS [3].
2) The study does not report inference time and FLOPs for the proposed method.
3) The study does not fully explain the effect of rank on the model's performance. For example, the sensitivity analysis (Fig. 3) shows that performance saturates around a rank of **r=32**. The authors note this **aligns remarkably** with the features' effective rank of 29.7 (the effective-rank notion is sketched after this review). However, all main experiments in Table 1 and the ablation studies in Table 2 were run with **r=64**. In Fig. 3, for NSCLC 8-shot, **r=64** performs worse than **r=32** or **r=48**, suggesting **r=64** may be suboptimal.
4) The study lacks a clear description of how to apply the **MR** block to complex methods such as TransMIL or CATE.

**References**:
1. Nguyen, A.-T., Nguyen, D. M. H., Diep, N. T., Nguyen, T. Q., Ho, N., Metsch, J. M., Maurer, M. C., Sonntag, D., Bohnenberger, H., & Hauschild, A.-C. (2025). MGPATH: A vision-language model with multi-granular prompt learning for few-shot whole-slide pathology classification. Transactions on Machine Learning Research.
2. Han, Minghao, et al. "MSCPT: Few-shot whole slide image classification with multi-scale and context-focused prompt tuning." IEEE Transactions on Medical Imaging (2025).
3. Guo, Zhengrui, et al. "FOCUS: Knowledge-enhanced adaptive visual compression for few-shot whole slide image classification." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.

Questions:
1) How does the MR block perform if B is a non-random, fixed matrix, such as an identity matrix?
2) Can you confirm whether the same geometric distortion problem exists for pathology slide-level foundation models such as TITAN [1]?
3) Given the strong evidence for **r=32**, why were all main experiments (Table 1) and ablation studies (Table 2) run with **r=64**?
4) Could you elaborate on the methodology used to apply the **MR** block to TransMIL and CATE?

**Reference**:
1. Ding, T., et al. "Multimodal whole slide foundation model for pathology." (2024). URL https://arxiv.org/abs/2411.19666

EditLens Prediction: Fully human-written
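Weakness 3 above leans on the paper's reported effective rank of 29.7 for the pretrained features. For readers unfamiliar with the quantity, below is a minimal sketch of one standard estimator, the entropy-based effective rank of Roy & Vetterli (2007); the paper may compute it differently, and the centering step and synthetic example data are illustrative only.

```python
import numpy as np


def effective_rank(features: np.ndarray) -> float:
    """Entropy-based effective rank (Roy & Vetterli, 2007) of a feature matrix.

    `features` has shape (n_samples, d). Shown for intuition only; the paper
    may use a different estimator.
    """
    # Singular values of the (mean-centered) feature matrix.
    X = features - features.mean(axis=0, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()                        # normalized singular-value spectrum
    entropy = -np.sum(p * np.log(p + 1e-12))
    return float(np.exp(entropy))          # exp(entropy) = effective rank


# Example: a rank-limited 512-dim feature cloud gives an effective rank
# far below 512, on the order of the true rank (32 here).
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 32)) @ rng.standard_normal((32, 512))
print(effective_rank(X))
```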
Exploiting Low-Dimensional Manifold of Features for Few-shot Whole Slide Image Classification

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The authors introduce a novel layer to preserve low-dimensional manifold geometry within modern Multiple Instance Learning (MIL) frameworks for few-shot classification of whole slide pathology images. They first show that embedding spaces from well-known pathology foundation models exhibit low-dimensional manifold geometries and that these are not well preserved in popular attention-based MIL frameworks. The authors show that this tends to be due to linear layers, such as those present within the gated-attention mechanism used in ABMIL, and they propose a novel layer, called the Manifold Residual (MR) block, to better preserve geometry. The latter is decomposed into two parts operating on a feature matrix $X$: (1) a fixed random matrix transforming $X$ linearly, useful for preserving topology; (2) a trainable low-rank residual pathway (LRP). The authors study the theoretical properties of their method, demonstrate its relevance for improving many MIL models on few-shot classification tasks, and perform a range of ablation studies.

Strengths:
- The authors propose a relevant analysis to emphasize the low-dimensional manifold properties of a range of foundation models for pathology.
- They propose a novel layer for few-shot MIL, the MR block, with a custom training strategy.
- They provide theoretical results on a range of geometric/statistical properties preserved under perturbations by random matrices.
- They demonstrate a universal approximation theorem for the MR block.
- They show on 3 datasets that MR blocks, used in place of linear layers within 5 MIL frameworks, improve few-shot WSI classification while leveraging 3 different types of pretrained models (CONCH, UNI, ResNet50).
- They perform ablations on the 2 parts included within the MR block, which tend to show that coupling these 2 parts brings the best performance.
- They conduct several sensitivity analyses to question where to replace linear layers with MR blocks within ABMIL, how to initialize the MR blocks, and whether the MR blocks are robust to their rank hyperparameter.

Weaknesses:
- **W1: clarity.** There are several points in the paper that would benefit from clarification and/or further detail:
  - a) L63: "linear layer". For people who know the MIL literature, it is not clear at this stage which linear layers you are referring to, e.g., those included in the gated-attention layer of ABMIL or the linear classifier at the end of the architecture, which can also have an effect. This should be clarified.
  - b) The dataset used for the geometric studies reported in Fig. 1 and 5 is never mentioned.
  - c) Figure 2: the box "MIL model" explaining how the MR blocks are supposed to intervene is really not clear.
  - d) Section 2.2: it is not clear to me why the more generic few-shot learning literature (see e.g. [A]), applicable to any bag representation in MIL, is omitted from the related work. For instance, well-known prototypical networks [B] could be applied as a readout within ABMIL instead of a linear layer, and their inherent dependence on distances could be a good proxy for preserving geometric properties (a minimal sketch of such a readout follows this review). This observation underlines that it is not clear in the paper why different readout strategies are not discussed. I invite the authors to do so during the rebuttal.
  - e) Section 3.1: while both instance-level MIL and embedding/bag-level MIL are mentioned in Section 2.1, Section 3.1 only formalizes bag-based approaches. It may be sufficient to mention this in Section 3.1 with a disclaimer that only bag-based methods are benchmarked in the paper.
  - f) To improve the readability of most tables, I suggest the authors express everything in % instead of 0.x.

[A] Song, Y., Wang, T., Cai, P., Mondal, S. K., & Sahoo, J. P. (2023). A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Computing Surveys, 55(13s), 1-40.
[B] Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 30.

- **W2: benchmarks and ablations.**
  - a) Most tested datasets are significantly imbalanced, so I don't think metrics such as accuracy and AUC are the most appropriate. I believe that presenting macro F1 scores in the main paper could be sufficient to convey the main messages, potentially with AUPRC [C] included for completeness in the main paper or supplementary material.
  - b) In most ablation studies and sensitivity analyses, many strong claims are made by the authors when the results hold for at most 2 out of 3 datasets. I therefore strongly encourage the authors to include at least 2 other WSI datasets in their experiments to better support these claims.
  - c) The authors argue that a central issue of MIL methods for few-shot classification is overfitting. Nonetheless, there is a plethora of implicit or explicit regularizations (e.g., dropout, attention dropout, norm constraints) that could be envisioned. Hence the scope of the baselines chosen by the authors is not clear and should be further justified or completed by including different regularization techniques.
  - d) Hyperparameters of the benchmarked MIL models are not present in the paper and should be added.
- **W3: asymptotic analysis.** While the authors mention that their method brings smaller improvements in 16-shot WSI classification than with less supervision, it could be interesting to stress test the method with larger numbers of shots on larger datasets such as TCGA-NSCLC.

[C] McDermott, M., Zhang, H., Hansen, L., Angelotti, G., & Gallifant, J. (2024). A closer look at AUROC and AUPRC under class imbalance. Advances in Neural Information Processing Systems, 37, 44102-44163.

Questions:
I invite the authors to discuss the weaknesses mentioned above, knowing that I am really inclined to increase my initial grade. A last question:
Q1. Could the authors clarify whether there are correlations between the geometric properties of the different datasets and the results observed in the ablation studies reported in Table 2, Table 3, and Figure 3?

EditLens Prediction: Fully human-written
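Point W1-d above suggests a prototypical-network readout [B] as a distance-based alternative to a linear classifier on bag embeddings. The sketch below illustrates that suggestion under the assumption that bag embeddings come from an arbitrary MIL pooling step (e.g., ABMIL); it is the reviewer's proposed baseline, not the paper's method, and it assumes every class appears at least once in the support set.

```python
import torch


def prototypical_readout(support_embs: torch.Tensor,
                         support_labels: torch.Tensor,
                         query_embs: torch.Tensor,
                         num_classes: int) -> torch.Tensor:
    """Distance-based few-shot readout in the spirit of Snell et al. [B].

    support_embs:   (n_support, d) bag embeddings from any MIL pooling
    support_labels: (n_support,) integer class labels, every class present
    query_embs:     (n_query, d) bag embeddings to classify
    Returns logits defined as negative squared distances to class prototypes.
    """
    prototypes = torch.stack([
        support_embs[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])                                              # (num_classes, d)
    dists = torch.cdist(query_embs, prototypes) ** 2  # (n_query, num_classes)
    return -dists  # higher logit = closer to the class prototype
```

Because the readout depends only on distances between bag embeddings, it adds no trainable classifier parameters, which is why the reviewer raises it as a natural few-shot baseline.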
Exploiting Low-Dimensional Manifold of Features for Few-shot Whole Slide Image Classification

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper tackles few-shot whole-slide image classification by arguing that the root cause of overfitting lies in a geometric mismatch between pretrained pathology features and downstream linear classifiers. The authors propose a Manifold Residual block, which introduces a random geometric anchor and a trainable low-rank residual path to preserve manifold structure while reducing model capacity. Experiments across several MIL backbones show accuracy improvements and parameter reductions. The paper positions itself as introducing a geometry-aware inductive bias for few-shot computational pathology.

Strengths:
1. The paper identifies a real and practically significant issue in computational pathology. The connection between feature geometry and data efficiency is conceptually interesting and relevant to current efforts in adapting large pretrained models for medical imaging.
2. The proposed MR block is lightweight, easy to implement, and compatible with a wide range of MIL backbones. It can be viewed as a structured parameter-efficient adapter.
3. The paper reports consistent accuracy gains across multiple models with substantial parameter reductions.

Weaknesses:
Major:
1. The paper attributes few-shot overfitting to the “destruction” of pretrained feature manifolds by downstream linear layers. This interpretation is not entirely convincing. Linear mappings are expected to reshape representations to achieve class separability, which is the very purpose of a classifier. The observed overfitting could instead result from limited data or excessive model capacity rather than geometric distortion. The causal link between “destruction” and overfitting is not yet well established and could be further clarified with additional controlled experiments.
2. The proposed MR block introduces a fixed random matrix \(B\) as a geometric anchor. The t-SNE panel shows a non-trivial disagreement (~14%) in neighborhood structure (one way to quantify this is sketched after this review). From an intuitive perspective, once the input features are multiplied by \(B\), much of the pretrained manifold structure and discriminative geometry are likely disrupted. Classifying on \(XB\) rather than on the original \(X\) would likely reduce performance. Even with sufficient data, a full-rank MIL training setup might not learn to counteract this direct perturbation of pretrained features, let alone a low-rank adaptation like LoRA. In contrast, linear layers transform pretrained features into a task-relevant space in a data-driven manner. Injecting random noise in this way fits more closely with the definition of “destruction” than “preservation.”
3. It is not entirely clear why extreme few-shot WSI classification is a key constraint here. The computational bottleneck typically lies in patch-level feature extraction and pretraining, not in the downstream classifier. Moreover, pretrained slide-level feature extractors such as TITAN already exist, which weakens the motivation for emphasizing few-shot adaptation at the classifier level.

Minor:
1. The finding that pretrained pathology features exhibit low-dimensional manifold structure is broadly consistent with prior work in vision and contrastive representation learning. Classification layers are expected to transform pretrained features into spaces that better align with downstream tasks, which naturally alters the geometry.
2. The reported performance gains may primarily arise from substantial parameter reductions. The current experiments do not separate this effect from the claimed geometric contribution, making it difficult to assess which factor drives the improvement.

Questions:
1. Could the authors disentangle the improvement due to parameter reduction from the claimed geometric preservation? A control experiment using an equally sized model would clarify this.
2. Does the MR block help when training data is not extremely limited? This would clarify whether the method primarily acts as a regularizer rather than a geometry-preserving transformation.
3. Please refer to the other weaknesses for additional concerns.

EditLens Prediction: Moderately AI-edited
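Major weakness 2 above cites a roughly 14% disagreement in neighborhood structure between the original features and their fixed random projection. The review does not state how that number is computed; the sketch below shows one plausible measure, the average k-nearest-neighbor overlap between the two embeddings (using scikit-learn), with synthetic data standing in for real pathology features.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_agreement(X: np.ndarray, Y: np.ndarray, k: int = 10) -> float:
    """Average fraction of shared k-nearest neighbors between two embeddings
    of the same points. One plausible proxy for the neighborhood-structure
    agreement discussed above; the paper's exact measure is not specified.
    """
    idx_x = NearestNeighbors(n_neighbors=k + 1).fit(X) \
        .kneighbors(X, return_distance=False)[:, 1:]  # drop self-neighbor
    idx_y = NearestNeighbors(n_neighbors=k + 1).fit(Y) \
        .kneighbors(Y, return_distance=False)[:, 1:]
    overlap = [len(set(a) & set(b)) / k for a, b in zip(idx_x, idx_y)]
    return float(np.mean(overlap))


# Example: compare features X with their fixed random projection X @ B.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 512))
B = rng.standard_normal((512, 512)) / np.sqrt(512)
print(1.0 - knn_agreement(X, X @ B, k=10))  # neighborhood disagreement rate
```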