ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated | 0 (0%) | N/A | N/A | N/A
Heavily AI-edited | 0 (0%) | N/A | N/A | N/A
Moderately AI-edited | 3 (75%) | 5.33 | 2.67 | 1743
Lightly AI-edited | 1 (25%) | 2.00 | 4.00 | 2387
Fully human-written | 0 (0%) | N/A | N/A | N/A
Total | 4 (100%) | 4.50 | 3.00 | 1904
Each review below lists the submission title, ratings, review text (summary, strengths, weaknesses, questions), and EditLens prediction.
Review 1: Multimodal Dataset Distillation via Phased Teacher Models

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
- The paper addresses dataset distillation, a key topic for enabling fast and cost-effective training of student models.
- It proposes a novel dataset distillation method specifically designed to be effective in multimodal settings.
- The core contribution is using intermediate checkpoints from a teacher model to distill datasets that capture the teacher's learning trajectory.
- The method utilizes sets of teacher checkpoints to create the distilled dataset, rather than sampling from individual steps, to provide a more global view.

Strengths:
- The method demonstrates promising empirical results, achieving strong and consistent improvements over the compared baselines.
- The approach of using sets of teacher parameters (checkpoints) effectively provides the student model with a global perspective on the teacher's training dynamics.

Weaknesses:
- The method requires access to intermediate teacher checkpoints. For very large models (the bigger CLIP variants), these are often unavailable, and replicating the teacher training to generate them introduces significant computational overhead.
- The student model's architecture and specific training details are missing.

Questions: See above.

EditLens Prediction: Moderately AI-edited
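The checkpoint-set idea described in the review above is close in spirit to trajectory-matching dataset distillation, where the student trained on the synthetic data is pushed toward the endpoint of a segment of the teacher trajectory rather than a single step. The sketch below is an illustration under that assumption, not the paper's implementation; the function and argument names are hypothetical.

```python
import torch

def checkpoint_set_matching_loss(student_params, teacher_segment):
    """Normalized distance between the student's parameters and the endpoint
    of a segment of teacher checkpoints, as in trajectory-matching methods.

    student_params:  list of tensors (student parameters after some updates).
    teacher_segment: list of checkpoints; each checkpoint is a list of tensors
                     aligned with student_params. Both inputs are hypothetical.
    """
    seg_start, seg_end = teacher_segment[0], teacher_segment[-1]
    dist_to_end = sum(((s - e) ** 2).sum() for s, e in zip(student_params, seg_end))
    seg_length = sum(((a - e) ** 2).sum() for a, e in zip(seg_start, seg_end))
    # Normalizing by the segment length keeps the loss scale comparable
    # across different parts of the teacher trajectory.
    return dist_to_end / (seg_length + 1e-12)
```

In trajectory-matching pipelines, a loss of this form is typically back-propagated through the student's unrolled updates into the synthetic image-text pairs.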
Review 2: Multimodal Dataset Distillation via Phased Teacher Models

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper identifies an issue called the "phased knowledge gap" in multimodal dataset distillation, where student models fail to learn effectively from teacher models in the later stages of training. It then proposes the Phased Teacher Model with Shortcut Trajectory (PTM-ST) to address this challenge. The core idea is to decompose the distillation process into phases, using different teacher models at each stage to provide more stable guidance. Additionally, it introduces a "Shortcut Trajectory" to create a smoothed and stabilized learning path for the student model. The proposed method is evaluated on the Flickr30K and COCO datasets for image-text retrieval, demonstrating significant improvements over existing state-of-the-art methods.

Strengths:
- Novelty. The paper identifies the "phased knowledge gap" as a critical issue in multimodal dataset distillation, where student performance degrades when using teacher models from later training stages. The proposed PTM-ST framework is a novel and effective solution that addresses this problem through phased learning and trajectory stabilization.
- Strong empirical performance. The experimental results on the Flickr30K and COCO datasets show that PTM-ST consistently and significantly outperforms existing baselines across all metrics.
- Thorough analysis and ablations. The paper provides a comprehensive analysis of the problem, supported by visualizations of gradient instability and theoretical arguments for the proposed solution's stability. The ablation studies in Table 3 effectively demonstrate the contribution of each component of the PTM-ST framework (PTM, ST, and EMA) to the overall performance improvement.

Weaknesses:
- The PTM-ST framework introduces additional complexity, requiring manual specification of interpolation endpoints and matching ranges for each distillation stage. This may make the method difficult to apply to new datasets or tasks and raises concerns about hyperparameter sensitivity.
- Limited scope of evaluation: the experiments are limited to image-text retrieval. It would be beneficial to evaluate the generalizability of PTM-ST on other multimodal tasks, such as Visual Question Answering (VQA) or image captioning.

Questions:
The authors are encouraged to compare with the following NeurIPS 2025 paper: Efficient Multimodal Dataset Distillation via Generative Models.

EditLens Prediction: Moderately AI-edited
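For readers unfamiliar with the setup summarized in the review above, the phased-teacher and shortcut ideas could be sketched roughly as follows. This is a hedged illustration only: the function, its arguments, and the exact smoothing scheme are assumptions for exposition, not the authors' code.

```python
import copy

def build_phase_targets(teacher_trajectory, num_phases, ema_beta=0.9):
    """Split a saved teacher trajectory into contiguous phases and smooth each
    phase with an exponential moving average, yielding one stabilized matching
    target per distillation phase.

    teacher_trajectory: list of checkpoints, each a dict mapping parameter
                        names to tensors (a hypothetical storage format).
    """
    phase_len = max(1, len(teacher_trajectory) // num_phases)
    targets = []
    for p in range(num_phases):
        phase = teacher_trajectory[p * phase_len:(p + 1) * phase_len]
        smoothed = copy.deepcopy(phase[0])
        for ckpt in phase[1:]:
            for name in smoothed:
                # EMA over the phase damps step-to-step noise in the teacher
                # trajectory, giving a smoother "shortcut" endpoint to match.
                smoothed[name] = ema_beta * smoothed[name] + (1 - ema_beta) * ckpt[name]
        targets.append(smoothed)
    return targets
```

During phase p, the student trained on the synthetic data would then be matched against targets[p] rather than against raw late-stage checkpoints.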
Review 3: Multimodal Dataset Distillation via Phased Teacher Models

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper presents a data distillation technique that aims to compress large multimodal datasets into a small set of synthetic samples that can train new models to comparable performance for multimodal learning. The authors propose the Phased Teacher Model with Shortcut Trajectory (PTM-ST) approach, which divides the training process into phases by splitting the teacher trajectory and applying shortcut alignment between the phases, achieving a 13.5% improvement in retrieval metrics and an average gain of 9.53% on Flickr30K.

Strengths:
* Strong motivation. The authors explore phased knowledge gaps in multimodal dataset distillation by investigating the differences in distillation dynamics between multimodal and unimodal distillation techniques, offering valuable insight into modality-specific learning behavior.
* The paper proposes a simple yet effective phased distillation strategy (PTM-ST), which divides the learning process into semantic phases and aligns them with the teacher's learning dynamics via trajectory endpoint matching, improving overall distillation stability and performance.

Weaknesses:
- **Limited data scale and generalizability evidence:** The experiments are conducted on small datasets (e.g., Flickr30K, COCO) with limited computational resources, which may not capture the knowledge-gap behavior at large scale. As a result, the observed early-phase effectiveness and multi-stage training strategy may not generalize to full-scale CLIP-style pretraining.
- **Additional training cost and model complexity:** Although the method aims to improve efficiency, it still requires fully training a large teacher model and dividing its trajectory into multiple phases, which substantially increases both computational and storage costs.
- **Narrow experimental scope and weak comparison:** The experiments are confined to small benchmarks and a single teacher architecture (CLIP-base), without comparisons against stronger or larger vision–language models. This narrow setup limits the evidence that "information-gap phasing" or session partitioning is a generally effective approach.

Questions:
It remains unclear how the proposed phased-teacher strategy generalizes under large-scale training. How do the observed "phase gaps" in representation or gradient dynamics behave when the dataset size grows substantially?

EditLens Prediction: Lightly AI-edited
Review 4: Multimodal Dataset Distillation via Phased Teacher Models

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes PTM-ST, which aims to address the instability of knowledge transfer from the teacher model across different training phases in multimodal dataset distillation (MDD). The core idea is to combine PTM and ST to capture the evolving knowledge of the teacher model across training stages and to smooth its optimization trajectory, thereby enhancing the stability and effectiveness of knowledge distillation.

Strengths:
The paper conducts the first systematic analysis of the phenomenon of "phase-wise knowledge drift" in multimodal distillation, and validates its existence through both theoretical derivation and empirical evidence. The proposed method is evaluated on two mainstream datasets, Flickr30K and COCO, with consistent and significant performance gains demonstrated through comprehensive ablation studies.

Weaknesses:
1. The PTM-ST framework appears to be highly sensitive to the choices of phase partition (P), matching intervals, and endpoint selection. If these hyperparameters are not properly tuned, will the model's performance collapse or degrade significantly? Can these parameters be made adaptive rather than manually specified?
2. Your Proposition 1 relies on a second-order smoothness assumption, which is often invalid for large Transformer architectures. How do you justify the practical relevance of this assumption, and have you conducted any empirical validation or relaxation of it?
3. The proposed method still requires access to the entire teacher training trajectory, which seems computationally expensive. If only partial teacher checkpoints or commercial API-based teacher models are available, can your method still be applied effectively, or does it fundamentally depend on the full training process?

Questions:
See weaknesses.

EditLens Prediction: Moderately AI-edited
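For context on item 2 of the review above: "second-order smoothness" in optimization analyses is usually formalized as a Lipschitz condition on the gradient and/or the Hessian of the loss. The paper's exact Proposition 1 assumption is not reproduced here; the statement below is only the standard textbook form of such conditions.

```latex
% Standard smoothness conditions (not necessarily the paper's exact wording):
% L_1-Lipschitz gradient and L_2-Lipschitz Hessian of the loss \mathcal{L}.
\|\nabla \mathcal{L}(\theta_1) - \nabla \mathcal{L}(\theta_2)\| \le L_1 \,\|\theta_1 - \theta_2\|,
\qquad
\|\nabla^2 \mathcal{L}(\theta_1) - \nabla^2 \mathcal{L}(\theta_2)\| \le L_2 \,\|\theta_1 - \theta_2\|,
\quad \forall\, \theta_1, \theta_2 .
```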