ICLR 2026 - Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 0 (0%)   | N/A        | N/A            | N/A                |
| Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited | 1 (25%)  | 6.00       | 4.00           | 2710               |
| Lightly AI-edited    | 1 (25%)  | 8.00       | 2.00           | 3723               |
| Fully human-written  | 2 (50%)  | 6.00       | 3.00           | 1783               |
| Total                | 4 (100%) | 6.50       | 3.00           | 2500               |
Reviews
Review 1

Title: SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines
Soundness: 3: good
Presentation: 1: poor
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper demonstrates, from a Bayesian theoretical perspective, that using soft probabilistic outputs in Knowledge Distillation (KD) yields the best performance. For SGD-based training, it shows that when the teacher's exact Bayes Class Probabilities (BCPs) are used as targets, performance surpasses that obtained with one-hot labels. Furthermore, the paper analyzes how the error and accuracy change as noise of varying levels is added to the original BCP distribution.

Strengths:
1. Originality: The paper provides a mathematical proof, from a Bayesian theoretical perspective, explaining why using soft probabilistic outputs in KD leads to better performance under an SGD optimizer. As noted in the Related Work section, the authors generalize this theoretical result beyond special cases such as self-distillation or model compression to general classification settings, which is a clear contribution relative to prior research.
2. Quality: The paper offers mathematically rigorous derivations throughout, giving the theoretical sections a strong sense of completeness and internal consistency.
3. Clarity: The logical flow of the paper is well structured, and the authors connect theoretical findings with experimental results, presenting a coherent narrative from theory to practice. However, the extensive mathematical derivations make it somewhat difficult for readers to follow the core ideas and fully grasp the main contributions.
4. Significance: By providing a mathematically complete explanation for why soft probabilistic outputs improve performance in KD, a question that has not been sufficiently addressed in previous KD research, the paper offers meaningful theoretical significance within the field of knowledge distillation.

Weaknesses:
1. In Figure 1, it would be beneficial to quantify the amount of noise and report it explicitly. Moreover, the per-epoch generalization error and test accuracy plots appear to come from a single random seed. Running multiple seeds and reporting the variance of both quantities for the four cases would more clearly demonstrate that the true Bayes probabilities yield significantly lower variance. Additionally, instead of abstract labels such as "less noisy probabilities" or "more noisy probabilities", specifying the exact noise levels would be clearer and would also enable a direct comparison with one-hot labels (a sketch of such a multi-seed, quantified-noise comparison is given after this review).
2. Although the paper's overall contribution is meaningful, it does not go beyond showing that using BCPs as targets for the student model improves training stability through lower variance.
3. The transition from Equation (10) to Equation (11) omits too many intermediate steps, making the derivation difficult to follow. Furthermore, the mathematical expressions are overly complex, and the exposition is somewhat unfriendly and difficult to read.

Questions:
1. Does the proposed property still hold if optimizers other than SGD, such as Adam or RMSprop, are used?
2. Is the proposed method applicable only to classification tasks, or could a similar approach be extended to regression problems as well?
3. In Table 1, under what specific conditions were the tests conducted? For example, were results averaged over multiple runs with different random seeds, such as Seed 0 to Seed 5?
4. In line 272 on page 6, the paper states that "we model $\epsilon \sim P_{\epsilon}$ as zero-mean noise with variance $\nu$ and uncorrelated entries." How was the variance $\nu$ determined or chosen?

EditLens Prediction: Lightly AI-edited
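The multi-seed, explicitly quantified noise comparison requested in Weakness 1 can be sketched as follows. This is an illustrative sketch only: the 2-D Gaussian-mixture data (chosen so that the exact BCPs are available in closed form), the small MLP student, the SGD learning rate, and the Dirichlet concentration `alpha` that sets the noise level are all assumptions, not details from the paper.

```python
# Train a student with SGD on (a) one-hot targets, (b) exact Bayes class
# probabilities (BCPs), and (c) BCPs perturbed at an explicit noise level,
# then report the across-seed mean and standard deviation of test accuracy.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_data(n, seed):
    g = torch.Generator().manual_seed(seed)
    y = torch.randint(0, 2, (n,), generator=g)
    x = torch.randn(n, 2, generator=g) + 1.5 * (2 * y.float() - 1).unsqueeze(1)
    # Exact BCPs in closed form for this equal-prior Gaussian mixture:
    # log P(y=1|x) - log P(y=0|x) = 3 * (x1 + x2).
    p1 = torch.sigmoid(3.0 * x.sum(dim=1))
    bcp = torch.stack([1 - p1, p1], dim=1)
    return x, y, bcp

def soft_ce(student_logits, target_probs):
    # Cross-entropy against soft targets: -sum_k q_k log p_theta(k|x).
    return -(target_probs * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()

def run(target_kind, seed, alpha=50.0, epochs=30):
    torch.manual_seed(seed)
    x, y, bcp = make_data(2000, seed)
    xt, yt, _ = make_data(2000, seed + 1000)
    if target_kind == "one_hot":
        targets = F.one_hot(y, 2).float()
    elif target_kind == "bcp":
        targets = bcp
    else:  # "noisy_bcp": Dirichlet resampling keeps targets on the simplex;
           # smaller alpha means a larger, explicitly quantified noise level.
        targets = torch.distributions.Dirichlet(alpha * bcp + 1e-3).sample()
    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        opt.zero_grad()
        soft_ce(model(x), targets).backward()
        opt.step()
    with torch.no_grad():
        return (model(xt).argmax(dim=1) == yt).float().mean().item()

for kind in ["one_hot", "bcp", "noisy_bcp"]:
    accs = torch.tensor([run(kind, s) for s in range(6)])  # seeds 0..5
    print(f"{kind:10s} acc mean={accs.mean().item():.3f} std={accs.std().item():.3f}")
```

Reporting the across-seed standard deviation at several values of `alpha` would make the claimed variance reduction directly comparable with the one-hot baseline.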
Review 2

Title: SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper carries out a convergence analysis of SGD-based KD with a Bayesian teacher and a noisy Bayesian teacher, showing faster and more stable convergence compared to standard SGD. Based on this analysis, the paper proposes using BNNs as Bayesian teachers, either trained from scratch or converted from deterministic pretrained models. Experimental results validate the theoretical analysis and demonstrate some performance gain.

Strengths:
1. The paper studies the benefit of Bayesian teachers in KD from the perspective of SGD convergence, which is a solid and principled choice.
2. The paper connects the literature on BNNs to KD through the use of Bayesian teachers.

Weaknesses:
1. The novelty is limited. The Bayesian-teacher perspective of KD has been well studied in the literature, and many results, both theoretical and empirical, show that a Bayesian teacher is optimal for student learning. The truly important issue is how to practically obtain a Bayesian teacher; however, the paper does not place much emphasis on this issue and simply resorts to existing Bayesian deep learning methods.
2. The experimental results are not comprehensive enough: (1) all results are based on standard KD, without extending to any recent logit-based distillers; (2) no results on ImageNet are presented; (3) no results for transformer-based models, e.g., ViT, are presented.

Questions:
1. For converting a deterministic pretrained model into a BNN, is there any way to introduce stochasticity without modifying the teacher model itself (since in many cases it is not desirable or feasible to modify the teacher)? For example, through data augmentation or auxiliary probabilistic modules added to the teacher.
2. The paper refers to probability distributions that are closer to the BCPs as more "calibrated" distributions. Why not report results on model calibration, such as the expected calibration error (ECE) and reliability diagrams [1]? (A sketch of the ECE computation is given after this review.)

[1] C. Guo et al. On Calibration of Modern Neural Networks. ICML 2017.

EditLens Prediction: Fully human-written
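For Question 2, a minimal sketch of the expected calibration error in the binned form of Guo et al. [1] is below. The array shapes and the random inputs in the usage example are placeholders, not outputs of the models discussed in the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: sum over bins of |accuracy - confidence| weighted by bin mass."""
    confidences = probs.max(axis=1)       # confidence of the predicted class
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap    # in_bin.mean() is the bin's sample fraction
    return ece

# Usage example with random predictions (shape illustration only).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 10, size=1000)
print(f"ECE = {expected_calibration_error(probs, labels):.4f}")
```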
Review 3

Title: SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper analyzes SGD when the student is supervised by (i) exact Bayes class probabilities (BCPs) and (ii) noisy BCP estimates, then argues for Bayesian teachers as better BCP estimators. Core technical claims: (a) when supervising with exact BCPs, the empirical optimization interpolates and the classical SGD "neighborhood" terms vanish, enabling larger admissible stepsizes (Thms. 1–2); (b) with noisy BCPs, the convergence rates include a variance (gradient-noise) term whose magnitude depends on how close the teacher is to the true BCPs (Thms. 3–4 and Prop. 3).

Strengths:
1. Showing that the CE risk with BCP supervision shares the same minimizer as standard supervision (the Bayes posterior, with minimum equal to $H(Y \mid X)$), and then establishing interpolation for the BCP-supervised objective (Props. 1–2), is crisp and well grounded; the underlying cross-entropy decomposition is written out after this review.
2. Thms. 1–2 remove the variance neighborhood term found in standard SGD and allow a wider stepsize range, formalizing a compelling optimization advantage of distillation from *accurate* probabilities.
3. The Dirichlet perturbation appendix helps ensure the targets remain on the simplex and shows that the main conclusions persist.

Weaknesses:
1. Prop. 3 weights Jacobian norms by $1/P(y_k \mid x)$ (or $1/P(y_k \mid x)^2$ with noisy BCPs). If any class probability can be arbitrarily small, the gradient-noise bounds can blow up. You should make explicit an assumption such as $P(y_k \mid x) \ge \epsilon > 0$ (or work with smoothed targets) and reflect this in all statements depending on Eqs. (13)–(14).
2. Additive perturbations can leave the simplex. While Appendix D covers a Dirichlet alternative, the main text should either use the Dirichlet model (preferred) or state an explicit projection/renormalization step and argue that it does not break the linearity in the second argument of CE used in the proofs.
3. Prop. 2 proves interpolation for the *BCP-supervised* objective under AS4. It would help to make explicit that interpolation is in the sense of *matching the Bayes distribution* (not zero training error on one-hots). Also, connect this more tightly to Def. 1 and spell out that interpolation implies a zero gradient at every sample (Eq. (25)), which underpins the disappearance of the neighborhood terms. A short lemma bridging these steps in the main text (not only App. C) would aid readability.

Questions:
AS1/AS2/AS3 are standard in optimization but nontrivial for deep CE losses. Two concrete requests: (a) Explain where **expected smoothness** (AS3) comes from for CE with typical architectures (e.g., via bounded logits/Jacobians). (b) In Thms. 2–4 the stepsize bounds use "$\mu/(LL)$" and "$\mu/(2LL)$". Please define "$LL$" or fix the notation; it is likely $L$ versus another constant $L'$. As written, it is ambiguous.

EditLens Prediction: Moderately AI-edited
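The decomposition behind Strength 1 is the standard cross-entropy identity; written in the review's notation, with $P(\cdot \mid x)$ the Bayes posterior and $q(\cdot \mid x)$ the student's predictive distribution:

$$
R(q) \;=\; \mathbb{E}_{x}\Big[-\sum_{k} P(y_k \mid x)\,\log q(y_k \mid x)\Big]
\;=\; \mathbb{E}_{x}\Big[H\big(P(\cdot \mid x)\big)\Big] \;+\; \mathbb{E}_{x}\Big[\mathrm{KL}\big(P(\cdot \mid x)\,\big\|\,q(\cdot \mid x)\big)\Big],
$$

so the BCP-supervised risk is minimized exactly at $q(\cdot \mid x) = P(\cdot \mid x)$, with minimum value $\mathbb{E}_{x}\big[H(P(\cdot \mid x))\big] = H(Y \mid X)$. One-hot supervision has the same population minimizer because the conditional expectation of a one-hot label is $P(\cdot \mid x)$.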
Review 4

Title: SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper provides a theoretical analysis of logit-based knowledge distillation from a Bayesian perspective. The authors analyze both perfect BCPs and noisy BCPs, concluding that knowledge distillation leads to variance reduction and the removal of the neighborhood term. Based on these theoretical findings, the authors further propose using Bayesian deep learning models to improve effectiveness.

Strengths:
1. The paper is well organized, highly detailed, and balances theoretical depth with readability.
2. The theoretical analysis is supported by empirical evidence.
3. There is practical potential, as the authors also show the benefit of converting pretrained models into BNNs to improve the effectiveness of knowledge distillation (a minimal conversion sketch is given after this review).

Weaknesses:
1. (Minor) The analysis is based on SGD, but the experiments are conducted with Adam. Although this suggests that the analysis also applies to other SGD-related optimizers, it would be better to include some analysis, or at least citations, establishing such generalizability from a theoretical perspective.
2. (Minor) The experiments are based solely on image classification. Could more complex tasks, such as semantic segmentation or object detection, be included?
3. (Minor) Some related work, such as [1, 2], should be discussed.

[1] ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via α-β-Divergence. ICML 2025.
[2] f-Divergence Minimization for Sequence-Level Knowledge Distillation. ACL 2023.

Questions:
See Weaknesses.

EditLens Prediction: Fully human-written
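One standard way to realize the "converting pretrained models into BNNs" step mentioned in Strength 3 is MC dropout. The sketch below is an assumed illustration (toy teacher, arbitrary dropout rate and sample count), not the paper's actual conversion procedure.

```python
# Turn a deterministic pretrained teacher into an approximately Bayesian one by
# keeping dropout active at inference time and averaging softmax outputs over
# several stochastic forward passes to obtain soft targets for distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(          # stand-in for a pretrained teacher with dropout layers
    nn.Linear(32, 128), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(128, 10),
)

def mc_dropout_probs(model, x, n_samples=20):
    """Average softmax outputs over stochastic forward passes with dropout active."""
    model.train()                 # keep dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(n_samples)])
    return probs.mean(dim=0)      # approximate Bayesian predictive distribution

x = torch.randn(8, 32)            # a dummy batch
soft_targets = mc_dropout_probs(teacher, x)
# `soft_targets` would replace one-hot labels in the student's soft-target
# cross-entropy / KD loss, as in the earlier sketch.
print(soft_targets.shape, soft_targets.sum(dim=1))  # (8, 10), rows sum to 1
```

This variant leaves the teacher's weights untouched; a similar averaged predictive distribution could also be obtained via test-time augmentation, which is the direction Question 1 of the second review points to.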