ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 0 (0%) N/A N/A N/A
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 0 (0%) N/A N/A N/A
Lightly AI-edited 0 (0%) N/A N/A N/A
Fully human-written 4 (100%) 2.50 3.50 2296
Total 4 (100%) 2.50 3.50 2296
Individual Reviews
A Bootstrap Perspective on Stochastic Gradient Descent
Soundness: 2: fair. Presentation: 2: fair. Contribution: 1: poor. Rating: 4: marginally below the acceptance threshold. Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary: The paper studies SGD's impact on generalization for machine learning models. Based on the provided analyses, it proposes two regularization schemes, which are shown to benefit generalization on a few toy datasets.
Strengths: The question raised in the paper is important, and the paper tests a new regularization method based on the analyses and shows that it might benefit generalization.
Weaknesses: The theoretical contribution appears to be incremental, as, to my understanding, the main insights come from Smith et al. (2021). The empirical evaluation is very limited, as the results are tested only on a very specific synthetic dataset with a sparse prior and on FashionMNIST.
Questions:
1) I did not understand how the analyses are specific to SGD as opposed to non-stochastic GD. As the opening sentence of the abstract cites the difference between the generalization of GD and SGD as motivation, I would like to ask the authors to elaborate on this. How can we see from the bounds derived in the paper that SGD might outperform GD?
2) As for the regularizers, what are the novel insights made in the paper compared to Smith et al. (2021)?
EditLens Prediction: Fully human-written
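For context on question 2 above, the implicit-regularization result of Smith et al. (2021), together with its full-batch analogue from Barrett & Dherin (2021), is restated here from memory in approximate notation; it is not quoted from the submission. With step size \eta, training loss C, and m disjoint minibatch losses \hat{C}_1, \dots, \hat{C}_m per epoch, the leading-order modified losses are roughly:

```latex
\[
\tilde{C}_{\mathrm{GD}}(\theta) = C(\theta) + \frac{\eta}{4}\,\lVert \nabla C(\theta)\rVert^{2},
\qquad
\tilde{C}_{\mathrm{SGD}}(\theta)
= C(\theta) + \frac{\eta}{4m}\sum_{k=1}^{m}\lVert \nabla \hat{C}_{k}(\theta)\rVert^{2}
= \tilde{C}_{\mathrm{GD}}(\theta) + \frac{\eta}{4m}\sum_{k=1}^{m}\lVert \nabla \hat{C}_{k}(\theta) - \nabla C(\theta)\rVert^{2},
\]
```

so, in that analysis, the term distinguishing SGD from GD is a trace of the minibatch gradient covariance, which is the quantity the reviewer asks the authors to position their contribution against.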
A Bootstrap Perspective on Stochastic Gradient Descent
Soundness: 2: fair. Presentation: 2: fair. Contribution: 2: fair. Rating: 2: reject. Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary: The paper tries to understand SGD from the view of bootstrapping: SGD favors minima with a smaller variance of the stochastic gradient.
Strengths:
1. The top example in Section 2 is attractive and illustrative.
Weaknesses:
1. The presentation of the theoretical part is a bit confusing.
- The theoretical results are listed as Lemmas 1 and 2 as well as Proposition 1, without a theorem that usually serves as the center of discussion. This leaves me unsure what the main theoretical contribution of the paper is.
- The discussions after Lemmas 1 and 2 mainly explain why the lemmas hold and do not actually help with understanding the theoretical results (especially for Lemma 2, whose right-hand side has many terms).
2. My understanding is that the core of the theoretical analysis is the correspondence of Equations (6) and (7) with Equations (10) and (11), which provides a viewpoint on the implicit regularization of SGD via "bootstrapping" the gradients. However, this part lacks a comparison against GD or noisy GD.
3. To my understanding, the technical contribution is minor: Lemmas 1 and 2 are essentially Taylor expansions, and Proposition 1 is essentially the strong law of large numbers.
I would honestly confess that I do not understand all the details of the paper and would be happy to discuss with the authors, the other reviewers, and the AC. My score of 2 currently reflects my unconfident understanding. I think the intuition of the paper is good, but the theoretical part may need improvement.
Questions:
1. Can the authors show more details of the algorithm SGDwReg2, especially how the term Reg2 is estimated?
- If Reg2 is estimated exactly, then SGDwReg2 requires knowledge of the entire dataset at each minibatch update. In this case, is it possible to design an adaptation of SGD that incorporates the idea of SGDwReg2 without requiring the entire dataset?
- If Reg2 is approximated, can the authors show the details of the approximation?
2. How does the bootstrapping view compare with variance-reduction techniques like SVRG?
EditLens Prediction: Fully human-written
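To make the comparison in question 2 concrete, here is a minimal sketch of an SVRG-style update, the technique the reviewer contrasts with the bootstrap view. The helper grad_i, the hyperparameters, and the loop sizes are illustrative placeholders, not quantities from the submission; the point is only that SVRG also touches the full dataset once per outer iteration, but as a control variate for variance reduction rather than as a regularizer.

```python
import numpy as np

def svrg_epoch(theta, grad_i, n, lr=0.1, inner_steps=100, rng=None):
    """One SVRG outer iteration: snapshot the full-batch gradient once,
    then take variance-reduced stochastic steps that reuse it as a control variate."""
    rng = rng or np.random.default_rng()
    snapshot = theta.copy()
    # single full pass over the n examples, evaluated at the snapshot point
    full_grad = np.mean([grad_i(i, snapshot) for i in range(n)], axis=0)
    for _ in range(inner_steps):
        i = rng.integers(n)
        # unbiased gradient estimate whose variance shrinks as theta nears the snapshot
        g = grad_i(i, theta) - grad_i(i, snapshot) + full_grad
        theta = theta - lr * g
    return theta
```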
A Bootstrap Perspective on Stochastic Gradient Descent
Soundness: 2: fair. Presentation: 2: fair. Contribution: 1: poor. Rating: 2: reject. Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary: This paper presents a theoretical framework for understanding the generalization properties of Stochastic Gradient Descent (SGD). The authors decompose the generalization gap and introduce the concept of "algorithmic variability", which they analyze through the lens of statistical bootstrapping. Based on this decomposition, the authors construct two novel regularizers and empirically validate that their inclusion can lead to improved generalization performance on tasks including sparse regression and neural network training. However, I still have some concerns, so I lean toward rejection for now. Specifically, I am not sure whether the idea in this paper differs significantly from algorithmic stability, or whether the derivation in this paper is meaningful. See below for more details.
Strengths:
1. The paper posits that SGD uses the gradient variability (caused by mini-batch sampling) as a "bootstrap estimate".
2. The paper proves that the expected generalization gap is determined by the trace of the product of the solution's Hessian matrix and the "algorithmic variability" matrix.
3. The paper designs a new regularizer based on the theoretical findings.
4. The authors further provide empirical evidence for this regularizer.
Weaknesses:
1. [Major Concern] It seems that Assumption 2 directly leads to a small variability (Eqn 3), but the authors do not discuss this much. If so, I cannot be convinced that Eqn (3) is the dominant term compared to Eqn (4), where Eqn (4) also contains the epsilon[2, T] term.
2. [Major Concern] I am not convinced that this paper differs significantly from the line of work on algorithmic stability. The authors claim in Line 466 that this paper considers quantities "Hessian-weighted and evaluated at the solutions". It seems that algorithmic stability can cover this case with fairly minor changes: for simplicity, stability analyses typically bound the Hessian via smoothness and use the iterates to reach the solution, but starting from the definition of algorithmic stability, these steps are not necessary. The authors should provide more evidence on how this paper genuinely differs from algorithmic stability.
[Minor]
1. The authors claim that "we prove rigorously that by implicitly regularizing the trace of the gradient covariance matrix, SGD controls the algorithmic variability." According to the paper's derivation, the algorithmic variability is bounded by two components (corresponding to the latter terms in Eq. 6 and Eq. 7). While the authors convincingly connect the implicit regularization of SGD, as identified by Smith et al. (2021), to the first component (Eq. 6), they do not provide evidence or argumentation that SGD also implicitly regularizes the second component (Eq. 7). Consequently, the claim that SGD "controls the algorithmic variability" in its entirety appears to be an overstatement. This significantly limits their theoretical contribution, as the work seems to demonstrate that vanilla SGD only addresses part of the problem identified by the authors.
2. The paper's analysis of the proposed regularizers, Reg1 and Reg2, lacks sufficient depth regarding their interplay and individual utility. For instance, given that the authors identify Reg1 as an existing *implicit* regularizer of SGD, a crucial discussion is missing on the utility of its *explicit* inclusion. What is the tangible difference between applying Reg1 explicitly versus relying on its implicit effect? Would applying only Reg2, which is the component not addressed by vanilla SGD, be a more practical and principled approach? The paper would be substantially strengthened by ablation studies that dissect the individual contributions of Reg1 and Reg2 and clarify their roles in guiding SGD towards better-generalizing solutions.
3. The practical significance of this work is severely hampered by the unaddressed computational overhead of the proposed regularizers. Both Reg1 and Reg2, as defined, require the computation of the full-batch gradient at each training step. This is a prohibitive cost for large-scale datasets and fundamentally contradicts the core philosophy of SGD, which is designed precisely to avoid such computations. The absence of any discussion of this issue, or of potential efficient approximations, makes it difficult to assess the empirical value of the proposed method. As it stands, the practical guidance offered by the paper appears limited.
Questions: See above.
EditLens Prediction: Fully human-written
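As a rough illustration of the quantity discussed in strength 2 and minor concern 1 above, and only as a hedged reconstruction in generic notation rather than the paper's own statement: writing \hat{\theta} for the solution returned by the algorithm, H = \nabla^{2}\mathcal{L}(\hat{\theta}) for the Hessian at that solution, and \Sigma_{\mathrm{alg}} for the covariance of \hat{\theta} over the resampling of the data and the algorithm's randomness (the "algorithmic variability"), the dominant term described by the reviewer has, up to constants, the form

```latex
\[
\mathbb{E}\!\left[\mathcal{L}(\hat{\theta}) - \hat{\mathcal{L}}(\hat{\theta})\right]
\;\approx\; c\,\operatorname{tr}\!\left(H\,\Sigma_{\mathrm{alg}}\right) \;+\; \text{lower-order terms},
\]
```

which is why the reviewer's question of whether SGD controls all of \Sigma_{\mathrm{alg}}, or only the component tied to Eq. 6, bears directly on the generalization claim.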
A Bootstrap Perspective on Stochastic Gradient Descent
Soundness: 2: fair. Presentation: 3: good. Contribution: 1: poor. Rating: 2: reject. Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary: This paper aims to provide a novel explanation for the superior generalization of SGD compared with GD, from a bootstrap perspective. Specifically, under certain assumptions, the authors show that the generalization error can be decomposed into a dominant Hessian-preconditioned algorithmic variability term and several small terms. They further argue that the algorithmic variability is strongly correlated with the accumulated empirical covariance of gradients. As a consequence, they empirically establish that SGD regularizes the algorithmic variability as a bootstrap estimate, and hence improves the generalization error through this correlation.
Strengths: This paper is clearly written and has a nice structure.
Weaknesses: Although the authors provide an upper bound on the generalization error via algorithmic stability, the paper does not explicitly establish how SGD regularizes this term theoretically. Moreover, there is no theoretical characterization of the generalization gap between SGD and GD. Another concern arises from the assumptions: while Assumption 1 appears standard, Assumption 2 is rather demanding and may not hold in many scenarios, since existing theoretical results generally suggest that the upper bound on uniform algorithmic stability grows with the number of iterations. This implies that the bias term, rather than the variance, often dominates the generalization error. From this perspective, the argument that "SGD generalizes better because it regularizes the gradient variance" may not be entirely convincing.
Questions: No further questions.
EditLens Prediction: Fully human-written
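For reference, the kind of result the reviewer appeals to when saying that uniform stability bounds grow with the number of iterations is, for example, the convex, L-Lipschitz, \beta-smooth bound of Hardt et al. (2016), quoted here from memory, so the constants should be treated as approximate:

```latex
\[
\epsilon_{\mathrm{stab}} \;\le\; \frac{2L^{2}}{n}\sum_{t=1}^{T}\eta_{t}
\;=\; \frac{2L^{2}\,\eta\,T}{n} \quad \text{for a constant step size } \eta \le 2/\beta,
\]
```

which grows linearly in the number of iterations T, consistent with the reviewer's point that such bounds are iteration-dependent.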