ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction     Count      Avg Rating   Avg Confidence   Avg Length (chars)
Fully AI-generated      0 (0%)     N/A          N/A              N/A
Heavily AI-edited       1 (20%)    6.00         3.00             2429
Moderately AI-edited    0 (0%)     N/A          N/A              N/A
Lightly AI-edited       0 (0%)     N/A          N/A              N/A
Fully human-written     4 (80%)    3.00         3.00             4556
Total                   5 (100%)   3.60         3.00             4130
Individual Reviews
Title: Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper addresses the theory-practice gap that arises because modern learning-rate (LR) schedulers used in deep learning are non-monotone, while classical analyses assume constant or decaying LRs. The proposed method is a probabilistic LR scheduler (PLRS) that samples the LR at each step from [L_min, L_max]; equivalently, constant-step GD with zero-mean noise whose scheduler component is multiplicative. Under standard smoothness/strict-saddle/local-convexity assumptions, the analysis provides three results: expected descent for large gradients, finite-time escape from strict saddles, and stability near local minima over a time window. Empirically, PLRS is validated across vision (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K), NLP (SQuAD v1.1, IWSLT'14), and ASR (Whisper-small on CommonVoice Hindi), alongside cosine, one-cycle, knee, constant, and multi-step schedules.

Strengths:
1. Drop-in simplicity: two bounds (L_min, L_max), no handcrafted schedule shapes, selection via the standard LR range test.
2. Stability-focused diagnostics: loss-curve plots, (L_min, L_max) sweeps under a single explicit recipe, interpretable stability signals.
3. Theoretical advance for non-monotone LRs: under standard assumptions, unified results for large-gradient descent, finite-time saddle escape, and local-minimum stability.

Weaknesses:
1. Representativeness/baselines: ImageNet-1K uses 60 epochs and disables momentum/weight decay; the main table there compares only to knee under that recipe.
2. Hard to operationalize: conditions like L_max < 1/β and the analysis assumption u_t ⟂ g(x_t); Õ(·) hides constants, limiting concrete guidance.
3. Ablation/reporting gaps: no systematic tests of sampling distribution/frequency or optimizer interactions; DenseNet-40-12/40-10 label mismatch; "Baseline" is unlabeled in Tables 4-5; the ASR std ≈ 0.0002 needs verification.
4. Missing important related studies, such as [1] and [2].

[1] Smith, Samuel L., et al. "Don't decay the learning rate, increase the batch size." arXiv preprint arXiv:1711.00489 (2017).
[2] Wu, Yanzhao, et al. "Demystifying learning rate policies for high accuracy training of deep neural networks." 2019 IEEE International Conference on Big Data (Big Data). IEEE, 2019.

Questions:
1. Beyond the range test, what conservative, repeatable rule of thumb (and instability signals/fallback) do you recommend for picking the bounds (L_min, L_max)?
2. Will you add results and default ranges for SGD+momentum+WD and AdamW+WD under mainstream schedules?

EditLens Prediction: Heavily AI-edited
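
To make the sampling rule in the review summary above concrete, here is a minimal, hypothetical sketch of a PLRS-style scheduler in a PyTorch training loop. It is not the authors' code: the class name UniformRandomLR, the bounds, and the toy model are assumptions made here for illustration only.

```python
# Hypothetical sketch (not the authors' implementation): the learning rate is
# resampled uniformly from [l_min, l_max] before every optimizer step, as the
# review summary describes.
import random

import torch


class UniformRandomLR:
    """Resample the LR of every param group from U[l_min, l_max] each step."""

    def __init__(self, optimizer, l_min, l_max):
        assert 0.0 < l_min <= l_max
        self.optimizer = optimizer
        self.l_min = l_min
        self.l_max = l_max

    def step(self):
        lr = random.uniform(self.l_min, self.l_max)  # eta_t ~ U[L_min, L_max]
        for group in self.optimizer.param_groups:
            group["lr"] = lr


# Usage sketch; the model and data are stand-in placeholders.
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
sched = UniformRandomLR(opt, l_min=1e-3, l_max=1e-1)

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
for _ in range(3):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    sched.step()  # draw eta_t for this step
    opt.step()
```
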
Title: Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes a Probabilistic Learning Rate Scheduler (PLRS): a randomized learning rate scheduler that samples the learning rate from the interval $U[L_{min}, L_{max}]$, where $L_{min}$ and $L_{max}$ are user-specified constants. The authors prove convergence with SGD under several assumptions, including smoothness, Lipschitzness of the Hessian, and boundedness of the function. They test their approach empirically on image classification (CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet-1k), question answering (SQuAD v1.1), machine translation (IWSLT'14), and speech recognition (CommonVoice 11.0) tasks, comparing to cosine annealing, one-cycle, knee, multi-step, and constant learning rates. The main selling point, as presented by the authors, is the fact that their analysis allows for non-decreasing learning rate schedules. However, I went through the proofs and I see little novelty in the theory.

Strengths:
1. The idea is simple and is presented clearly.
2. The proposed random scheduler is easy to implement.

Weaknesses:
1. First of all, the proposed analysis, which is the main contribution of the paper, is not really new. The authors use the standard descent lemma and simply rely on the assumption that the gradients are large enough to cancel out the noise term. Besides, while previous papers mostly used decreasing learning rates, decreasing rates are not really required. The descent lemma already gives $\min_{t\le T}\mathbb{E}[\Vert \nabla f(x_t)\Vert^2] \le \frac{1}{\sum_{t=0}^T \eta_t}\sum_{t=0}^T \eta_t \mathbb{E}[\Vert \nabla f(x_t)\Vert^2] \le \frac{f(x_0) - f_* + \sigma^2 \sum_{t=0}^T \eta_t^2}{\sum_{t=0}^T \eta_t}$. Obviously, this analysis does not require the stepsizes to be decreasing.
2. I am a bit surprised that the authors decided to study cyclical learning rates, as they seem to have been falling out of fashion. For instance, the cited paper of Smith (2017) has been receiving fewer citations, as can be seen on Google Scholar. It is quite well known (look at almost any paper training state-of-the-art models) that the most widely used schedulers are warmup + stable (optional) + decay (usually cosine or linear, sometimes inverse square root).
3. The numerical results clearly do not match the state of the art. For instance, on ImageNet the authors achieve 68.01% top-1 test accuracy, whereas one can achieve as much as 79.05%; see Table 5 in (Xie et al., "Unsupervised Data Augmentation for Consistency Training", 2020).
4. The question of novelty is also quite open here. For instance, randomized scaling of the update has been considered by Zhang and Cutkosky, "Random Scaling and Momentum for Non-smooth Non-convex Optimization" (2024).

Minor: I think there is a typo in the proof of Theorem 1 in equation (13): it should be $\Vert\nabla f(x_t)\Vert^2$ rather than $\Vert\nabla f(x_0)\Vert^2$.

Questions:
Where in the proofs do you show that using a random learning rate actually helps with the upper bounds?

EditLens Prediction: Fully human-written
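
The descent-lemma argument the reviewer sketches in weakness 1 can be spelled out as follows. This is an illustrative derivation added here, up to constants, under the assumptions stated in the comments; it is not taken from the paper under review.

```latex
% Illustrative derivation (up to constants), assuming f is \beta-smooth, the
% stochastic gradients are unbiased with variance at most \sigma^2, and
% \eta_t \le 1/\beta. One step of the descent lemma gives
\[
  \mathbb{E}[f(x_{t+1})]
  \;\le\; \mathbb{E}[f(x_t)]
          - \tfrac{\eta_t}{2}\,\mathbb{E}\!\left[\|\nabla f(x_t)\|^2\right]
          + \tfrac{\beta\sigma^2}{2}\,\eta_t^2 .
\]
% Summing over t = 0,\dots,T, telescoping, and dividing by \sum_t \eta_t yields
\[
  \min_{t\le T}\mathbb{E}\!\left[\|\nabla f(x_t)\|^2\right]
  \;\le\; \frac{\sum_{t=0}^{T}\eta_t\,\mathbb{E}\!\left[\|\nabla f(x_t)\|^2\right]}{\sum_{t=0}^{T}\eta_t}
  \;\le\; \frac{2\bigl(f(x_0)-f_*\bigr) + \beta\sigma^2\sum_{t=0}^{T}\eta_t^2}{\sum_{t=0}^{T}\eta_t},
\]
% which places no monotonicity requirement on the step sizes \eta_t.
```
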
Title: Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes a probabilistic learning rate scheduler (PLRS) for SGD: at each step, the LR is sampled uniformly from the interval $[L_{\min}, L_{\max}]$ and decomposed as $\eta_t = \eta_c + u_t$, where $\eta_c$ is simply the average of $L_{\min}$ and $L_{\max}$. This allows us to interpret the update as GD with multiplicative noise. The authors prove: (i) expected descent when $\|\nabla f\|$ is large (Theorem 1), (ii) high-probability escape from strict saddles within $T=\tilde O(L_{\max}^{-1/4})$ steps (Theorem 2), and (iii) stability near a local minimum for $T=\tilde O(L_{\max}^{-2}\log(1/\xi))$ steps (Theorem 3), under standard L-smoothness, Hessian-Lipschitz, bounded gradient noise, strict-saddle, bounded function value, and local strong convexity assumptions. The authors conduct small-scale experiments on CIFAR-10/Tiny-ImageNet/ImageNet, SQuAD, IWSLT'14, and CommonVoice that suggest PLRS might be competitive with cosine, one-cycle, knee, multi-step, and constant schedules.

Strengths:
1. The paper's theoretical analysis is written clearly and organized around the three regimes (large gradient; small gradient/negative curvature; neighborhood of a local minimum).
2. The empirical section is broad, covering vision, NLP, and ASR, with reasonable baselines and ablations on $L_{\min}, L_{\max}$.
3. The problem of learning rate scheduling has attracted a lot of attention recently and is quite well motivated, e.g. as in Defazio et al. (2024).

Defazio, A., Yang, X., Mehta, H., Mishchenko, K., Khaled, A., & Cutkosky, A. (2024). The road less scheduled. Advances in Neural Information Processing Systems, 37, 9974-10007.

Weaknesses:
1. The paper states that although cyclic LRs lack theory, they were "shown to be a valid hypothesis owing to the presence of many saddle points (Dauphin et al., 2014)," but as far as I can tell Dauphin et al. identify the saddle-point pathology; they do not validate the cyclic-LR hypothesis.
2. The paper claims to be "the first to theoretically prove convergence of SGD with a LR scheduler that does not conform to constant or monotonically decreasing rates" (Sec. 1.2). However, non-monotone step-size policies do have some prior theory! For example, the Distance over Gradients (DoG) algorithm of Ivgi et al. (2023) both increases and decreases the learning rate.
3. Multiplicative-noise SGD has also been analyzed in existing work (Chen et al., 2025; Jofré & Thompson, 2019; Faw et al., 2023). Beyond establishing saddle-point escape, I am not really sure what is particularly difficult in extending the prior analyses here.
4. The manuscript refers to a "cosine-based cyclic LR scheduler," but the standard cosine schedule (even with warm restarts) is a decaying schedule, not really a cyclic one. Warm restarts change the phase and amplitude, so successive "cycles" do not repeat at the same magnitude (Loshchilov & Hutter, 2017).
5. There is a big difference between "fluctuating" learning rates in practice (triangular or cyclical) and the i.i.d. learning rates drawn from a uniform distribution here.

The paper mentions "trustworthy AI" as a motivation and argues for not treating optimizers as black boxes, but I am not exactly sure how learning rates drawn uniformly from a certain range help do that, especially when we *do* already have analyses for why some learning rate schedules work well, as well as a theoretically motivated adaptive learning rate scheme that can increase the LR (Defazio et al., 2023).

Refs:
Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems (NeurIPS 27), 2933-2941.
Ivgi, M., Hinder, O., & Carmon, Y. (2023). DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule. arXiv:2302.12022.
Chen, Z., Maguluri, S. T., & Zubeldia, M. (2025). Concentration of Contractive Stochastic Approximation: Additive and Multiplicative Noise. The Annals of Applied Probability, 35(2), 1298-1352. https://doi.org/10.1214/24-AAP2143
Jofré, A., & Thompson, P. (2019). On variance reduction for stochastic smooth convex optimization with multiplicative noise. Mathematical Programming, 174(1), 253-292. https://doi.org/10.1007/s10107-018-1297-x
Faw, M., Rout, L., Caramanis, C., & Shakkottai, S. (2023). Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD. Proceedings of the 36th Annual Conference on Learning Theory (COLT 2023), PMLR 195:1-72.
Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. International Conference on Learning Representations (ICLR 2017). arXiv:1608.03983.
Defazio, A., Cutkosky, A., Mehta, H., & Mishchenko, K. (2023). Optimal linear decay learning rate schedules and further refinements. arXiv preprint arXiv:2310.07831.

Questions:
1. Please address my concerns in the weaknesses section.
2. Beyond tracking the multiplicative updates, what is the main technical novelty here compared to Jin et al. (2017)?
3. Your introduction emphasizes "fluctuating" learning rates in practice (triangular, one-cycle, cosine), but the proofs assume i.i.d. LR draws in $[L_{\min},L_{\max}]$. What breaks if the LR fluctuates according to a pre-set schedule, or if the draws are correlated?

Refs:
Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., & Jordan, M. I. (2017). How to Escape Saddle Points Efficiently. Proceedings of the 34th International Conference on Machine Learning (ICML 2017), PMLR 70:1724-1732. arXiv:1703.00887.

EditLens Prediction: Fully human-written
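
The decomposition mentioned in the summary of this review can be written out explicitly. This is an illustrative restatement added here, not text from the paper.

```latex
% Illustrative restatement: an i.i.d. uniform step size is a constant step plus
% zero-mean noise that multiplies the gradient.
\[
  \eta_t \sim U[L_{\min}, L_{\max}], \qquad
  \eta_t = \eta_c + u_t, \qquad
  \eta_c = \tfrac{L_{\min}+L_{\max}}{2}, \qquad
  u_t \sim U\!\left[-\tfrac{L_{\max}-L_{\min}}{2},\, \tfrac{L_{\max}-L_{\min}}{2}\right],
\]
\[
  x_{t+1} = x_t - \eta_t \nabla f(x_t)
          = x_t - \eta_c \nabla f(x_t) - u_t \nabla f(x_t),
  \qquad \mathbb{E}[u_t] = 0,
\]
% i.e. constant-step GD plus a noise term whose magnitude scales with the
% gradient (multiplicative noise).
```
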
Title: Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
Naive/stochastic gradient descent uses a fixed learning rate eta. Convergence to a local minimum is not guaranteed with a fixed learning rate: if the learning rate is too small, naive/stochastic gradient descent can get stuck in stationary points that are not local minima; if the learning rate is too large, convergence is not guaranteed. Typical approaches to address this are learning rate schedules that alternate between large and small values of eta. Past methods are deterministic. This paper proposes a stochastic method where the learning rate is sampled from a uniform distribution on [l, h]. Conditions are given, along with a proof, under which convergence to a local minimum is guaranteed. Experiments provide empirical evidence of the value of the new method.

Strengths:
This is outside my area of expertise, so I am not familiar with the literature on the topic. I leave it to other reviewers to assess whether this is novel in the context of the relevant literature. I am also not a math wizard, so I did not slog through the proofs; I don't believe that is necessary given my comments below. Nonetheless, I trust the authors on the validity of the proofs. If there are errors (and I don't know whether there are, nor claim that there are), they should be fixable, as I believe the general intuition is sound. I leave it to others to check the proofs.

I believe that the general idea of randomly choosing eta, instead of having a deterministic schedule, can lead to convergence guarantees. Generally, randomized algorithms have demonstrated similar benefits over deterministic algorithms for many problems. The field of deep learning has lots of aspects that are black art, where people use general intuition to solve problems rather than methods with provable properties. Choosing the optimization method and its associated hyperparameters is one of them. The field would greatly benefit from replacing this black art with methods that offer guarantees. Thus, I like and encourage this work, subject to my concerns below.

Weaknesses:
1. This is reminiscent of simulated annealing in many ways. While simulated annealing attempts to find global optima and this work aims to provide convergence guarantees to local optima, they are otherwise similar. This should be discussed and relevant results brought to bear.
2. Why sample eta from a uniform distribution? It would seem to me that, generally, to guarantee convergence, you want an eta at the low end of the distribution; you only rarely need it at the high end to avoid stationary points that are not local minima. So why not sample from a long-tailed, monotonically decreasing distribution, such as an exponential? That would seem to me to be sufficient to guarantee convergence and may be more general and faster. An interesting question would be what kind of distribution is necessary to guarantee convergence and what the relationship is between that distribution and the speed of convergence. (See the sketch after this review.)
3. All optimization methods have some hyperparameters; this one does too. For any method (this one included), convergence is guaranteed (and in fact only happens) for certain hyperparameter values. So, from a practical perspective, no advantage is demonstrated of this method over any other. What one needs to show is empirical evidence of the following form: it is easier or less error-prone to select the hyperparameters for this approach than for others. If a non-expert user arbitrarily selects l and h, are they more likely to just stumble upon values that lead to (fast) convergence? Are there easy rules of thumb one can use to select l and h that lead to (fast) convergence?
4. The empirical experiments don't add much to the paper without (3).
5. In the black art of deep learning, convergence is not necessarily important or even desirable. Techniques like early stopping explicitly choose not to seek convergence in order to avoid overfitting and get better generalization. What you really want is some sort of guarantee on solving the underlying problem.

Questions:
None.

EditLens Prediction: Fully human-written
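
A minimal sketch of the alternative raised in weakness 2 above. This is hypothetical: the truncation to [l, h], the rejection-sampling approach, and the default tail scale are illustrative choices made here, not something proposed in the paper.

```python
# Hypothetical variant: instead of a uniform draw on [l, h], draw the learning
# rate from a decreasing, long-tailed distribution so that small rates are
# common and large ones are rare.
import random


def exponential_lr(l, h, scale=None):
    """Sample a learning rate from an exponential distribution truncated to [l, h]."""
    if scale is None:
        scale = (h - l) / 5.0  # arbitrary illustrative tail scale
    while True:  # rejection sampling: redraw until the sample lands in [l, h]
        lr = l + random.expovariate(1.0 / scale)
        if lr <= h:
            return lr


# Most draws sit near l; occasional draws approach h.
print([round(exponential_lr(1e-3, 1e-1), 5) for _ in range(5)])
```
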
Title: Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes a randomized learning rate (LR) scheduler that modulates (multiplies) learning rates by uniform random noise in order to improve convergence in theory and practice. The paper consists of a theoretical part, where convergence properties are examined under assumptions on smoothness, bounded values, and other analytical conditions, and a practical part, where experiments compare the generalization performance reached after a certain amount of training against competing LR-scheduling schemes (such as the well-known one-cycle). The experiments show favorable results for the new method, as it mostly outperforms the competition.

Strengths:
The paper proposes a new LR-scheduling method that is easy to implement and seems to provide some practical gains over commonly practiced scheduling techniques, which is impressive given the overall amount of research that has gone into making these schemes popular. As far as I was able to check, the results reported for the "smaller" benchmarks like CIFAR-X are close to actual SOTA for the architectures (for ImageNet, I am not sure; see below). The paper also makes some effort to provide theoretical justifications for convergence guarantees (in particular, concerning avoidance of saddle points). The method also seems to have some merit from an intuitive point of view: noisy gradients can help with generalization in multiple ways (batch noise creates drift away from non-generalizing local minima, gradient noise can affect margins), so, at least on the surface, randomizing step sizes could help in some way.

Weaknesses:
I should first say that I have not actively worked on convergence analysis for any such numerical scheme, so my opinion is not an expert one. That said, I found the theoretical part less convincing than the practical results. First of all, a number of strong assumptions have to be made, many of which are not met in the practical examples (ReLU networks employed in VGG or ResNet variants are only continuous and do not have any higher-order smoothness properties, and up to the classification layers, rectifier networks are not bounded). It is also my understanding that practical networks are not strongly convex in the stated sense for typical training data but contain degenerate "valleys" of constant, locally optimal training loss (which seems to be somewhat related to why they generalize so well). I would assume that strong assumptions are a common modeling tool in the related literature, so this might still be a useful modeling assumption even if not fully realistic. In that case, it would be nice to position it explicitly as such.

Then, in the proof sketches (I have not tried to verify the actual proofs in the appendix), it seems to me that these assumptions play a critical role. In particular, in my (perhaps superficial) understanding, saddle points are escaped because, if there is one direction left with an escape gradient, smoothness transfers this into the neighborhood of other points and directions (otherwise, the Hessian-level smoothness would not hold). However, this mechanism could not and would not work at all in a non-smooth network with ReLUs, so the model seems insufficient to explain the practical success of the method. I was also not able to see where the main idea (randomization) actually brings in its benefits; wouldn't the same arguments also hold for a deterministic scheme with a small but fixed LR? I would not be surprised if my understanding is rather incomplete due to limited prior knowledge of these kinds of issues, but it would be very helpful to end Section 3 (Theory) with one or two paragraphs tying the theoretical models more closely to the effects one would expect to see, and that would help, in a practical scenario.

Finally, concerning the theory part, I also had some issues with the formulation of some of the definitions. It seems to me that some of them only make sense with implicit assumptions added that are not stated in the text. Definition 4 seems to define a saddle point as a point with a negative eigenvalue in the Hessian (due to the constants being arbitrary, no other restrictions would apply for a twice-differentiable function); I would guess that the existence of positive eigenvalues is implied, and that the constants will later be looked at and judged. Definition 6: "f possesses the strict saddle property at all x if x fulfills..." Probably that should be f? But then, how can x be close to a local minimum when we are talking about the whole function? I could not find an interpretation of the text that seemed useful to me.

In the practical part, I found the results encouraging, but I was a bit surprised about the ImageNet-1K performance: the original 2016 ResNet paper by He et al. reports 5.25% top-5 and 20.74% top-1 error, clearly better than Knee and PLRS. While the 2016 paper might have used more epochs, it is still worth discussing SOTA for ImageNet (I consider this benchmark important as it is more realistic; obviously, computational costs might be prohibitive for parameter tuning so that a lower-level baseline is all one can get, but this should be discussed/explained).

Overall, there seems to be some potential, and the parts I am critical of are in an area where I am not very experienced; due to open questions on my side, I would for now maintain a slightly skeptical rating at rather low confidence.

Questions:
I had some questions about the theoretical part (see above). The most important aspect for me would be to understand more clearly "why" the method should work. Can you connect the formal analysis to actual advantages in a (realistic model of a) practical scenario? In particular, why is this specific type of randomization so useful that it can beat all the other methods?

EditLens Prediction: Fully human-written