ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
| --- | --- | --- | --- | --- |
| Fully AI-generated | 1 (25%) | 4.00 | 3.00 | 2833 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 2.00 | 3.00 | 1815 |
| Fully human-written | 2 (50%) | 3.00 | 4.00 | 6832 |
| Total | 4 (100%) | 3.00 | 3.50 | 4578 |
Simple Stepsizes for Quasi-Newton Methods with Global Convergence Guarantees

Soundness: 3: good. Presentation: 3: good. Contribution: 2: fair. Rating: 4: marginally below the acceptance threshold. Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

This paper introduces a new stepsize scheme for Quasi-Newton (QN) methods—Cubically Enhanced Quasi-Newton (CEQN)—which yields global non-asymptotic convergence guarantees for convex optimization problems. Classical QN methods (such as BFGS and SR1) are efficient locally but lack explicit global convergence rates without line search or strong convexity assumptions. CEQN bridges this gap by:
1. deriving a simple analytic stepsize from a cubic-regularization viewpoint;
2. achieving an O(1/k) global convergence rate, and O(1/k²) under controlled Hessian inexactness;
3. introducing an adaptive variant that automatically adjusts to the Hessian approximation quality;
4. demonstrating superior empirical performance over standard QN and cubic-regularized baselines on logistic regression tasks.

The paper cleverly connects cubic regularization and QN updates through a unified stepsize formula, which is a theoretical contribution. The resulting expression mirrors the implicit damping of cubic regularization while retaining the simplicity of standard QN updates. The authors prove explicit O(1/k²) rates under verifiable conditions on Hessian inexactness—an improvement over prior QN analyses, which often only establish asymptotic convergence. The paper also connects the quasi-Newton update to an affine-invariant analysis: the affine-invariance property is preserved both algorithmically and theoretically, aligning the work with the geometric robustness of Newton-type methods. The authors further propose two adaptive mechanisms:
1. the adaptive CEQN (Algorithm 2) adjusts the inexactness level αₖ dynamically, maintaining global convergence while avoiding manual stepsize tuning;
2. the practical variant (Algorithm 3) introduces both increasing and decreasing mechanisms, balancing theoretical rigor and empirical efficiency.
The authors also conduct numerical experiments on a9a and real-sim logistic regression, showing clear advantages in both iteration count and wall-clock time. The adaptive CEQN variants outperform classical LSR1 and cubic QN in gradient convergence and loss reduction, and the empirical results are consistent with the theoretical analysis.

There are three major weaknesses:
1. The stepsize depends on the constant L, which is determined by parameters of the objective function that are in general hard to estimate. This makes the stepsize difficult to implement in practice (see the illustrative sketch after this review).
2. Assumption 2 is too strong. In general, one cannot simply assume that the Hessian approximation is close to the exact Hessian; this should be proved. Moreover, the inexactness parameters need to be estimated with explicit lower and upper bounds, and the final convergence rates should be expressed in terms of these parameters.
3. The most important property of quasi-Newton methods is their superlinear convergence rate. The authors need to show that, under an additional strong convexity assumption, the proposed method also achieves superlinear convergence. Otherwise it has no advantage over the gradient and accelerated gradient methods: these first-order methods also attain the O(1/k) and O(1/k²) rates, and their per-iteration computational cost is smaller than that of quasi-Newton methods. Without a superlinear convergence analysis, the proposed method has no advantage over first-order methods.

There are no other questions.

EditLens prediction: Fully human-written
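To make the first weakness concrete, here is a minimal Python sketch of a damped quasi-Newton step with an analytic, cubic-regularization-style stepsize in the spirit of AICN/CEQN. The formula used below (the positive root of (L·g/2)·η² + η − 1 = 0 with g = ‖∇f(x)‖*_B) is an assumption modeled on AICN, not necessarily the paper's exact CEQN rule; its purpose is only to show where the hard-to-estimate constant L enters.

```python
import numpy as np

def damped_qn_step(x, grad, B, L):
    """Hypothetical damped quasi-Newton step with an analytic stepsize derived
    from a cubic model (AICN-style sketch; the exact CEQN formula may differ).
    B is a positive-definite Hessian approximation, L a self-concordance-type
    constant -- the quantity the reviewer flags as hard to estimate in practice."""
    d = np.linalg.solve(B, grad)          # quasi-Newton direction B^{-1} grad
    g = float(np.sqrt(grad @ d))          # ||grad||_B^* = sqrt(grad^T B^{-1} grad)
    if g == 0.0:
        return x                          # already stationary
    # positive root of (L*g/2) * eta^2 + eta - 1 = 0: heavier damping when L*g is large
    eta = (-1.0 + np.sqrt(1.0 + 2.0 * L * g)) / (L * g)
    return x - eta * d
```

Since η shrinks as L grows, an overestimated L slows progress while an underestimated L voids the guarantee, which is why the dependence on L is a practical obstacle.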
Simple Stepsizes for Quasi-Newton Methods with Global Convergence Guarantees

Soundness: 4: excellent. Presentation: 3: good. Contribution: 2: fair. Rating: 4: marginally below the acceptance threshold. Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

The paper proposes a new family of Quasi-Newton methods that combine cubic regularization and affine-invariant geometry to achieve global convergence guarantees. The core contribution is a simple explicit stepsize rule derived from the Cubically Enhanced Quasi-Newton (CEQN) framework, which yields non-asymptotic global rates of O(1/k) and, under controlled inexactness, accelerated rates of O(1/k²). The authors further develop an adaptive variant that automatically adjusts to the local level of Hessian accuracy and introduce a practical version (Algorithm 3) that allows both increasing and decreasing the regularization parameter. The paper provides theoretical analysis under the assumption of semi-strong self-concordance, along with experiments on logistic regression tasks comparing CEQN to standard quasi-Newton and cubic-regularized baselines.

Strengths:
1. The analysis is carefully derived and builds on recent developments in affine-invariant Newton and quasi-Newton methods. The results are technically sound and extend prior work on cubic regularization.
2. The paper is well organized, with a logical flow from regularization-based motivation to explicit stepsize derivation and convergence proofs.
3. The explicit rule derived from the cubic model is elegant and provides a bridge between adaptive damping and regularized Newton schemes.
4. The practical adaptive scheme (Algorithm 3) is appealing; the mechanism for increasing and decreasing the inexactness level improves robustness compared to prior purely monotone approaches.
5. The experiments, though limited in scale, demonstrate consistent improvements over standard and cubic quasi-Newton baselines, validating the practical potential of the proposed stepsize.

Weaknesses:
1. The theoretical guarantees rely on semi-strong self-concordance, which, although broader than strong convexity, still excludes many standard objectives (e.g., generic smooth convex or non-smooth losses). The discussion would benefit from clearer intuition or examples showing when this assumption holds in practice.
2. Experiments are restricted to small-scale convex problems. It would be valuable to see whether the method remains competitive in larger or mildly nonconvex settings.
3. The adaptive scheme shares structural similarities with inexact cubic-regularization methods (e.g., Kamzolov et al., 2023), and the novelty primarily lies in the affine-invariant reformulation and explicit stepsize rule rather than in a fundamentally new algorithmic idea.
4. Finally, although the stepsize rule is simple, the iteration complexity required to reach a target accuracy and the practical conditions under which the inexactness level can be sufficiently reduced are not fully characterized. This limits the interpretability of the convergence guarantees in practical settings (see the note after this review).

Questions: See weaknesses.

EditLens prediction: Fully AI-generated
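For context on the fourth weakness, the standard conversion from rates to iteration counts (not taken from the paper; the constants C₁, C₂ stand in for whatever trajectory- and inexactness-dependent factors appear in Theorems 1–2) reads:

$$
f(x_k)-f^\star \le \frac{C_1}{k} \;\Longrightarrow\; k = O\!\left(\frac{C_1}{\varepsilon}\right),
\qquad
f(x_k)-f^\star \le \frac{C_2}{k^2} \;\Longrightarrow\; k = O\!\left(\sqrt{\frac{C_2}{\varepsilon}}\right).
$$

Such counts are only informative if $C_1$, $C_2$ (and the conditions keeping the inexactness level small) can be bounded a priori, which is the gap the reviewer points to.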
Simple Stepsizes for Quasi-Newton Methods with Global Convergence Guarantees

Soundness: 2: fair. Presentation: 2: fair. Contribution: 2: fair. Rating: 2: reject. Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

The work proposes an affine-invariant quasi-Newton algorithm with carefully chosen stepsizes for optimizing convex, semi-strongly self-concordant functions. The algorithm converges globally with rate $O(1/k^2)$ initially and transitions to $O(1/k)$. One can keep the global $O(1/k^2)$ rate if the Hessian inexactness stays small along the course of the algorithm. An adaptive version of the algorithm, in which the parameter controlling the Hessian inexactness is updated at each iteration, is also proposed and is shown to have the same global rate. Numerical validation is done against other quasi-Newton methods. However, the reviewer has several major concerns about the validity of the theoretical claims and the practicality of the proposed algorithm, which seem to place the paper below the ICLR bar for acceptance.

Strengths:
1. The presentation in the manuscript is well structured.
2. The explanation for the choice of stepsize in Section 2 is very clear.
3. The coverage of recent theoretical developments in quasi-Newton methods is comprehensive.
4. The work generalizes a recent method, Affine-Invariant Cubic Newton (AICN), by allowing an inexact Hessian, which makes the newly proposed CEQN more versatile.
5. The derivation of the stepsize rule, motivated by cubic regularization with an approximate Hessian, is very nice.

Weaknesses:
1. **Lower bound in Assumption 2.** The paper concerns non-strictly convex objectives $f$, and Assumption 2 states that the Hessian $\nabla^2 f(x)$ can be under- and over-approximated by a PSD matrix $B_x$. Since $\nabla^2 f(x)$ may be singular in this setting, either the under-approximation assumption must trivialize with $\underline{\alpha}=1$, or one must allow a singular (merely PSD) Hessian approximation matrix $B_x$.
2. **On the practical verifiability of Assumption 2.** The entire convergence analysis of the proposed method hinges on Assumption 2, the relative inexactness condition $(1-\underline{\alpha})B_{x} \le \nabla^{2}f(x) \le (1+\overline{\alpha})B_{x}$. The resulting convergence rates in Theorem 1 depend directly on the inexactness parameters $\underline{\alpha}$ and $\overline{\alpha}$. However, the paper provides no practical, computationally feasible method to construct or verify an approximation $B_x$ that is guaranteed to satisfy this assumption for known (or even bounded) values of $\underline{\alpha}$ and $\overline{\alpha}$. While the experiments employ standard methods like L-BFGS, these are recursive methods that do not, in general, come with a priori guarantees on their spectral relationship to the true Hessian $\nabla^{2}f(x)$. It is unclear how these practical methods connect back to the foundational assumption of the theory. This creates a significant gap between theory and practice. For instance, the parameter selection for the non-adaptive algorithm (e.g., $\theta \ge 1+\overline{\alpha}$) is impossible to implement without a priori knowledge of $\overline{\alpha}$. The adaptive method attempts to find this level, but it does so by reacting to a failed condition rather than by verifying the assumption itself. The analysis lacks a crucial component: a practical algorithm to compute a $B_x$ that demonstrably satisfies the conditions upon which the theory is built. The authors do discuss in the appendix (L1359) that one may satisfy such an assumption if $B_k$ is chosen as the exact Hessian at a point near $x_k$, but assuming that we can compute the exact Hessian defeats the purpose of Newton-type or quasi-Newton methods. (A toy diagnostic illustrating this verifiability gap is sketched after this review.)
3. **Trajectory-dependent constants.** This is the reviewer's biggest concern about the theoretical contribution of this work. The analysis in Theorem 1 presents a convergence rate that depends on the quantity $D=\max_{k\in[0;K+1]} \|x_{k}-x_{*}\|_{B_k}$. This makes the stated convergence bound circular. A standard, a priori complexity bound should be expressed in terms of initial problem parameters (e.g., $\|x_0 - x_*\|$ and function properties) and should allow one to predict the number of iterations $K$ required to achieve a target accuracy $\epsilon$. As written, the bound in Theorem 1 cannot be used for this purpose, since the value of $D$ is unknown until after the algorithm has already run for $K$ steps. While Lemma 2 establishes a valuable descent property on the function value, this alone is insufficient to guarantee a uniform (i.e., $K$-independent) bound on $D$, especially as the norm $\| \cdot \|_{B_k}$ changes at each iteration. This reliance on a posteriori quantities seems to be a foundational part of the analysis, as it also appears in the definition of $\overline{D}$ for the adaptive method in Theorem 2 and again as $R$ in the alternative analysis of Theorem 5. The theoretical claims would be significantly strengthened if the authors could provide true a priori bounds based on a more standard analysis, such as by assuming a bounded initial level set.
4. **On the condition for the $O(k^{-2})$ rate (Corollary 2).** The paper claims that an $\mathcal{O}(k^{-2})$ rate is achievable if the inexactness satisfies the condition $\alpha_k \le L\|x_{k+1}-x_{k}\|_{B_k}$. However, this condition is inherently a posteriori. It cannot be guaranteed before taking a step, as the right-hand side of the inequality depends on the next iterate $x_{k+1}$, which is itself a result of the step taken. The choice of $\alpha_k$ influences the step, which in turn determines whether the condition is met. Therefore, one cannot enforce this condition to guarantee the accelerated rate; one can only check after the fact whether the condition was satisfied at each iteration. The paper itself notes that the condition is "verifiable in practice", which reinforces this point. As such, Corollary 2 does not provide a constructive method to achieve the $\mathcal{O}(k^{-2})$ rate, but rather identifies a specific, non-enforceable scenario under which it occurs.
5. **QN vs. Newton-type.** The authors refer to their proposed method as a "Quasi-Newton" (QN) method. While the term is used broadly, in the optimization literature "Quasi-Newton" most often refers to a specific class of methods (like BFGS, DFP, or SR1) where the Hessian approximation $B_k$ (or its inverse $H_k$) is constructed recursively from iterate and gradient differences, satisfying a secant equation. It would be more appropriate to call the proposed method "Newton-type" rather than QN. I was initially excited that this work would establish a global convergence guarantee for QN methods, but the proposed algorithm is rather of Newton type.
6. **Comparison to existing work.** In the introduction, the authors state that global convergence rates for Newton-type or quasi-Newton methods are not known in the literature --- "Despite all of the interest, even nowadays, many classical Quasi-Newton methods still lack non-asymptotic global convergence guarantees. Only recently global non-asymptotic convergence guarantees with explicit rates were established for BFGS in the strongly convex setting for specific line search procedures." Their primary goal is "to guarantee global non-asymptotic convergence guarantees for classical Quasi-Newton methods for non-strongly convex functions." Precisely this has been done in a recent work (Duan, Lyu 2025; arXiv:2509.21684) on a different but related algorithm, which relaxes Mishchenko's globally regularized Newton (Mishchenko, 2021; arXiv:2112.02089) using an overestimated Hessian approximation, also with a very simple stepsize rule. That work provides a practical randomized Hessian overestimation procedure, obtains an $O(1/k)\rightarrow O(1/k^2)$ global convergence rate (similar to Thm. 1), and gives a local convergence analysis in rank-deficient landscapes. Another related work is (Doikov, Stich, Jaggi, 2024; arXiv:2402.04843), where stronger convergence guarantees are provided assuming some global knowledge of the Hessian spectrum. The authors' goal has certainly been considered in the literature, and very relevant results have already been obtained. The authors should discuss how their results compare to these works and modify the introduction accordingly.
7. **Notational inconsistencies.** There are some inconsistencies in notation. For example, it seems that some of the $L$'s between lines 163 and 170 should be $L_2$ instead, and $\eta$ on line 201 should be $\eta_k$. At line 116, given a fixed $\mathbf{B}$, the operator norm of another matrix $\mathbf{A}$ is defined as $$\|\mathbf{A}\|_{op}=\sup_{y\in\mathbb{R}^d}\frac{\|\mathbf{A}y\|_{x}^{*}}{\|y\|_x},$$ but the right-hand side does not depend on $\mathbf{B}$. Perhaps the author(s) want to define the operator norm for a fixed reference point $x$ instead?
8. **Lack of local convergence analysis.** One benefit of Newton-type methods is that one can still obtain superlinear local convergence without processing the full Hessian. It seems unlikely that (adaptive) CEQN can have superlinear local convergence, since it uses a limited-memory Hessian approximation and the stepsize does not go to 1. In this case, the benefit of using CEQN over AGD is unclear, given that AGD also has the global rate $O(1/k^2)$.

**Minor comments:**
- L108: holds --> it holds that
- L198: an stepsise --> a stepsize
- L201: $\eta$ --> $\eta_k$
- L201: The exposition here is rough. The expression $\Vert h_k \Vert_{B_k}=\eta_k \Vert \nabla f(x) \Vert_{B_k}$ is plugged back into the definition of $\eta_k$ to yield the quadratic equation for $\eta_k$.

Questions:
1. Can the author(s) clarify the definition of the operator norm of a matrix (L115): is it with respect to a fixed point $x$ or a fixed matrix $B$? Is it a typo that the norms in the definition there should be subscripted with $B$? And in either case, what is the fixed point (or matrix) when the operator norm is invoked in Assumption 1?
2. In Theorem 1, can the author(s) upper bound $D$ by some quantity independent of $K$? Otherwise, it is not transparent that the bound in Theorem 1 gives the desired non-asymptotic global rate of $O(1/k^2)$ or $O(1/k)$. Similarly for the quantities $\overline{D}$ and $\alpha_K$ in Theorem 2.
3. Can the author(s) provide some numerical experiments of (adaptive) CEQN against AGD? I also wonder how much allowing an inexact Hessian helps with speeding up the computation. Can the author(s) also provide some experiments against AICN (https://arxiv.org/pdf/2211.00140)?

EditLens prediction: Fully human-written
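The verifiability gap raised in weakness 2 can be illustrated with a small diagnostic: for a given positive-definite approximation B, the tightest inexactness levels of Assumption 2 follow from the generalized eigenvalues of (∇²f(x), B). The sketch below is illustrative only; note that it requires the exact Hessian, which is precisely what a quasi-Newton method is trying to avoid.

```python
import numpy as np
from scipy.linalg import eigh

def inexactness_levels(H, B):
    """Smallest alpha_lo, alpha_hi such that
        (1 - alpha_lo) * B <= H <= (1 + alpha_hi) * B   (B positive definite).
    They are read off the extreme generalized eigenvalues of (H, B), i.e. the
    spectrum of B^{-1} H. Needing H itself is the point: this check is not
    available to a method that only maintains the approximation B."""
    lam = eigh(H, B, eigvals_only=True)   # generalized eigenvalues of (H, B)
    alpha_hi = max(lam.max() - 1.0, 0.0)
    alpha_lo = max(1.0 - lam.min(), 0.0)
    return alpha_lo, alpha_hi
```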
Simple Stepsizes for Quasi-Newton Methods with Global Convergence Guarantees

Soundness: 2: fair. Presentation: 3: good. Contribution: 2: fair. Rating: 2: reject. Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper proposes a stepsize schedule that guarantees global convergence of quasi-Newton methods for general convex problems. The method achieves a convergence rate of $\mathcal{O}(1/k^2)$, which matches that of the accelerated gradient method.

Strengths:
1. The paper proposes a simple stepsize schedule for the quasi-Newton iteration, inspired by the cubic-regularized Newton method, and it achieves global convergence for convex problems.
2. The authors provide theoretical justification and empirical evidence for the proposed method.

Weaknesses:
1. The main limitation of the proposed method appears to be Assumption 2, which quantifies the inexactness of the approximate Hessian with predefined parameters $\bar{\alpha}$ and $\underline{\alpha}$. However, such parameters have so far only been quantified in the existing literature (Lemma 4.1 in [1]) in the analysis of the local superlinear convergence of quasi-Newton methods for strongly convex problems.
2. The proposed method achieves a convergence rate of $\mathcal{O}(1/k^2)$, which is comparable to that of the accelerated gradient method. However, the accelerated method is considerably simpler and does not require any (inverse) Hessian information. Therefore, the advantages of the proposed method remain unclear.
3. The name of the adaptive scheme seems misleading, as this stepsize schedule behaves more like a line search method (see the sketch after this review).
4. There are some typos in the manuscript:
- Lines 135 and 139: I think it should be $f(y)$ instead of $f(x)$.
- In the denominator of (6), should the second "+" be "-"?

**References**
- [1] Rodomanov, A., & Nesterov, Y. (2022). Rates of superlinear convergence for classical quasi-Newton methods. Mathematical Programming, 194(1), 159-190.

Questions:
1. Can the analysis be extended to the Broyden family without applying Assumption 2?

EditLens prediction: Lightly AI-edited
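To make the line-search analogy from weakness 3 concrete, here is a hypothetical sketch of the increase/decrease mechanism the reviews attribute to adaptive CEQN (Algorithms 2-3): the inexactness level alpha is inflated when an acceptance test fails and shrunk after a success. Both the acceptance test (plain function decrease) and the way alpha enters the damping are placeholders, not the paper's actual conditions.

```python
import numpy as np

def adaptive_ceqn_like_step(x, f, grad_f, B, L, alpha=0.1, factor=2.0, max_tries=30):
    """Hypothetical adaptive step: retry with a larger inexactness level alpha when
    the trial step is rejected, shrink alpha after acceptance. The acceptance test
    and the use of L*(1+alpha) as an effective regularization constant are
    illustrative assumptions only."""
    g = grad_f(x)
    d = np.linalg.solve(B, g)                 # quasi-Newton direction B^{-1} grad
    gnorm = float(np.sqrt(g @ d))
    if gnorm == 0.0:
        return x, alpha
    for _ in range(max_tries):
        L_eff = L * (1.0 + alpha)             # larger alpha => heavier damping
        eta = (-1.0 + np.sqrt(1.0 + 2.0 * L_eff * gnorm)) / (L_eff * gnorm)
        x_trial = x - eta * d
        if f(x_trial) <= f(x):                # placeholder acceptance test
            return x_trial, alpha / factor    # decrease alpha on success
        alpha *= factor                       # increase alpha on failure
    return x, alpha
```

The retry loop reacting to a failed condition is exactly what reads as a backtracking line search, which is the reviewer's point.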