ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (20%) | 6.00 | 3.00 | 2261 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (20%) | 6.00 | 2.00 | 3020 |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (60%) | 6.00 | 2.00 | 2991 |
| Total | 5 (100%) | 6.00 | 2.20 | 2851 |
Each entry below lists the paper title, ratings, review text, and EditLens prediction.
**Review 1: Better LMO-based Momentum Methods with Second-Order Information**

Soundness: 4 (excellent) · Presentation: 4 (excellent) · Contribution: 4 (excellent) · Rating: 8 (accept, good paper) · Confidence: 2 (willing to defend the assessment, but quite likely did not understand the central parts of the submission or is unfamiliar with some related work; math/other details not carefully checked)

Summary: The paper theoretically proves that the proposed LMO-HCM methods can achieve a convergence rate of $O(1/K^{1/3})$. This rate is superior to the $O(1/K^{1/4})$ of standard LMO momentum and matches the optimal known rate for non-convex stochastic optimization. Most critically, this $O(1/K^{1/3})$ rate is guaranteed under both **"relaxed smoothness" ($(L_0, L_1)$-smoothness)** and an **"arbitrary norm"** setting simultaneously, greatly expanding the applicability of second-order momentum methods in deep learning theory. In training experiments on MLP and LSTM models, the authors demonstrate that their methods significantly outperform standard Polyak momentum and extrapolated momentum in reducing training loss and gradient norm.

Strengths: The core contribution is extending the optimal $O(1/K^{1/3})$ convergence rate from Euclidean settings to the arbitrary-norm LMO framework, crucially under the more practical $(L_0, L_1)$ relaxed smoothness assumption. On non-convex tasks like MLPs and LSTMs, the proposed SOM-V2 and $\beta$-SOM-V2 methods consistently and significantly outperform the Polyak momentum baseline in both convergence speed and final loss, validating the theory. The method naturally applies to any norm the LMO can handle, and the experiments demonstrate this flexibility by comparing performance under both $l_2$ and $l_{\infty}$ norms.

Weaknesses: The paper does not provide results for matrix norms; it would be good to see them, although the current version is already strong enough.

Questions: The experiments compare iteration counts, but Algorithm 1 introduces a Hessian-vector product (HVP) computation at every step. In practical training (e.g., for LSTM), how does this extra computational overhead (wall-clock time) from HVPs compare to first-order methods (like Polyak momentum)? The core of the paper is the LMO framework (applicable to arbitrary norms), but the experiments mainly focus on the $l_2$ and $l_{\infty}$ norms (Figs 8, 9). How does the method perform under more complex norms that truly require an LMO, such as the matrix norms used in Muon/Scion? (A sketch of the relevant closed-form LMOs follows this review.)

EditLens Prediction: Fully human-written
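As context for the matrix-norm question above, here is a minimal numpy sketch (not code from the paper; the norm choices and the small numerical guard are illustrative assumptions) of the closed-form LMOs involved: the $l_2$ and $l_\infty$ balls used in the experiments, plus the spectral-norm ball that Muon/Scion-style updates correspond to.

```python
import numpy as np

def lmo(g, norm="l2"):
    """Solve argmin_{||x|| <= 1} <g, x> for a few norm balls with closed forms."""
    if norm == "l2":
        # l2 unit ball: the minimizer is the normalized negative gradient.
        return -g / (np.linalg.norm(g) + 1e-12)
    if norm == "linf":
        # l_inf unit ball: coordinate-wise sign descent.
        return -np.sign(g)
    if norm == "spectral":
        # Spectral-norm (operator-norm) unit ball for a matrix g:
        # the minimizer is -U V^T from the thin SVD g = U S V^T,
        # i.e. the orthogonalized direction used by Muon/Scion-style methods.
        U, _, Vt = np.linalg.svd(g, full_matrices=False)
        return -U @ Vt
    raise ValueError(f"unknown norm: {norm}")
```

Plugging the spectral-norm case into the same momentum loop is what would make the comparison the reviewer asks about concrete.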
**Review 2: Better LMO-based Momentum Methods with Second-Order Information**

Soundness: 3 (good) · Presentation: 3 (good) · Contribution: 3 (good) · Rating: 6 (marginally above the acceptance threshold) · Confidence: 2 (willing to defend the assessment, but quite likely did not understand the central parts of the submission or is unfamiliar with some related work; math/other details not carefully checked)

Summary: The paper studies momentum methods within the Linear Minimization Oracle (LMO) framework and integrates second-order (Hessian-corrected) momentum to obtain faster convergence under relaxed smoothness and arbitrary norm settings. Concretely, it adapts two known second-order momentum variants to LMO (Algorithm 1) and proves $O(K^{-1/3})$ rates in expected gradient norm, improving upon the $O(K^{-1/4})$ guarantees for LMO methods with Polyak momentum and matching best-known rates for second-order momentum in Euclidean $L$-smooth settings (Theorems 1-2). The analysis relies on symmetric $(L_0,L_1)$ gradient smoothness and, for one variant, symmetric $(M_0,M_1)$ Hessian smoothness, with unbiased, variance-bounded gradient/Hessian oracles. Empirically, on MLP and LSTM training (plus a logistic regression task in the appendix), the proposed second-order LMO methods (with and without a scaling factor $\beta_k$) outperform Polyak and extrapolated momentum in training loss and gradient norm; $\ell_\infty$ geometry yields less stable trajectories than $\ell_2$.

Strengths:
1. Establishes $O(K^{-1/3})$ convergence for LMO-based momentum with arbitrary norms and relaxed smoothness, improving on the $O(K^{-1/4})$ bound for LMO+Polyak momentum and aligning with best-known second-order momentum rates in Euclidean settings (Theorems 1-2).
2. Well-structured presentation: Algorithm 1 is easy to implement, assumptions are grouped and referenced, and Table 1 positions the results against prior LMO and second-order momentum rates; figures make the geometry ($\ell_2$ vs $\ell_\infty$) effects concrete.
3. Numerical experiments: on MLP and LSTM tasks, the proposed second-order LMO variants (including the $\beta_k$-scaled version) consistently reduce training loss and gradient norm faster than Polyak and extrapolated momentum, and the instability under $\ell_\infty$ geometry is surfaced as a useful diagnostic.

Weaknesses:
1. Empirical validation is modest in scope and scale: an MLP on a 1k-sample dataset and a PTB LSTM, with plots primarily of training loss and gradient norm; there are no validation/test metrics (e.g., perplexity), runtime, or wall-clock/throughput comparisons to quantify the extra cost of Hessian-vector products.
2. The paper cites that HVPs take "roughly the same time as computing the gradient," but does not measure this in practice; thus the compute-efficiency trade-off of the proposed methods remains unclear.
3. Finally, while related work covers STORM/MARS and other improved momentum variants, the empirical comparison omits them, limiting the practical positioning of the proposed methods relative to the strongest baselines.

Questions: See weaknesses. In particular:
1. Can you expand the empirical validation beyond an MLP on a 1k-sample dataset and a PTB LSTM?
2. Can you provide measured timings (per step/epoch) to clarify the trade-off between an HVP and computing the gradient? (A minimal timing sketch follows this review.)
3. Could you add baselines such as STORM and MARS (and other improved momentum variants) to better compare the proposed methods?

EditLens Prediction: Moderately AI-edited
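On question 2, one minimal way to measure the HVP overhead (a sketch under assumed shapes and a toy MLP, not the paper's setup or code) is to time a plain backward pass against a backward pass plus a double-backprop Hessian-vector product:

```python
import time
import torch

# Toy model and data purely for illustration; sizes are arbitrary assumptions.
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.Tanh(), torch.nn.Linear(256, 1))
params = list(model.parameters())
x, y = torch.randn(512, 256), torch.randn(512, 1)
vec = [torch.randn_like(p) for p in params]  # stand-in for the displacement x_t - x_{t-1}

def grad_only():
    loss = torch.nn.functional.mse_loss(model(x), y)
    return torch.autograd.grad(loss, params)

def grad_plus_hvp():
    loss = torch.nn.functional.mse_loss(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)  # keep graph for double backprop
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    hvp = torch.autograd.grad(dot, params)                        # (d^2 loss / d params^2) @ vec
    return grads, hvp

for name, fn in [("gradient only", grad_only), ("gradient + HVP", grad_plus_hvp)]:
    fn()  # warm-up
    t0 = time.perf_counter()
    for _ in range(50):
        fn()
    print(f"{name}: {(time.perf_counter() - t0) / 50 * 1e3:.2f} ms/step")
```

Reporting such per-step numbers alongside the iteration-count plots would directly address the compute-efficiency concern.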
**Review 3: Better LMO-based Momentum Methods with Second-Order Information**

Soundness: 2 (fair) · Presentation: 3 (good) · Contribution: 2 (fair) · Rating: 4 (marginally below the acceptance threshold) · Confidence: 3 (fairly confident; possibly did not understand some parts of the submission or is unfamiliar with some related work; math/other details not carefully checked)

Summary: This paper studies LMO-type optimizers combined with second-order momentum. LMO optimizers generalize the recently popular Muon optimizer and have the update form $x_t = x_{t-1} - \eta_t\,\mathrm{lmo}(m_t)$, where $m_t$ is the momentum buffer and $\mathrm{lmo}(g) = \mathrm{argmin}_{\|x\|\le 1} \langle g,x\rangle$, with $\|\cdot\|$ an arbitrary non-Euclidean norm. Unlike standard previous analyses of Muon and general LMO optimizers, which use standard first-order momentum $m_t = \beta_t m_{t-1} + (1-\beta_t) g_t$, this paper considers Hessian-corrected momentum with an additional Hessian-vector product $\nabla^2 f_t(x_t) (x_t-x_{t-1})$ (a toy sketch of this update family follows this review). With the additional second-order information and a second-order smoothness assumption, the paper shows that the resulting optimizer achieves an $O(1/K^{1/3})$ convergence rate, matching the optimal rate for the Euclidean norm. Finally, the paper also includes experimental results comparing the proposed algorithm with previous ones.

Strengths: This paper extends the study of LMO-type optimizers by incorporating second-order Hessian-corrected momentum. It provides a convergence analysis and proves that an LMO-type optimizer with second-order momentum achieves the optimal $O(1/K^{1/3})$ rate under second-order smoothness. Moreover, empirical experiments also show that the proposed optimizer has better performance compared to other baselines.

Weaknesses: There are two main concerns overall:

- I find the discussion related to variance reduction algorithms and the MARS scaling factor confusing and a deviation from the main part of this paper. From my understanding, variance reduction algorithms are vastly different from Hessian-corrected momentum. While this paper seems to focus on the latter, the former seems unrelated. Moreover, it is unclear to me what the role of the scaling factor $\beta_t / (1-\alpha_t)$ in the proposed Algorithm 1 is, and related discussion is absent. More importantly, in the convergence analysis (Theorems 1 and 2), $\beta_t$ is set to $1-\alpha_t$, making the scaling factor constantly one. It seems to me that adding this scaling factor is unnecessary, since it does not give any advantage and is cancelled anyway.
- The convergence rate has constants $\bar \rho, \underline \rho, \bar \theta, \underline\theta$ that are carried over from the variance bound with respect to the Euclidean norm. These constants could pick up implicit dimension dependence and significantly weaken the adaptivity to geometry when using non-Euclidean arbitrary norms. For example, consider the infinity norm $\|\cdot\|_\infty$ and its dual one-norm, which have $\underline\theta = 1/\sqrt{d}$, $\bar\theta=1$ and $\bar\rho=\sqrt{d}$, $\underline\rho=1$. With this limitation, the convergence rate does not improve on previous rates of Hessian-corrected momentum with the Euclidean norm (e.g., Salehkaleybar et al. 2022 and Tran & Cutkosky 2022). In fact, we could simply apply the identity $\underline\rho \|\cdot\|_2 \le \|\cdot\|_* \le \bar\rho \|\cdot\|_2$ to the previous Euclidean-norm bound of the form $\mathbb{E} \|\nabla F(w)\|_2 = O(1/K^{1/3})$ (e.g., Tran & Cutkosky 2022, Theorem 1) and get the same result.

A few comments on the related works:

- line 65 typo: $K^{2/5}$ -> $K^{2/7}$; also I think [1] would be a better reference for this rate, as the current reference (Cutkosky & Mehta 2021) focuses more on heavy-tailed noise and the core technique follows from [1].
- For the lower bound part, Arjevani et al. 2023 prove that the lower bound for standard smooth non-convex optimization is $O(1/K^{1/4})$ and the lower bound for mean-square smoothness (which corresponds to variance reduction algorithms such as STORM and MARS) is $O(1/K^{1/3})$. However, such lower bounds are only for first-order oracles and do not apply to results like Salehkaleybar et al. 2022 and Tran & Cutkosky 2022. Instead, the lower bound for second-order smooth problems with second-order oracles is from another paper [2], which proves a corresponding lower bound of $O(1/K^{1/3})$.
- line 220: I think [3] should also be mentioned, as it is the original source of signSGD.
- Assumption 4 typo: $\|\nabla^2\|_*$ -> $\|\nabla^2\|_{op}$.

The empirical experiments only include simple tasks with small-scale models such as an MLP and a two-layer LSTM. Larger-scale models on more complicated tasks are encouraged to better demonstrate the performance of the proposed optimizer.

[1] Cutkosky, A. and Mehta, H., "Momentum Improves Normalized SGD"
[2] Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Sekhari, A., and Sridharan, K., "Second-Order Information in Non-Convex Stochastic Optimization: Power and Limitations"
[3] Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A., "signSGD: Compressed Optimisation for Non-Convex Problems"

EditLens Prediction: Fully human-written
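To make the update family described in this review concrete, here is a toy numpy sketch (an illustration, not the paper's Algorithm 1; the momentum recursion is one common form of Hessian-corrected momentum, and the quadratic objective makes the Hessian-vector product exact). The step uses the convention $x_{t+1} = x_t + \eta_t\,\mathrm{lmo}(m_t)$, which is a descent step when the LMO is the argmin over the unit ball; the step-size schedule and $\alpha$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
A = A @ A.T / 20                      # f(x) = 0.5 x^T A x, so grad f(x) = A x and Hessian = A
x = rng.standard_normal(20)
x_prev = None
m = np.zeros_like(x)
alpha = 0.1                           # momentum mixing parameter

def lmo_l2(g):
    # argmin_{||d||_2 <= 1} <g, d> = -g / ||g||_2
    return -g / (np.linalg.norm(g) + 1e-12)

for t in range(300):
    g = A @ x                         # would be a stochastic gradient in the general setting
    if x_prev is None:
        m = g
    else:
        hvp = A @ (x - x_prev)        # Hessian-corrected term: nabla^2 f(x_t) (x_t - x_{t-1})
        m = (1 - alpha) * (m + hvp) + alpha * g   # one common Hessian-corrected momentum recursion
    eta = 0.5 / np.sqrt(t + 1)        # simple decaying step size (illustrative)
    x_prev, x = x, x + eta * lmo_l2(m)

print("final gradient norm:", np.linalg.norm(A @ x))
```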
**Review 4: Better LMO-based Momentum Methods with Second-Order Information**

Soundness: 3 (good) · Presentation: 4 (excellent) · Contribution: 3 (good) · Rating: 6 (marginally above the acceptance threshold) · Confidence: 1 (unable to assess this paper; alerted the ACs to seek an opinion from different reviewers)

Summary: The paper integrates second-order information, via Hessian-Corrected Momentum (HCM), into the LMO-based momentum framework, and provides a rigorous theoretical analysis proving that the second-order LMO methods achieve an accelerated convergence rate of $\mathcal{O}(1/K^{1/3})$ under a relaxed $(L_0, L_1)$-smoothness assumption (recalled after this review) and in an arbitrary norm setting. The authors support their theoretical findings with experiments on several non-convex problems.

Strengths:
1. The paper is well written and structured, providing a clear summary of previous methods. Moreover, it clearly introduces the existing theoretical challenges, namely mathematically integrating second-order Hessian information into the arbitrary-norm LMO framework and proving its fast $\mathcal{O}(1/K^{1/3})$ rate.
2. The paper proves that second-order methods like HCM achieve an accelerated $\mathcal{O}(1/K^{1/3})$ rate, matching the optimal known rate for this problem class and breaking the $\mathcal{O}(1/K^{1/4})$ barrier of their first-order counterparts. The theoretical contribution is non-trivial because the proofs hold under a relaxed smoothness assumption in arbitrary-norm settings.
3. Experiments on various tasks validate the theoretical claims of the paper.

Weaknesses:
1. While the paper's motivation mentions optimizers like Muon, which are used in large-scale models, the empirical validation is conducted on relatively small-scale problems.
2. The paper's theoretical guarantees are general, holding for arbitrary norms. However, the experimental validation seems limited to the $l_2$ and $l_\infty$ cases. Moreover, the $l_\infty$ case seems to show only minor benefit (see Q1).

Questions:
1. In Appendix F.7, the paper provides the analysis where the LMO is constrained by the infinity norm. Is the experiment here showing that the theoretical acceleration does not translate into an observable speed-up under $l_\infty$? If the theoretical acceleration disappears in the $l_\infty$ setting due to instability, what is the practical benefit of the arbitrary-norm guarantee? This feels like a disconnect between the theory and the experimental validation.

EditLens Prediction: Fully human-written
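For reference, one standard form of the relaxed smoothness condition referred to above; the paper under review reportedly uses a symmetric variant stated for general norms, so the exact inequality there may differ:

```latex
% (L_0, L_1)-smoothness: the local gradient Lipschitz constant may grow with the gradient norm.
\[
  \|\nabla f(x) - \nabla f(y)\| \;\le\; \bigl(L_0 + L_1 \|\nabla f(x)\|\bigr)\,\|x - y\|
  \qquad \text{whenever } \|x - y\| \le \tfrac{1}{L_1},
\]
% which recovers standard L-smoothness when L_1 = 0 (with L_0 = L).
```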
**Review 5: Better LMO-based Momentum Methods with Second-Order Information**

Soundness: 3 (good) · Presentation: 3 (good) · Contribution: 3 (good) · Rating: 6 (marginally above the acceptance threshold) · Confidence: 3 (fairly confident; possibly did not understand some parts of the submission or is unfamiliar with some related work; math/other details not carefully checked)

Summary: The paper introduces two second-order momentum variants for the Linear Minimization Oracle (LMO) framework, using Hessian-vector products as correction terms. The authors prove an improved convergence rate of $O(1/K^{1/3})$ under relaxed smoothness and arbitrary norm assumptions, extending beyond the usual Euclidean setting. Experiments on MLPs and LSTMs show consistent improvements over standard Polyak and extrapolated momentum, supporting the theoretical results.

Strengths:
* Strong theoretical contribution that raises the convergence rate of LMO-based methods from $O(1/K^{1/4})$ to $O(1/K^{1/3})$.
* Analysis covers arbitrary norms and relaxed smoothness, making the results broadly applicable to deep learning settings.
* Two well-motivated algorithmic variants, with and without Hessian smoothness, clarify trade-offs in assumptions.
* Experiments show consistent gains across tasks and norms, matching the theoretical expectations.
* Clear connection between theory and experiment, with helpful discussions and well-organized comparisons to prior work.

Weaknesses:
* Experiments are small-scale and do not test large or modern models where Hessian-vector products may become costly.
* No wall-clock or computational cost analysis to justify practical efficiency compared to first-order baselines.
* Missing comparisons with strong modern baselines like STORM, MARS, or adaptive optimizers such as Adam.
* Theoretical analysis focuses on $\beta_k = 1 - \alpha_k$, while experiments use different $\beta_k$ values without full explanation or theoretical support.
* Limited discussion of hyperparameter sensitivity and robustness, especially regarding smoothness constants and learning rates.
* Some algorithmic details and choices (norms, LMO sets, $\beta_k$ schedules) are not fully specified for reproducibility.

Questions:
1. How expensive are the Hessian-vector products in practice compared to standard gradient steps, in wall-clock time?
2. What strategy was used for choosing $\beta_k$ in the experiments, and how stable are the results to different values?
3. Could the authors include comparisons with other $O(1/K^{1/3})$ methods like STORM or MARS?
4. How do these methods scale to larger models such as Transformers or large CNNs?
5. Do the observed training improvements also appear in validation or test performance?

EditLens Prediction: Fully AI-generated