ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 1 (25%) | 6.00 | 3.00 | 2234 |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (75%) | 4.00 | 3.67 | 3062 |
| Total | 4 (100%) | 4.50 | 3.50 | 2855 |
Review 1

Title: Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper provides a theoretical analysis of one-layer transformers trained via gradient descent to learn from a class of teacher models with bilinear structure. The teacher models covered include convolutional layers with average pooling, graph convolutional layers on regular graphs, sparse token selection models, and group-sparse linear predictors. The authors prove that transformers with simplified "position-only" attention can recover all parameter blocks of these teacher models, achieving optimal population loss with a convergence rate of Θ(1/T). They also establish out-of-distribution generalization bounds and validate their theory through experiments on both synthetic and real-world (MNIST) data.

Strengths:
1. Unified theoretical framework: The paper identifies a fundamental bilinear structure shared across diverse learning tasks, enabling unified learning guarantees. This is a significant conceptual contribution that connects previously studied disparate settings.
2. Tight convergence guarantees: The paper establishes matching upper and lower bounds for the convergence rate of Θ(1/T), improving upon prior work.
3. Empirical alignment with theory: Synthetic experiments match the predicted slopes and show early directional alignment between the learned value matrix and the teacher weights.

Weaknesses:
1. **Limited model complexity**: The analysis is restricted to one-layer transformers with simplified "position-only" attention. While the authors justify this simplification empirically, the gap between this architecture and practical multi-layer transformers with full attention is substantial.
2. **Notation density**: The paper is extremely dense with notation, making it difficult to follow. A notation table and more intuitive explanations would improve accessibility.
3. **Scalability concerns**: The convergence time $T^*$ grows quadratically with the sequence length $D$, which may be prohibitive for longer sequences. This scalability issue is not discussed.
4. **Gap between theory and practice**: Assuming a fixed $D \times D$ orthogonal positional matrix departs from the learned or sinusoidal encodings used in practice; further clarification is needed to justify this assumption.

Questions:
See weaknesses.

EditLens Prediction: Heavily AI-edited
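For context on the Θ(1/T) claim highlighted in the review above, a minimal schematic of the quantities involved; $f_{\theta}$, $f^{\star}$, $c_1$, and $c_2$ below are generic placeholders, not the paper's exact definitions.

```latex
% Population least-squares loss (schematic): f_theta is the one-layer
% transformer student, f_star the bilinear teacher; both are placeholders.
\[
  L(\theta) \;=\; \mathbb{E}_{\mathbf{X}}\!\left[\,\big\lVert f_{\theta}(\mathbf{X}) - f^{\star}(\mathbf{X}) \big\rVert_2^2\,\right].
\]
% A convergence rate of Theta(1/T) with matching upper and lower bounds means
\[
  \frac{c_1}{T} \;\le\; L(\theta_T) - \inf_{\theta'} L(\theta') \;\le\; \frac{c_2}{T}
  \qquad \text{for constants } c_2 \ge c_1 > 0 \text{ along the gradient-descent iterates } \theta_T .
\]
```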
Review 2

Title: Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This work presents a theoretical analysis of the optimization dynamics of learning a class of teacher models with a single-layer transformer. This class of models satisfies a bilinear structure and is general enough to recover the results of previous theoretical analyses. In particular, when optimizing the population least-squares loss, the authors show a tight convergence rate, improving upon previous analyses. An additional result on mild OOD generalization as well as experimental results are provided.

Strengths:
- The results provide a tight convergence rate of $\Theta(1/T)$ when training on the population loss, improving on previous analyses while generalizing the setting.
- Experimental results for different examples of teacher models are provided, which support the theoretical analysis.

Weaknesses:
- The setting is restricted to the population loss, as well as Gaussian data.
- The generalization of teacher models is nice, but ultimately it is still the idea of sparse selection from the input.

Questions:
- Is there any example of a teacher model in this class that does not do sparse selection as a subtask? If not, what are the challenges in characterizing a class that does something other than sparse selection?
- I know this does not change the takeaway much since experimental results are provided, but what are the challenges in getting sample-complexity guarantees? I suspect it will be some online-SGD-type analysis; would the authors elaborate on this?

I am mostly curious about the first question, and I am happy to raise my score if the authors address it sufficiently.

EditLens Prediction: Fully human-written
Review 3

Title: Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The work demonstrates both theoretically and empirically that a modified version of Transformers can mimic a collection of well-known models (GCN, CNN, etc.). Briefly, this modified version replaces the separation between query and key matrices in traditional Transformers with a single parameter and introduces position-only attention weights. Importantly, the authors provide proven polynomial-time convergence rates for the gradient-descent-based minimization of the population loss for Gaussian data. On top of this, the theoretical claims are well positioned within the broader literature and clearly supported by a comprehensive empirical analysis. All in all, the main document is nicely written and provides apparently significant contributions to the theoretical understanding of Transformers. Nonetheless, the supplementary material is very dense, and it is difficult to assess the correctness of the theoretical results. Lemma C.2, for instance, states that the rows of the learned value matrix are always a positive multiple of the rows of the true value matrix, and its proof is long and convoluted. Also, it is unclear why the data's dimensionality should be larger than a polynomial function of $M$ and $K$, and why many of the presented inequalities break when $D = K$ (see weaknesses below).

Strengths:
1. The main document is very well written; the notation is clean and easy to follow.
2. The proposed convergence bounds are, AFAIK, the tightest known for this specific type of problem (which, as remarked by the authors, has been extensively studied lately).
3. The presented empirical results strongly agree with the introduced theoretical analysis.
4. Assumptions for both the theorems and the experiments are clearly stated (e.g., Gaussian data for in-distribution learning, exponential data for out-of-distribution learning, the hypothesis class for optimization, Gaussian noise for the target, etc.).

Weaknesses:
My main concern is the lack of a clear explanation for the results obtained through extensive and dense calculation, as detailed below. I will be more than happy to raise my score in case the following points are addressed.
1. Convergence rates of $\mathcal{O}(1/T)$ (non-tight) are well known for strongly convex problems (or, more broadly, for Polyak-Lojasiewicz-type problems). However, this is clearly not satisfied by the squared objective function in this work. Which other regularity conditions do the authors think have allowed these tight bounds to exist?
2. Lemma C.2, as I noted earlier, seems to be a strong result, and its demonstration is broken up into several pages with many hand-waved computations. Could the authors provide a short and intuitive explanation for this result? What is the meaning (if there is one) of $C_1$, $C_2$, and $C_3$?
3. Also, the demonstrations look invalid when $D = K$, that is, when $\mathbf{S}$ is a simple pooling matrix; this might be the case, e.g., when $g = [D]$ in Example 2.3. What is the fundamental reason for this constraint (if there is such a reason)?
4. On the same note, the bounds also seem to fail when $K = 0$, i.e., when all entries in $\mathbf{S}$ are identically null. As I understand it, this should be the easiest-to-learn scenario. Could the authors briefly discuss this restriction?
5. In line 1088, it is stated that an inequality follows from Lemma E.1, which only derives a few equalities. Could the authors please elaborate on this point?
6. The bound on $T$ in Theorem 3.1 depends linearly on the inverse of the squared norm of $\mathbf{V}^{\star}$, while that in Theorem 3.2 depends linearly on the squared norm of $\mathbf{V}^{\star}$; that is, the dependence is inverted between the two results. Why is there such a difference between the in-distribution and out-of-distribution results?
7. Could the authors provide the slopes of the linear curves fitting the tails of Figures 2(a) and 2(b)? Also, do the authors plan to release the code?

Questions:
Please refer to the questions above. On top of that, the calculations heavily rely on the Gaussianity of the data. When the data deviates slightly from a Gaussian - say, a t distribution or a Gumbel distribution - how do the authors believe the results would behave? Would the polynomial convergence rate be preserved, for instance?

EditLens Prediction: Fully human-written
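As generic background for the first numbered point in the review above: for β-smooth objectives satisfying a μ-Polyak-Lojasiewicz (PL) inequality, gradient descent in fact attains a geometric rate, which in particular implies a (non-tight) $\mathcal{O}(1/T)$ bound. The statement below is the standard result in generic notation, not tied to the paper.

```latex
% If L is beta-smooth and satisfies the PL inequality
%   (1/2) ||grad L(theta)||^2 >= mu (L(theta) - L*),
% then gradient descent with step size 1/beta satisfies
\[
  L(\theta_T) - L^{\star} \;\le\; \Big(1 - \tfrac{\mu}{\beta}\Big)^{T} \big(L(\theta_0) - L^{\star}\big).
\]
% The reviewer's point is that the squared objective here satisfies neither
% strong convexity nor a PL-type condition, so this standard route does not
% by itself explain the tight Theta(1/T) rate.
```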
Review 4

Title: Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
In this paper, the authors study how simplified one-layer transformers with position-only attention, trained via gradient descent, can learn a class of teacher models defined through a bilinear structure combined with a non-linearity. Included in this class of models are, among others, convolution layers with average pooling, graph convolution layers, sparse token selection, and group-sparse linear predictors. Several experiments on synthetic and real-world data are carried out to confirm the theoretical results.

Strengths:
The paper is well written and well organized. The proofs are complex, and they seem sensible to me, even though I could only superficially go over them.

Weaknesses:
My main concerns with the paper are the strong theoretical assumptions and what I perceive as a somewhat unfair comparison with previous works.

Regarding the first point, the assumption that the attention matrix depends only on the positions of the tokens is extremely strong. In fact, this simply means that the attention matrix does not depend on the input at all, since the positions are the same for all inputs (in Eq. (2.4), P is fixed and independent of X). With this assumption, the paper neglects what is arguably the main feature of attention, namely attending to other tokens depending on their values. With this assumption, the architecture precisely mimics the teacher class considered (compare Eq. (2.1) with Eq. (2.4)), so it is not surprising that the simplified architecture can correctly learn this class (albeit, I concede that proving this fact is still challenging).

Regarding the second point, I find the way the authors compare to some previous work quite misleading. Specifically, the authors claim that their work generalizes (Wang et al., 2024), which also considers how a simplified transformer architecture learns the sparse token selection task. However, the two works differ on two important points. Firstly, in (Wang et al., 2024), the fact that the attention focuses only on the positional information of X is shown to be learned during gradient descent, not assumed from the start. Secondly, and more importantly, in (Wang et al., 2024) the attention matrix depends on the input: in their work, the sparse subset of indices over which X is averaged is sampled randomly and given as an input to the transformer, and the transformer then selects the correct indices of X through the attention layer based on the input. In this paper, instead, the subset of indices over which the average is taken is presumably fixed before training for all inputs. This is a much simpler task, which explains why the attention is chosen to be input independent. Comparing the two works, and in particular their convergence rates, seems unfair.

Questions:
I would like the authors to address my main concerns expounded above. In particular, I'd like the authors to discuss how their work is still relevant despite the strong theoretical assumptions (in particular, the input-independent attention), and to discuss more thoroughly how their assumptions compare with those of the previous works they compare to (Wang et al., 2024; Zhang et al., 2025).

As a minor side question, I see that S in (2.1) is assumed to have all non-zero entries equal to 1/K. Can your approach be generalized to non-uniform S? What are the main difficulties in this case?

EditLens Prediction: Fully human-written
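To make the input-independence point in this review concrete, here is a minimal NumPy sketch contrasting standard attention (which depends on X) with position-only attention as the reviewer describes it. The dimensions, parameter names, and the choice of P as the identity are illustrative assumptions; Eq. (2.4) in the paper may parameterize the attention differently.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d, D = 8, 16                        # embedding dim and sequence length (illustrative)
rng = np.random.default_rng(0)
X = rng.standard_normal((d, D))     # input tokens stored as columns

# Standard single-head attention: the attention pattern is a function of X.
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
A_standard = softmax((Wq @ X).T @ (Wk @ X) / np.sqrt(d))      # (D, D), changes with X

# Position-only attention (as described in the review): the pattern is built
# from a fixed positional matrix P and a learned parameter W, so it is the
# same matrix for every input X.
P = np.eye(D)                       # fixed D x D orthogonal positional matrix (assumption)
W = rng.standard_normal((D, D))     # learned position-attention parameter (assumption)
A_position_only = softmax(P.T @ W @ P)                        # (D, D), independent of X

# Both yield outputs of the same shape, but only the first adapts to X.
V = rng.standard_normal((d, d))     # value matrix
out_standard = V @ X @ A_standard
out_position_only = V @ X @ A_position_only
```

Since `A_position_only` never varies with the input, the model effectively applies a fixed linear mixing over token positions, which is why the reviewer argues it closely mirrors the teacher's fixed selection matrix S rather than exercising input-dependent attention.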