ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 2.00 | 4.00 | 1891 |
| Fully human-written | 3 (75%) | 3.33 | 3.00 | 3612 |
| Total | 4 (100%) | 3.00 | 3.25 | 3182 |
Review fields: Title, Ratings, Review Text, EditLens Prediction
Unraveling Syntax: How Language Models Learn Context-Free Grammars

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper investigates how Transformer-based autoregressive models learn languages generated by Probabilistic Context-Free Grammars (PCFGs). The authors provide a theoretical framework showing that the Kullback-Leibler (KL) divergence loss decomposes across the PCFG's subgrammar structure. Empirically, using small Transformer models, they demonstrate that models reduce loss across all subgrammars in parallel, rather than mastering simple structures first. Their experiments also reveal that these models perform poorly on deeply recursive structures and that pretraining on subgrammars can improve the final loss for these smaller models.

Strengths:
1. The paper provides an elegant theoretical derivation (Theorem 4.6) that helps explain why Transformers perform poorly on deeply nested grammatical structures. The recursive formula for KL divergence based on expected recurrence is a strong conceptual contribution.
2. The experiments show that Transformer models reduce loss across all subgrammars simultaneously, which contrasts with the developmental stages observed in human language acquisition.
3. The work includes a clear set of experiments on small models demonstrating that subgrammar pretraining can improve final performance, offering a practical insight for training on structured data.

Weaknesses:
1. The key "understands composition" assumption, which forms the basis for Corollary 4.5 and Theorem 4.6, is very strong. The paper does not provide sufficient empirical validation to determine to what extent this assumption holds for practical, trained models, which limits the applicability of these specific theorems.
2. The experiments on Large Language Models (LLMs) are limited to anecdotal tests on "ChatGPT-5-Instant". This experiment may conflate two different tasks: calculating the result of an arithmetic expression and modeling the grammar of that expression. The task of calculating a result is not equivalent to the task of generating an expression from a grammar, especially from a model-training perspective. Therefore, the difficulty an already-trained model shows in calculating a deeply recursive arithmetic expression is not direct evidence that a Transformer-based model would struggle to be trained on a deeply recursive PCFG (the paper's primary claim).
3. The comparison in Section 5 between pretrained models and those trained from scratch seems unfair. Pretraining on a subgrammar essentially gives the model a strong hint, or an "inductive bias," about the PCFG's correct structure. The baseline model, however, gets no such hints and must discover this hidden structure all on its own. Therefore, the finding that pretraining improves the final loss for smaller models might primarily demonstrate the benefit of curriculum learning and injecting this strong structural prior.

Questions:
1. There appear to be minor typos in the derivation in Section 4.2 (lines 211-213, page 4), in the second line of the equation. The correct expansion of $-\log Q_\theta(\alpha a\beta)$ should be $-\log Q_\theta(\alpha\ldots) - \log Q_\theta(a \mid \alpha) - \log Q_\theta(\beta \mid \alpha a)$; the current text incorrectly uses positive signs for the second and third components. Additionally, the conditioning context for the final term appears as $\beta \mid a\alpha$, which seems to be a notational typo for the correct $\beta \mid \alpha a$. In the third term of the equation on line 214, the probability $P(a)$ should likely be written as $P_A(a)$. (A worked restatement of the corrected expansion follows this review.)
2. ChatGPT-5-Instant and ChatGPT-5-Thinking do not appear to be the formal names of these two models.

EditLens Prediction: Fully human-written
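For reference, the correction in question 1 above is just the chain-rule factorization of the model's sequence probability. A minimal restatement in LaTeX, keeping the review's notation and assuming the paper factorizes $Q_\theta(\alpha a\beta)$ in this standard way:

```latex
% Chain-rule factorization assumed for the string \alpha a \beta
% (notation follows the review; the exact statement in the paper's Section 4.2 is assumed)
\begin{align*}
  Q_\theta(\alpha a \beta)
    &= Q_\theta(\alpha\ldots)\, Q_\theta(a \mid \alpha)\, Q_\theta(\beta \mid \alpha a), \\
  -\log Q_\theta(\alpha a \beta)
    &= -\log Q_\theta(\alpha\ldots) - \log Q_\theta(a \mid \alpha) - \log Q_\theta(\beta \mid \alpha a).
\end{align*}
```

That is, all three terms enter with a negative sign, and the final term conditions on $\alpha a$, not $a\alpha$.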
Unraveling Syntax: How Language Models Learn Context-Free Grammars

Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper investigates how transformers learn probabilistic context-free grammars (PCFGs). It first derives several theoretical properties describing how the KL divergence of PCFGs decomposes over subgrammars. Using this framework, the authors show that transformers learn all grammatical components in parallel, unlike humans, who acquire syntax hierarchically. Experiments with small transformers confirm these findings and demonstrate that pretraining on subgrammars improves performance and yields more structured internal representations, though the benefit decreases for larger models. Finally, the study shows that models can manage long but shallow dependencies yet struggle with deeply recursive syntax, a limitation also observed in large language models.

Strengths:
It is important to understand how transformers learn CFGs. Although previous work (Allen-Zhu et al. 2023) has studied in detail the mechanism by which transformers implement CFGs, the analysis of training dynamics remains unexplored. The paper studies this important problem and reveals an interesting phenomenon: transformers learn all subgrammars simultaneously.

Weaknesses:
Although the topic is important, the results feel incomplete and lack depth.
1. The theoretical part at the beginning is quite trivial and unnecessary for understanding the dynamics. The first main result only appears at the end of page 6, making the presentation hard to follow.
2. Regarding how transformers implement CFGs, Allen-Zhu et al. (2023) have already provided a detailed and insightful analysis that largely covers the results here. The paper does not clearly explain its own contribution beyond that work.
3. The only genuinely new finding seems to be that transformers learn all subgrammars simultaneously, but the analysis is shallow. I would like to see more detailed investigations and ablations to uncover the mechanisms behind this behavior.

Questions:
None.

EditLens Prediction: Lightly AI-edited
Unraveling Syntax: How Language Models Learn Context-Free Grammars

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper aims to study how transformers learn context-free grammars as a step toward a theory of language modelling. The authors claim that transformer training on context-free grammars can be understood by decomposing the cross-entropy loss into contributions from subgrammars. Under the assumption that a language model "understands composition", the authors derive a decomposition of the loss that elucidates the role of subgrammars and their recursive occurrence. This is an interesting and timely direction, but the contribution is unclear to me. Due to this lack of clarity, it is impossible for me to properly judge this contribution unless some of my doubts are addressed first (see below).

Strengths:
This is an ambitious and relevant topic, and I do believe that studying how language models acquire simplified formal languages is the path towards a predictive theory of large language models.

Weaknesses:

### Relation to previous literature

First, the claim that the authors' work "initiates the study of the learning dynamics of transformers trained on PCFGs" is wrong. There are at least two groups contributing in this direction: https://arxiv.org/abs/2406.00048 (see also https://journals.aps.org/prx/abstract/10.1103/PhysRevX.14.031001 and https://arxiv.org/abs/2505.07070) and https://arxiv.org/abs/2305.13673v4 (one iteration, https://openreview.net/forum?id=qnbLGV9oFL, also shares a portion of the title with the present submission). These contributions should be acknowledged and their relation to the present paper discussed.

---

### Clarification of the exact contribution

If I understand correctly, the main technical contribution is the proposed loss decomposition. What is the intended consequence of this result? For instance, the authors claim that the decomposition implies that the model learns all subgrammars in parallel. However, a decomposition of the population loss can only constrain the location of the final optimum, not the dynamics of optimisation. In particular, it has no direct implications for convergence rates, sample complexities, or learning order. What additional assumptions would be required to justify the claim of "parallel learning"?

---

### Technical claims

1. **Inconsistent notation (pp. 4–5).** Symbols change mid-derivation (e.g., $P(\alpha\dots)$ vs $P(\alpha\mid\varepsilon)$), and in *Definition 4.2* the dependence of the summand on $a$ is unclear. Please fix the notation globally and clarify how $a$ enters the RHS of Def. 4.2.
2. **Application of Definition 4.2.** When $A$ is a fixed string (as in $D_{\mathrm{KL}}(P\Vert Q)_{\alpha}$), I cannot reproduce the final equation on p. 4. An explicit worked example would help verify the derivation.
3. **KL decomposition of Theorem 4.3 from Definition 4.2.** Please show the full derivation of the KL decomposition.
4. **"Understanding compositionality."** This assumption is too vague. Does it hold before, during, or after training, and what measurable property defines it?
5. **Implications of Theorem 4.6.** Specify whether the result concerns the population optimum or predicts learning dynamics (rates, ordering, or sample complexity). If the latter, state the additional assumptions required for "parallel learning."

---

### Empirical validation

The empirical section does not clearly test the theoretical claims. If the main statement is that the overall loss equals the sum of the losses over subgrammars, this should be explicitly validated in a figure, rather than requiring the reader to infer it by visually summing multiple curves (a minimal sketch of such a check appears after this review). Moreover, the role of Figure 2 is unclear relative to Figure 1: what new information does it convey? Finally, the dependence on the recursion number $R$ (or rather its expected value) is never systematically explored or quantified.

Questions:
See weaknesses section.

EditLens Prediction: Fully human-written
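As a hypothetical illustration of the check requested under "Empirical validation", a minimal sketch in Python, assuming a total-loss curve and per-subgrammar KL curves have been logged at common training steps. The curves below are stand-in data, not results from the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical logged curves: a separately measured total loss and one KL curve
# per subgrammar, all evaluated at the same training steps.
steps = np.arange(0, 10_000, 100)
rng = np.random.default_rng(0)

# Stand-in data: three decaying per-subgrammar losses.
sub_losses = {
    f"subgrammar_{i}": 2.0 / (i + 1) * np.exp(-steps / (2000 * (i + 1)))
    for i in range(3)
}
summed = np.sum(np.stack(list(sub_losses.values())), axis=0)

# Stand-in for the independently measured total loss (here: the sum plus noise,
# so the curves nearly coincide; with real training logs this is the actual check).
total_loss = summed + rng.normal(0.0, 0.01, size=steps.shape)

plt.figure(figsize=(5, 3.5))
plt.plot(steps, total_loss, label="total loss (measured)", lw=2)
plt.plot(steps, summed, "--", label="sum of subgrammar losses", lw=2)
plt.xlabel("training step")
plt.ylabel("KL / cross-entropy loss")
plt.legend()
plt.tight_layout()
plt.show()

# The pointwise gap quantifies how well the claimed decomposition holds empirically.
print("max |total - sum|:", float(np.max(np.abs(total_loss - summed))))
```

Plotting the two curves on one axis (or their pointwise gap) makes the decomposition claim directly visible, rather than leaving the reader to sum curves by eye.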
Unraveling Syntax: How Language Models Learn Context-Free Grammars

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The work analyses language models' ability to learn PCFGs. The authors introduce a subgrammar-based framework for probing an architecture's learnability, and derive a subgrammar-decomposed KL divergence. They find that the loss decreases in parallel over the subgrammars. The work also considers the impact of pretraining and the depth of the grammars.

Strengths:
The idea of studying the learnability of subgrammars is sound and interesting, and using formal languages is by now a common approach to studying learnability. Using hierarchical subgrammars is natural, and the decomposition of the KL divergence is as expected. The pedagogical approach of the paper, laying out standard definitions and derivations, is welcome, as is giving "worked examples" of e.g. the KL decomposition at the bottom of page 4. The three kinds of analysis are nice (albeit a bit shallow individually).

Weaknesses:
The work overstates the novelty of using CFGs to probe language models' learnability, e.g. in the introduction: "this work initiates the theoretical and empirical study of training dynamics over CFGs", and this is repeated in many places. See for instance Sennhauser and Berwick 2018 (Evaluating the Ability of LSTMs to Learn Context-Free Grammars), or Allen-Zhu et al. 2023 (How Language Models Learn Context-Free Grammars).

Another drawback is that the grammars are tiny, and the model sizes are not ablated against growing grammars. I understand a framework is being suggested, but to justify the framework it would be good to do some kind of scaling. It does not have to cover very large models, or CFGs with thousands of rules, but sampling grammars (and averaging over them) for grammar sizes of e.g. 10, 100, 200, 400, 800 rules would indicate whether the findings are due to the simplicity of the languages or not (a sketch of such a sampling setup follows this review), especially in a section titled "How Language Models Learn".

In line 301 you say you train transformers on several synthetic CFGs, but the settings are missing or not referenced, i.e. architecture and training parameters, grammar sizes. In line 379 you make a claim about an effect that diminishes as model size and complexity increase, but it seems to be backed by a single example. The ChatGPT-5-Instant experiment is unclear: what is the setup? It is also missing a citation or description.

I recommend the authors fix the overly bold claims, smoothen the writing a bit, consider the related literature more carefully, and strengthen the empirical study when revising their work. Ensuring full reproducibility is also important: make sure hyperparameters and configurations are reported and referenced.

Minor:
- In the introduction you state that training an LLM for studying its training dynamics is unfeasible. Even so, there is plenty of room to study learnability without reaching SOTA, and no reason to believe the findings don't apply to the top models.
- You could condense the standard definitions (KL, entropy, MLE, etc.) a bit as prose, but I'm OK with the definition environment.
- Please number equations (use the equation or align environments); it's hard to refer to them otherwise.

Questions:
In the activation-space analysis you show that pretraining on in-domain data helps. How would you take this further, i.e. do you see options for using more complex/gradual curricula? The deeper-recursion experiment is interesting; do you see a way to connect it to the curriculum approach?

EditLens Prediction: Fully human-written
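As a hypothetical illustration of the grammar-size sweep suggested in the weaknesses above, a minimal sketch in Python that samples random PCFGs with a growing number of rules and draws strings from them. The construction (number of nonterminals/terminals, rule shapes, weight assignment, depth cap) is an assumption for illustration, not the paper's setup; in a real experiment one would train a model on each sampled grammar and average the learning curves over grammars of the same size.

```python
import random

def sample_pcfg(n_rules, n_nonterminals=10, n_terminals=20, seed=0):
    """Sample a random PCFG with roughly n_rules productions.

    Each production rewrites a nonterminal into 1-3 symbols (terminals or
    nonterminals); rule weights are drawn uniformly and treated as unnormalised
    probabilities per nonterminal.
    """
    rng = random.Random(seed)
    nts = [f"N{i}" for i in range(n_nonterminals)]
    ts = [f"t{i}" for i in range(n_terminals)]
    rules = {nt: [] for nt in nts}
    for _ in range(n_rules):
        lhs = rng.choice(nts)
        rhs = [rng.choice(ts + nts) if rng.random() < 0.5 else rng.choice(ts)
               for _ in range(rng.randint(1, 3))]
        rules[lhs].append(rhs)
    # Guarantee every nonterminal has at least one all-terminal rule,
    # so sampling can always terminate.
    for nt in nts:
        rules[nt].append([rng.choice(ts)])
    weights = {nt: [rng.random() for _ in rs] for nt, rs in rules.items()}
    return rules, weights, nts[0]

def sample_string(rules, weights, symbol, rng, depth=0, max_depth=30):
    """Expand `symbol` top-down; beyond max_depth, force terminal-only rules."""
    if symbol not in rules:          # terminal symbol
        return [symbol]
    candidates = list(zip(rules[symbol], weights[symbol]))
    if depth >= max_depth:           # keep samples finite
        candidates = [(r, w) for r, w in candidates
                      if all(s not in rules for s in r)] or candidates[-1:]
    rhs = rng.choices([r for r, _ in candidates],
                      weights=[w for _, w in candidates], k=1)[0]
    out = []
    for s in rhs:
        out.extend(sample_string(rules, weights, s, rng, depth + 1, max_depth))
    return out

# Sweep the grammar sizes suggested in the review and draw a few strings each.
for n_rules in (10, 100, 200, 400, 800):
    rules, weights, start = sample_pcfg(n_rules, seed=n_rules)
    rng = random.Random(1)
    corpus = [" ".join(sample_string(rules, weights, start, rng)) for _ in range(5)]
    print(n_rules, "rules; example string:", corpus[0][:60])
```

Averaging results over several seeds per grammar size would separate effects of grammar complexity from the idiosyncrasies of any single hand-built grammar.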