Randomly Sampled Language Reasoning Problems Elucidate Limitations of In-Context Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper investigates whether large language models (LLMs) are genuinely capable of learning new tasks in an in-context learning (ICL) setting. To study this question, the authors design controlled experiments based on simple formal language tasks generated from random deterministic finite automata (DFAs) with three states and three symbols. They introduce two tasks: (1) sequence completion, where the model must complete a query prefix consistent with in-context examples of a random DFA, and (2) transduction, where the model must predict the *transducer* output corresponding to an annotated input sequence drawn from a DFA.
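For concreteness, the generation process I picture from this description is roughly the following; the function names, the choice of accept states, and the length distribution are my own guesses rather than the authors' exact procedure:

```python
import random

def random_dfa(n_states=3, alphabet="abc", seed=0):
    """Sample a random DFA over `alphabet`: a full transition table plus a
    nonempty random set of accept states (state 0 is taken as the start state)."""
    rng = random.Random(seed)
    delta = {(q, s): rng.randrange(n_states) for q in range(n_states) for s in alphabet}
    accept = {q for q in range(n_states) if rng.random() < 0.5} or {0}
    return delta, accept

def sample_accepted_string(delta, accept, alphabet="abc", max_len=12, rng=None):
    """Naive rejection sampling of a string that the DFA accepts."""
    rng = rng or random.Random()
    for _ in range(10_000):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randint(1, max_len)))
        q = 0
        for ch in s:
            q = delta[(q, ch)]
        if q in accept:
            return s
    raise RuntimeError("accept states appear unreachable; resample the DFA")

delta, accept = random_dfa(seed=42)
examples = [sample_accepted_string(delta, accept) for _ in range(5)]
query = examples[0][:2]  # a prefix the model must complete consistently with the DFA
```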
The authors evaluate a broad range of pretrained open-weight and proprietary LLMs under different prompting strategies, from basic prompts with minimal explanation accompanying the in-context examples to more elaborate chain-of-thought (CoT) prompts. They compare these results against several non-LLM baselines, most notably n-gram models that simply match query patterns directly against the in-context examples.
Empirically, they report that the n-gram baseline consistently outperforms the LLMs across both tasks and that CoT prompting fails to produce systematic improvements, highlighting a fundamental limitation of current LLMs in performing genuine novel-task learning through the ICL framework.
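My reading of the n-gram baseline described above is essentially a longest-suffix match against the in-context examples; a rough sketch of that reading (the implementation and names are mine, not the authors'):

```python
def suffix_match_next_symbol(query, examples, max_order=4):
    """Predict the next symbol by finding the longest suffix of the query that
    occurs inside some in-context example, then copying the symbol that follows."""
    for order in range(min(max_order, len(query)), 0, -1):
        suffix = query[-order:]
        for ex in examples:
            idx = ex.find(suffix)
            if idx != -1 and idx + order < len(ex):
                return ex[idx + order]
    return examples[0][0] if examples else None  # fallback when nothing matches

examples = ["abca", "bcab", "acba"]
print(suffix_match_next_symbol("ab", examples))  # "abca" contains "ab" -> predicts "c"
```

If this kind of copying is sufficient to outperform the LLMs, it is a strong purely local baseline that any claim of grammar learning has to beat.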
The paper addresses an interesting question about the nature of in-context learning in LLMs. The setup is well-motivated, accompanied by a thorough related work section. The authors justify their benchmark design choices well and provide clear reasoning behind each task setup.
The experimental evaluation also covers a wide range of pretrained LLMs, along with detailed reporting of implementation, prompting formats, and compute usage. The additional ablations, such as varying the number of in-context examples and analyzing the effects of tokenization, add more depth to the empirical analysis.
(See questions)
1. **On the complexity of the task**: The authors motivate their design choice of using simple DFAs, but I am concerned that the task may be too simple to meaningfully assess whether LLMs learn the underlying grammar. In particular, the n-gram baselines achieve over 90% accuracy, indicating that simple pattern copying yields near-perfect performance on this task without any knowledge of the grammar structure. Isn't this a weakness of the benchmark for assessing a model's capability to capture a "world model"?
2. **Number of in-context examples**: It may be the case that LLMs can learn the task in-context, but are not sample efficient. If so, providing only a few in-context examples might underestimate their ability to pick up the underlying transition patterns.
Figure 5, which explores the impact of increasing the number of examples, partially addresses this. However, since LLM performance improves much more sharply than the n-gram baseline as the number of examples grows, I wonder whether LLMs could eventually catch up with enough in-context data.
3. **Novelty of the completions**: Beyond whether the output/completion is valid, another informative metric for probing the ICL mechanism is the novelty of the generated outputs themselves: when a model produces a correct completion, is it simply copying a pattern present in the in-context examples, or is it generating a new suffix consistent with the underlying grammar?
One possible diagnostic would be to compare the valid LLM completions with the pattern(s) matched by the n-gram model, to check whether the LLM is reproducing existing context patterns or generating genuinely new suffixes that reflect partially inferred structure (partial, since the performance is not perfect). A rough sketch of the kind of check I have in mind is included after these questions.
4. **Line 478**: "We believe our results suggest that LLMs have learned individual models of particular languages, but not a general theory of language." Could you clarify this statement? What does it mean that LLMs "have learned individual models of particular languages," and how is this conclusion supported by the presented results?
5. **Minor**: Could you clarify the *common-suffix* baseline? Does it pick a completion that *always* ends in an accept state, or the completion that most frequently leads to an accept state across queries?
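The following is a rough sketch of the diagnostic suggested in question 3; the exact definitions of "valid" and "copied" are mine and would need to match the paper's setup:

```python
def is_copied(prefix, completion, examples):
    """True if the completed string, or just the generated suffix, already occurs
    verbatim inside some in-context example (i.e. explainable by pure copying)."""
    return any((prefix + completion) in ex or completion in ex for ex in examples)

def novelty_rate(results, examples):
    """Fraction of *valid* completions that cannot be explained by copying.
    `results` is a list of (prefix, completion, is_valid) triples, where is_valid
    would come from running the ground-truth DFA on prefix + completion."""
    valid = [(p, c) for p, c, ok in results if ok]
    if not valid:
        return 0.0
    return sum(not is_copied(p, c, examples) for p, c in valid) / len(valid)
```

A high novelty rate among valid completions would be much stronger evidence of inferred structure than validity alone.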
Fully human-written |
---

Randomly Sampled Language Reasoning Problems Elucidate Limitations of In-Context Learning
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper evaluates the in-context learning (ICL) capability of large language models (LLMs) on tasks pertaining to regular languages recognized by 3-state deterministic finite automata (DFAs). The main result is that certain n-gram models outperform LLMs on sequence completion and transduction tasks over such regular languages. Overall, the paper extends previous work on in-context learning of regular languages to pre-trained LLMs, but major concerns remain regarding its claims and contributions.
1. The presentation of the experiments and results is clear, with no obvious or major issues.
2. The evaluations cover a wide variety of models, including open-weights, open-code, and proprietary models.
1. The novelty of the findings in this paper is rather limited and does not justify the claims made in Section 1 (contributions). In particular, the authors claim to introduce an LLM ICL benchmark for novel tasks using regular languages, but do not justify the motivation for such a new benchmark compared to existing work such as RegBench in [1]. Also, no experiment or result discusses the effect of RLHF on ICL performance. I would suggest rephrasing the claims to avoid such confusion.
2. The transductive task can also be treated as a symbolic reasoning task without associating any language with it. The task considered in this paper is to check whether a string contains an even number of 'a's (a minimal sketch of this check is given below). Similar character-counting tasks have been explored in [2] as well (which can be treated as a variant of the prompt where the task is revealed in the instruction). For these reasons, I believe the experiments in this work are rather incremental and lack a strong motivation.
I am willing to engage in a discussion and better understand the authors' perspective on this matter. Furthermore, since an LLM has to be trained on multilingual data to solve multilingual tasks, it is evident from the literature that it is not learning a general theory of language. Also, because the transductive tasks can be framed as symbolic reasoning tasks, I am not convinced of the significance of the current results or of framing such tasks as "random language tasks".
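For reference, the parity check described in weakness 2 reduces to a two-state automaton and needs no linguistic framing at all; a minimal sketch (my own reading of the task):

```python
def even_number_of_as(s: str) -> bool:
    """Two-state parity automaton: flip state on 'a', ignore all other symbols."""
    even = True
    for ch in s:
        if ch == "a":
            even = not even
    return even

assert even_number_of_as("abcab")      # two 'a's -> even
assert not even_number_of_as("abc")    # one 'a' -> odd
```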
[1] Akyürek, Ekin, et al. "In-Context Language Learning: Architectures and Algorithms." International Conference on Machine Learning. PMLR, 2024.
[2] Shin, Andrew, and Kunitake Kaneko. "Large Language Models Lack Understanding of Character Composition of Words." ICML 2024 Workshop on LLMs and Cognition.
1. **See the weaknesses above for the major questions.**
2. What can be some future research directions based on this work?
3. Sampling a string from a DFA can also be framed as sampling from a Markov chain (a rough sketch of the correspondence I have in mind follows these questions). The work in [3] explores this problem of learning such inherent structures as well. Can the authors discuss the similarities and trade-offs with that line of work, since it is closely related?
4. How does the number of states of the DFA affect performance? The paper currently considers 3 states; setting aside the brute-force baseline and closed-source models (which may be expensive), do we see a pattern in open-weights/open-code LLM performance if the state count is increased to 4, 5, 6, etc.?
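Regarding question 3, the correspondence I have in mind is sketched below: if the symbols of a string are drawn i.i.d. uniformly, the sequence of DFA states visited is a Markov chain whose transition matrix is induced by the DFA's transition table (the example `delta` here is hypothetical):

```python
import numpy as np

# delta maps (state, symbol) -> next state, mirroring a 3-state, 3-symbol DFA
delta = {(q, s): (q + "abc".index(s)) % 3 for q in range(3) for s in "abc"}

def dfa_to_markov_chain(delta, n_states=3, alphabet="abc"):
    """Transition matrix of the state sequence when symbols are i.i.d. uniform:
    P[q, q'] = Pr[next state = q' | current state = q]; rows sum to 1."""
    P = np.zeros((n_states, n_states))
    for q in range(n_states):
        for s in alphabet:
            P[q, delta[(q, s)]] += 1.0 / len(alphabet)
    return P

print(dfa_to_markov_chain(delta))
```

This is why the induction-head analysis of in-context Markov chains in [3] seems closely related to the present setting.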
Minor: recheck the grammar. For example, in Section 2.3 (second paragraph): "We also LLM ICL but push ...".
[3] Edelman, Ezra, et al. "The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains." Advances in Neural Information Processing Systems 37 (2024): 64273-64311.
Fully human-written |
---

Randomly Sampled Language Reasoning Problems Elucidate Limitations of In-Context Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces a benchmark designed to evaluate the in-context learning (ICL) capabilities of large language models (LLMs) in the context of alien (synthetically generated) language reasoning. The authors test several LLMs on this benchmark and find that even state-of-the-art models struggle with the simplest reasoning tasks in this setting.
1. The paper introduces a novel benchmark that specifically targets and isolates the in-context learning ability of LLMs in a controlled setting involving synthetic or "alien" languages.
2. The authors provide a comprehensive comparison of multiple LLMs and statistical models, offering a broad overview of current model capabilities on the proposed task.
1. While the use of next-token prediction accuracy is straightforward, the paper could benefit from additional evaluation metrics, such as error-type analysis or model calibration measures, to provide deeper insight into the models' specific failure modes (a minimal sketch of one such calibration measure follows these points).
2. The analysis of why LLMs perform well on natural or regular languages but fail on randomly generated ones is limited. A more in-depth investigation into this contrast would strengthen the paper's impact.
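To illustrate weakness 1, one simple calibration measure over the model's next-token predictions is the standard expected calibration error; the binning and names below are my own choices, not something from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by the model's confidence in its predicted symbol and
    compare average confidence to empirical accuracy within each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece
```

Reporting something like this alongside accuracy would show whether the models are confidently wrong or merely uncertain on the randomly generated languages.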
1. What key insights does this benchmark provide regarding the limitations of current LLMs in in-context learning, and what potential directions could address these limitations?
Fully AI-generated |
---

Randomly Sampled Language Reasoning Problems Elucidate Limitations of In-Context Learning
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The authors study the in-context learning ability of large language models through the lens of regular languages, in particular whether these models can complete sequences or answer questions about them (sequence completion vs. transduction tasks). A large-scale study is conducted on regular languages recognized by 3-state deterministic finite automata, where each LLM receives as input a sequence of examples accepted by a randomly drawn automaton and then a partial sequence for which it has to provide a completion. The results indicate that while LLMs are able to solve many novel tasks at inference time and produce complex reasoning, they are still behind in terms of general-purpose meta-learning and currently lag behind even n-gram methods on regular languages.
- The authors study a clear hypothesis, namely whether current LLMs act as general-purpose in-context learners, and show through a study with regular languages that they do not. It is a well-scoped study with a testable hypothesis, and the authors provide a succinct answer to the question they ask.
- The evaluation is quite thorough in terms of the number of DFAs and examples used to evaluate the models, as well as the variety of LLMs covered, spanning different families, sizes, and capabilities.
- While I commend the authors on the clear, well-studied, and well-scoped problem formulation, I struggle to grasp the more general benefit of such an analysis. What the authors show is the existence of a problem, templated within the language domain, on which LLMs struggle. However, a discussion along the lines of *no free lunch* would have been quite helpful; indeed, it is not unbelievable that such models would struggle on many language-based tasks which are, in some sense, not well encapsulated within their pre-training methodology. The benefit of ICL is not that it can solve *all* possible problems but that it can solve a number of novel problems that are of interest to us. In fact, if one were to think of ICL as a learning mechanism, one can come up with tasks where a general learner like SGD would fail; the benefit of SGD (or ICL) is that it is just generally good at tasks of interest to us.
- There is a lack of analysis of task complexity: from the draft we know that LLMs struggle with regular languages in-context, and we also know that they have shown tremendous progress on general language tasks. It would have been nice to see some analysis of how such models behave going from Type-3 to Type-0 languages, and also, within Type-3 languages, going from a small-state DFA to a larger-state DFA.
- Given that there are no consistent trends between model size and performance, whether across or within a model family, this suggests that the phenomenon the authors are seeing is a consequence of the evaluation task simply being quite out-of-distribution relative to the training regime. The lack of consistent differences between Basic and CoT prompting, among other variants, also reinforces this hypothesis. In light of this, I would appreciate it if the authors could comment on the downstream usefulness of their finding, beyond highlighting the existence of a task on which LLMs fail.
- LLMs are in general somewhat sensitive to how prompts are phrased as well as to the overall template in which observations are provided. From that perspective, have the authors done some form of sensitivity analysis where the characters a/b/c are replaced by different random words or tokens? For example, a/b/c could be replaced by 0/1/2 or cat/dog/bat (a sketch of the kind of remapping I mean follows below). It would be important to understand whether ICL even learns this invariance from in-context observations, and to understand its implications for the current hypothesis.
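As a concrete example of the sensitivity analysis suggested in the last point, one could re-render the same in-context episodes under different surface forms for the alphabet; the prompt template below is hypothetical and only meant to illustrate the idea:

```python
def render_prompt(examples, query, mapping):
    """Render an ICL prompt after rewriting each alphabet symbol with a new
    surface form (e.g. a/b/c -> 0/1/2 or cat/dog/bat)."""
    rewrite = lambda s: " ".join(mapping[ch] for ch in s)
    lines = [f"Example: {rewrite(ex)}" for ex in examples]
    lines.append(f"Complete: {rewrite(query)}")
    return "\n".join(lines)

examples = ["abca", "bcab", "acba"]
for mapping in [{"a": "a", "b": "b", "c": "c"},
                {"a": "0", "b": "1", "c": "2"},
                {"a": "cat", "b": "dog", "c": "bat"}]:
    print(render_prompt(examples, "ab", mapping))

# Running the same model on each rendering and comparing accuracy would show
# whether performance is invariant to the choice of surface forms, as it should
# be if the task is truly learned from the in-context examples alone.
```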
Fully human-written |