Emergent Chess Skill Acquisition in Large Language Models
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper studies how language models acquire chess skill when trained on algebraic chess notation. It introduces a disambiguation-aware tokenization scheme and trains models of varying depths (5-25 layers) on different datasets to study the emergence of capabilities. The authors observe clear developmental patterns: shallow models struggle with move legality, while deeper models develop tactical and positional understanding. Models trained on balanced game outcomes consistently outperform those trained only on white-win games.
- The paper is well-organized and clearly written.
- The core intuition of the paper is compelling.
- The largest model studied (25 layers, ~100M parameters) is relatively small by current standards. It's unclear if the observed patterns would hold at scales of billions of parameters.
- The paper doesn't compare performance against purpose-built chess engines, which makes it difficult to assess overall performance relative to other methods (see the sketch after this list for one way such a baseline could be run).
- The paper lacks information about the computing resources needed for training.
- The paper lacks case studies.
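To make the missing-baseline concern concrete, here is a minimal sketch of how such a comparison could be run, assuming python-chess and a local Stockfish binary (both my own hypothetical choices; the paper specifies neither):

```python
import chess
import chess.engine

# Hypothetical baseline check: ask Stockfish for its preferred move in a
# position and compare it with the trained model's choice. Assumes a
# Stockfish binary is available on PATH as "stockfish".
with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    board = chess.Board()
    result = engine.play(board, chess.engine.Limit(depth=12))
    print(board.san(result.move))  # engine move, usable for agreement or Elo baselines
```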
Please refer to the "Weaknesses" section.
Lightly AI-edited

Emergent Chess Skill Acquisition in Large Language Models
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper studies how language models acquire chess-playing abilities when trained on algebraic chess notation. The authors introduce a custom disambiguation-aware tokenization scheme and train models of varying depths on different datasets. The paper reveals a developmental trajectory reminiscent of curriculum learning, with rule comprehension emerging early and higher-order abilities following later.
- The motivation of the paper is sound.
- The paper is well-structured with clear method descriptions and results presentation.
- The paper is titled with "Large Language Models." However, the maximum size of the models trained in the paper is 100M parameters, which is relatively small.
- As mentioned in Section 5.3, evaluations used only 10 games per configuration, which may limit the statistical robustness of the conclusions, especially for rare events such as sacrifices or complex tactics (see the sketch after this list).
- There's no analysis of how the custom tokenization scheme impacts learning compared to alternatives (e.g., standard subword tokenizers such as BPE).
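To illustrate the 10-game concern, a quick binomial confidence-interval calculation (my own sketch, not from the paper) shows how wide the uncertainty is at this sample size:

```python
from math import sqrt

# With n = 10 games, a 95% normal-approximation interval around a 7/10
# win rate spans roughly 0.42 to 0.98 -- far too wide to distinguish
# configurations reliably.
def wald_ci(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = wins / n
    half = z * sqrt(p * (1 - p) / n)
    return (p - half, p + half)

print(wald_ci(7, 10))  # ~(0.416, 0.984)
```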
NA
Lightly AI-edited

Emergent Chess Skill Acquisition in Large Language Models
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper investigates chess skills in decoder-only transformer models trained from scratch on algebraic chess notation. The authors focus on the training dynamics and developmental trajectory of these skills, rather than final performance. They systematically vary model depth (5 to 25 layers) and the training data distribution (a balanced dataset vs. a white-win-only dataset). Using a custom, disambiguation-aware tokenization scheme, they analyze the emergence of three hierarchical levels of competence: rule comprehension, tactical execution, and strategic planning. The paper concludes that chess provides a valuable, interpretable benchmark for studying how structured, hierarchical reasoning emerges in language models.
The focus on the dynamics of skill acquisition, rather than just end-state performance, is interesting.
The study is well designed, systematically varying two factors: architectural depth and data distribution.
The evaluation is good, moving beyond simple win rates or Elo.
The current evaluation protocol appears to test the models as the White player. It would be beneficial to clarify if any experiments were conducted with the model playing as Black.
There seems to be a slight inconsistency in the evaluation methodology that I would appreciate clarification on. Rule comprehension is measured based on unconstrained generation, whereas the strategic evaluation uses prefix-constrained decoding to enforce legality. Could the authors explain the rationale for this dual approach? I wonder if this might decouple the model's strategic choices from its internal rule knowledge, potentially affecting the interpretation of the strategic metrics for shallower models that have not yet mastered legality.
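For concreteness, here is a minimal sketch of what the prefix-constrained strategic evaluation might amount to, assuming python-chess and a hypothetical model scoring function `score_san` (the paper's actual decoding interface is not specified here):

```python
import chess

def constrained_move(board: chess.Board, score_san) -> str:
    """Return the legal SAN move the model scores highest."""
    # Restrict the candidate set to legal moves, so illegality is impossible
    # by construction -- unlike unconstrained generation, which is what the
    # rule-comprehension metric measures.
    legal_sans = [board.san(m) for m in board.legal_moves]
    return max(legal_sans, key=score_san)
```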
The paper mentions that the training data was filtered to include games between 80 and 200 plies. Could the authors elaborate on the justification for this specific range?
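For reference, the filter presumably amounts to something like the following sketch (a hypothetical reconstruction; the paper's exact criterion is not quoted here):

```python
import chess.pgn

def keep_game(game: chess.pgn.Game) -> bool:
    # Keep only games whose mainline length is 80-200 plies (half-moves).
    plies = sum(1 for _ in game.mainline_moves())
    return 80 <= plies <= 200
```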
The custom disambiguation-aware tokenization scheme is an interesting feature of the methodology. Could the authors explain why this hand-engineered approach was chosen over standard, data-driven subword tokenization methods like BPE?
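As a point of comparison, a disambiguation-aware scheme might look like the following sketch (my own illustration; the paper's actual tokenizer may differ), splitting each SAN move into structural components so that, e.g., "Rad1" and "Rfd1" differ only in the disambiguation token, whereas BPE could merge them into unrelated subwords:

```python
import re

# Split a SAN move into piece, optional disambiguation file/rank, capture
# marker, destination square, promotion, and check suffix.
SAN = re.compile(
    r"(?P<piece>[KQRBN])?(?P<dis_file>[a-h])?(?P<dis_rank>[1-8])?"
    r"(?P<capture>x)?(?P<dest>[a-h][1-8])"
    r"(?P<promo>=[QRBN])?(?P<check>[+#])?$"
)

def tokenize_san(move: str) -> list[str]:
    if move in ("O-O", "O-O-O"):         # castling kept as a single token
        return [move]
    m = SAN.match(move)
    return [g for g in m.groups() if g]  # drop absent components

# tokenize_san("Rad1") -> ['R', 'a', 'd1']; tokenize_san("exd5") -> ['e', 'x', 'd5']
```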
Please refer to the weaknesses.
Moderately AI-edited

Emergent Chess Skill Acquisition in Large Language Models
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Using chess as the research domain, the study examines how models acquire various chess skills from scratch. Lower-level skills, such as making legal moves, are learned early in training, whereas higher-level strategies, such as sacrificing pieces, are only acquired in the later stages.
Provides a detailed characterization of skill acquisition during the model’s training process.
1. **I am not an expert in explainable AI!**
2. I find the **article’s conclusion quite obvious: higher-level skills are learned later in training**. This is predictable and does not provide the reader with additional insights. I suggest the authors focus on discussing how the existing findings in the paper can inform better strategies for training models.
See the weaknesses above.
Moderately AI-edited