LLMs as Rules Oracles: Exploring Real-World Multimodal Reasoning in Tabletop Strategy Game Environments
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The authors introduce a benchmark, LudoBench, to evaluate the abilities of multimodal large language models on three categories of tasks: perceiving board game states, integrating rulebook knowledge to evaluate game states, and solving short-horizon optimization tasks (e.g., "which legal move would get the most points?"). The dataset consists of 400 carefully annotated examples across three board games in diverse settings. The benchmark is far from saturated, with models scoring less than 70% on the easiest tier and less than 10% on the hardest, while humans achieve scores in at least the 80s on all tiers. The authors detail their question-generation pipeline and analyze interesting patterns across model performance (e.g., how models tend to over-update on images, so that performance sometimes improves when the rulebook is provided as text only).
This is a straightforwardly valuable dataset to the community; it bridges the gap between games-as-nice-theoretically-perfect-environments and "can models do real-world tasks" by looking at "understanding games in the real world". I specifically like the fact that all the questions are, on one hand, open-ended, but on the other hand have specific, well-defined, and unique answers. I especially appreciate the authors' efforts to check every item in the dataset for correctness and resolve ambiguities; ambiguity plagues most ML eval sets, and it seems to be thoroughly avoided here! The detailed analyses of failure modes and ablations are well-executed.
I think 400 data points is on the smaller side for a dataset, though this is mitigated by the fact that the points are all high-quality. Also, having only three board games is an obvious limitation. Probably the most significant current weakness is that the models evaluated are all somewhat out-of-date; if accepted, I would encourage the authors to update the main tables with the most recent models available.
I wonder to what extent the generation of these scenarios is automatable; for instance, if there are online implementations of these games that track board state, it might be easy, at the very least, to generate a lot of Tier 1 input-output pairs automatically (see the sketch below).
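To be concrete, here is a minimal sketch of the kind of automatic Tier 1 generation I have in mind, assuming some online implementation exposes a structured board state; all of the names, fields, and question templates below are illustrative and not from the paper:

```python
# Illustrative only: assumes an online implementation that exposes a structured
# board state; the field names and question templates are made up for this sketch.
import json
import random

def board_state_to_tier1_qa(state: dict) -> list[dict]:
    """Turn a structured board state into Tier 1 (perception) question-answer pairs."""
    qa_pairs = []
    # Counting questions over each player's visible components.
    for player, pieces in state.get("pieces_by_player", {}).items():
        qa_pairs.append({
            "question": f"How many pieces does {player} currently have on the board?",
            "answer": str(len(pieces)),
        })
    # Identification questions over occupied, labelled board cells.
    occupied = [(cell, owner) for cell, owner in state.get("cells", {}).items() if owner]
    if occupied:
        cell, owner = random.choice(occupied)
        qa_pairs.append({
            "question": f"Which player occupies cell {cell}?",
            "answer": owner,
        })
    return qa_pairs

# Toy state standing in for whatever the online implementation would return.
toy_state = {
    "pieces_by_player": {"red": ["r1", "r2"], "blue": ["b1"]},
    "cells": {"A1": "red", "B2": "blue", "C3": None},
}
print(json.dumps(board_state_to_tier1_qa(toy_state), indent=2))
```

The corresponding images would presumably still need to be rendered or photographed and spot-checked by hand, but that seems far cheaper than authoring every question from scratch.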
Fully human-written |
LLMs as Rules Oracles: Exploring Real-World Multimodal Reasoning in Tabletop Strategy Game Environments |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper examines the multimodal reasoning capabilities of LLMs within the complex, rule-driven setting of tabletop strategy games. The authors propose a three-tier framework of game understanding — environment perception, heterogeneous rule integration, and short-horizon optimization — and evaluate how well models perform across these layers. Using a newly constructed benchmark, LUDOBENCH, which includes three real-world board games of varying complexity, the study assesses eight state-of-the-art multimodal LLMs (including GPT-4.1, Gemini 2.5 Pro, Qwen-VL 2.5 32B, and Claude 3.7 Sonnet) under three rulebook input conditions: text, image, and none. The results reveal that even at the lowest tier—environment perception—performance is inconsistent, and it degrades sharply when rule retrieval and planning are required. Common failure modes include misreading cluttered board states, retrieving but misapplying rules, overlooking opponents’ states, and hallucinating illegal actions. Taken together, these findings indicate that current LLMs are not reliable “rule oracles” for analog tabletop play and underscore the need for stronger cross-modal grounding and faithful rule execution.
1. Clear, well-structured writing and figures make the method and setup easy to follow.
2. The paper’s three progressive competencies — environment perception, heterogeneous rules integration, and short-horizon optimization — mirror how humans build understanding step by step. This structuring makes results interpretable and helps localize failure modes.
3. The experiments show models often perform worse with visual inputs than with plain-text rules — the opposite of human behavior. This is an interesting, discussion-worthy result.
4. The study employs a multi-stage annotation pipeline that includes checks for annotator–expert agreement, which enhances the reliability of the benchmark. This methodological rigor improves the transparency, reproducibility, and credibility of the reported results.
1. While tabletop games offer well-defined, structured rule systems, they are highly domain-specific and differ markedly from real-world reasoning contexts. In this setting, rule comprehension can overshadow genuine reasoning, which may limit the benchmark’s ability to reflect broader inferential capability. Consequently, it is unclear how LUDOBENCH performance should be interpreted as a measure of general reasoning: the paper provides no evidence that its scores correlate with wider reasoning skills or transfer beyond tabletop domains.
2. Many tabletop rulebooks and gameplay descriptions are publicly available online. How do the authors ensure that the evaluated models had not already seen the same or similar rule texts during pretraining? If not, superior performance under text-based conditions might partially reflect memorization rather than genuine reasoning or rule integration.
3. The benchmark covers only three tabletop games with a steep difficulty jump (1.2 → 2.6 → 4.6) and narrow genre coverage. This constrains external validity and makes it hard to trace how performance scales with incremental complexity.
1. The paper states that the setting “reflects the experience of real-world gamers during first-time gameplay,” and the conclusion emphasizes “one-shot” scenarios. However, the human baselines (expert / hobbyist) are clearly not first-time players. How do the authors reconcile this with the stated goal, and would a novice baseline alter the model–human gap?
2. In some T1-Perception results (e.g., model o1 on KingDomino and Res Arcana), the no-rulebook condition outperforms having rules. What explains this inversion?
3. Many cards in tabletop games contain stylized artwork that diverges from LLM pretraining data and from real-world imagery. Given that even basic perception is challenging, the study should first separate perception from reasoning with a dedicated pretest (e.g., card/asset/symbol recognition) before proceeding to tier-by-tier game reasoning.
Lightly AI-edited |
LLMs as Rules Oracles: Exploring Real-World Multimodal Reasoning in Tabletop Strategy Game Environments |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper introduces LUDOBENCH, a new multimodal reasoning benchmark designed to evaluate the ability of VLMs to act as "rules oracles" for complex tabletop strategy games. Given a visual depiction of a game state and the corresponding rulebook, a model must answer the kinds of questions a new player might have. The benchmark is structured into three cumulative tiers of increasing difficulty: (1) Environment Perception (visual understanding of the game state), (2) Heterogeneous Rules Integration (applying rules from the rulebook to the state), and (3) Short-horizon Optimization (simple planning and tactical puzzles).
The authors curated a dataset of 400 question-answer examples across three popular tabletop games of varying complexity: Kingdomino, Res Arcana, and Pax Renaissance 2nd Edition. Evaluating eight state-of-the-art VLMs, the paper reveals significant shortcomings in current models. Even the best models achieve only ~68% accuracy on perception tasks and fall below 10% on optimization puzzles, in stark contrast to human hobbyist players who solve even the hardest puzzles with over 80% accuracy. The paper provides an extensive analysis of failure modes, identifying critical weaknesses in cross-modal grounding, spatial reasoning, state tracking, and the application of abstract rules to messy, unfamiliar environments.
- The paper addresses an important gap in AI research: while models have achieved superhuman performance in games like Chess and Go, this work presents the more realistic challenge of learning and reasoning in an environment where rules are not pre-encoded but must be acquired and grounded from multimodal sources.
- Game Selection: The choice of three games with a clear progression in rulebook length, component density, and mechanical complexity provides a strong basis for evaluating model scalability and robustness.
- Tiered Structure: The three-tiered evaluation (Perception, Integration, Optimization) is logical and effectively isolates different capabilities. This cumulative design allows for a clear diagnosis of where reasoning breaks down—a model cannot succeed at Tier 2 without competency in Tier 1.
- The findings are clear: the huge performance gap between SoTA models and humans is a valuable negative result for the community, highlighting how far these models are from human-like comprehension in this domain.
- The paper is well written and the figures are compelling.
- The provided link to access LudoBench does not work.
- The benchmark's scale is limited. The results, while conclusive for these specific games, may not generalize to the thousands of other tabletop games. The combinatorial nature of game states and rules is vast, and a larger, more diverse set of games and questions would further solidify the paper's claims.
- The paper presents results based on a fixed prompt structure. Given the significant impact of prompt engineering on LLM performance, the results could be sensitive to this choice. It is unclear whether more sophisticated prompting techniques (for example, breaking the query down into Chain-of-Thought steps) could yield better performance. A more detailed ablation study on prompting would strengthen the conclusion that the failures are fundamental to the models rather than an artifact of the evaluation setup. The authors could also report top-k accuracy to show whether the models sometimes know the right answer, which would indicate that they could get substantially better at this task with some finetuning (a minimal sketch of what I mean by top-k scoring follows after these bullets).
- Why would this benchmark in particular be more valuable than all the others that already exist?
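Regarding the top-k suggestion above, a minimal sketch of the pass@k-style scoring I have in mind, assuming k sampled answers per question and simple exact-match grading (both of which are my own assumptions, not details from the paper):

```python
# Illustrative only: assumes k sampled answers per question and exact-match
# grading after light normalization; none of these names come from the paper.
def top_k_accuracy(samples_per_item: list[list[str]], references: list[str], k: int) -> float:
    """An item counts as correct if any of its first k sampled answers matches the reference."""
    assert len(samples_per_item) == len(references)
    hits = sum(
        any(s.strip().lower() == ref.strip().lower() for s in samples[:k])
        for samples, ref in zip(samples_per_item, references)
    )
    return hits / len(references)

# Toy example: two questions, three sampled answers each.
samples = [["7 points", "6 points", "7 points"], ["pass", "play the mage", "pass"]]
refs = ["7 points", "play the mage"]
print(top_k_accuracy(samples, refs, k=1))  # 0.5 -- only the first item is right on the top sample
print(top_k_accuracy(samples, refs, k=3))  # 1.0 -- both items have a correct answer among 3 samples
```

Exact match is obviously too strict for some open-ended answers, so a more permissive grader would be needed in practice, but even a coarse number like this would tell readers how much of the gap sampling alone can close.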
Lightly AI-edited |