ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 1 (25%)  | 4.00       | 4.00           | 3216               |
| Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited | 1 (25%)  | 4.00       | 3.00           | 4278               |
| Lightly AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Fully human-written  | 2 (50%)  | 2.00       | 5.00           | 2269               |
| Total                | 4 (100%) | 3.00       | 4.25           | 3008               |
Review 1

Title: VideoGameBench: Can Vision-Language Models complete popular video games?
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary: The paper evaluates whether vision-language models (VLMs) can perform human-like embodied reasoning by playing full video games from the 1990s using only visual input and minimal instructions. Experiments show that no model progresses beyond the start of any game.

Strengths:
- Challenging environments and a strict train/test split are much needed.
- The use of perceptual hashing to estimate game completion is interesting.

Weaknesses:
- Many similar benchmarks already exist, as the authors have cited.
- The evaluation scope is narrow: experiments are limited to a handful of VLMs, with no systematic scaling or modality ablation.

Questions:
- Can the authors correlate the perceptual-hash score with a ground-truth measure of in-game progress, e.g. using the walkthrough videos, and report the result?
- How does perceptual hashing perform for games where progression is not entirely linear, e.g. The Legend of Zelda?
- Given the diverse skills required by each game, could the games be categorised so that the benchmark measures multiple discrete capabilities of models?

EditLens Prediction: Fully human-written
Review 2

Title: VideoGameBench: Can Vision-Language Models complete popular video games?
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary: The paper presents a benchmark for vision-language models (VLMs) on a variety of popular video games from the 1990s. With this setup, most current models score 0% on essentially every task.

Strengths: The benchmark is very cool, and the engineering and effort that must have gone into building it are impressive. The paper is well written and easy to follow.

Weaknesses:

One of my main criticisms is that a benchmark where all current models score 0% on essentially every task is not an interesting benchmark. This certainly makes the benchmark future-proof, but the amount of insight it can provide now, compared to existing benchmarks with more fine-grained progression systems, is lacking. Without a more fine-grained progression system, or easier games on which current VLMs can make some progress, the benchmark offers limited insight beyond what has already been shown in related work, regardless of how cool it is.

I am not sure this benchmark is asking the right questions. Do we expect VLMs to be the right kind of model for these games? I fully understand the motivation to test VLMs on games requiring reasoning, but a good portion of these games mostly require accurate perception-to-muscle-memory mapping and fast reflexes for humans to perform well, skills we are likely decades away from unlocking in general-purpose multimodal language models. Perhaps Vision-Language-Action (VLA) models, more akin to what is being developed for robotics, are the better test target for this benchmark. Or perhaps something more similar to SIMA [SIMA Team, 2024] (which I think deserves a citation in this paper) would be more appropriate for video games. Or perhaps the subset of video games chosen for the benchmark should be reconsidered based on which models are being tested.

The latency component of this benchmark (the full version, not Lite) also feels strange, since latency for most of these models depends on the GPU being used and/or the current server traffic for proprietary models, so a large source of variance is out of the authors' control. You are then really benchmarking the quality of the provided service at that point in time, not necessarily the quality of the models themselves. With a proper setup and enough compute, most present-day models could likely be run in real time without latency problems, but this is difficult to achieve if you are not the provider yourself. What happens if in the future you change the GPUs you run on, or if the API you use experiences more or less traffic?

Benchmarks for LLMs/VLMs on games are inherently high-variance; one seed is unfortunately not enough.

In its current state, this benchmark, while extremely cool and impressive in its own right, seems to be missing a target (the right type of foundation model to test) and a purpose (what new questions is it trying to answer, and what new insights is it providing?).

Questions:
1. Why weren't easier games added to obtain a more fine-grained average progression?
2. What new insights does this benchmark provide in its current state?
3. Why are VLMs the right target for this benchmark rather than VLAs?
4. Using only one frame every 0.5 seconds does not seem sufficient for many of these games. Have other frame rates been tested?
5. Why is latency an important factor here, given that it is out of your control for most closed-source models? What GPUs did you use to test open-source models? How do you account for changes in the latency of closed-source model APIs?

EditLens Prediction: Fully human-written
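The latency concern raised in Review 2 is easiest to see in a concrete loop. Below is a minimal sketch of a real-time agent loop in which the game keeps running while the model is thinking; `emulator` and `query_vlm` are hypothetical placeholders, not the paper's actual harness, and the 0.5-second frame interval is the value quoted in the review.

```python
import time

FRAME_INTERVAL = 0.5   # seconds between frames sent to the model (value quoted in the review)
GAME_FPS = 60          # assumed emulator frame rate

def run_realtime_episode(emulator, query_vlm, max_steps=1000):
    """Hypothetical real-time loop: the game keeps advancing while the VLM thinks,
    so slower inference means more unobserved frames between decisions."""
    for _ in range(max_steps):
        frame = emulator.get_frame()            # raw pixels only, no internal game state
        start = time.monotonic()
        action = query_vlm(frame)               # network + inference latency lands here
        latency = time.monotonic() - start

        # In the full benchmark the game does not pause, so every frame that elapsed
        # during inference is effectively lost; VideoGameBench Lite would pause here.
        emulator.advance(frames=int(latency * GAME_FPS))

        emulator.apply(action)
        emulator.advance(frames=int(FRAME_INTERVAL * GAME_FPS))
```

In this loop a 3-second API round trip costs roughly 180 unobserved frames, and that number is set by the provider's current load and hardware rather than by the model itself, which is the source of variance the reviewer is pointing at.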
Review 3

Title: VideoGameBench: Can Vision-Language Models complete popular video games?
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

# Summary

This paper introduces **VideoGameBench**, a multimodal, interactive benchmark encompassing 23 classic video games (10 for testing, 13 for development) across Game Boy and MS-DOS platforms. It adopts a stringent ruleset requiring "raw pixels + only control/objective descriptions". Complementary components include **VideoGameBench Lite** (where the game pauses during agent inference) and an automated **checkpoint-hashing** method for progress measurement; this method leverages frame data from YouTube walkthroughs to quantify completion. Frontier vision-language models (VLMs) achieve extremely low completion rates: the top-performing models reach approximately 0.48% on the full benchmark and 1.6% on VideoGameBench Lite. Additionally, simple auxiliary "practice games" reveal critical deficits in VLMs' mouse control, dragging precision, and basic navigation abilities.

# Strengths

1. **Clear problem framing and motivation**: The paper persuasively argues that current VLMs poorly capture human-centric inductive biases (e.g., perception, spatial navigation, memory management), capabilities that come naturally to humans. It further justifies real-world video games as an intuitive, ecologically valid testbed for evaluating these understudied skills.
2. **Rigorous ruleset**: The constraint of "no visual overlays, no access to internal game state, and minimal hints" cleanly isolates VLMs' core visual understanding and action-sequencing abilities. This design discourages over-reliance on bespoke toolchains or game-specific workarounds, ensuring evaluations reflect generalizable capabilities.
3. **Turn-based ablation (VideoGameBench Lite)**: This variant effectively decouples reasoning performance from inference latency. It clearly demonstrates that even without real-time response pressure, VLMs still struggle to progress, highlighting fundamental limitations in reasoning rather than just timing constraints.
4. **Automated progress tracking**: The perceptual-hashing pipeline provides a practical, replay-agnostic approach to coarse-grained scoring. By grounding checkpoints in publicly available walkthroughs, it ensures consistency and reproducibility across different model evaluations.
5. **Broad coverage across genres and I/O modalities**: The benchmark spans diverse game mechanics (controller-based for Game Boy vs. mouse/keyboard for MS-DOS), spatial dimensions (2D and 3D), and genres (action, RPG, strategy, puzzle, racing). This breadth enables comprehensive assessment of VLMs' adaptability to varied interactive tasks.

# Weaknesses

1. **Discussion of checkpoints as performance metrics**: While using checkpoints as a performance metric is intuitively reasonable, the results in Table 2 show that most models score 0% on the majority of games. This excessively high benchmark difficulty makes it difficult to reflect models' true capabilities and leaves little discriminative power between different VLMs. Supplementing the checkpoint system with more granular, phased metrics to capture incremental model progress would significantly enhance the benchmark's general applicability.
2. **Mechanistic failure analysis**: The identified failure modes (e.g., the "knowing-doing gap") are descriptive but lack mechanistic depth. For instance, the paper does not address whether the knowing-doing gap is caused by VLMs' inability to map visual perception to action sequences, by limitations in context window size (e.g., forgetting prior actions), or by poor spatial coordinate mapping. Without such mechanistic insights, the paper provides limited guidance for targeted model improvement.

# Questions

Q1: Can the authors supplement the benchmark with phased, granular metrics to better capture incremental model capabilities? Doing so would address the current lack of discriminative power and further enhance the benchmark's general applicability (targeting Weakness 1).

Q2: Can the authors conduct a more mechanistic analysis of the identified failure modes? For example, disentangling whether deficits stem from visual perception-action mapping, context window constraints, or spatial reasoning limitations would provide critical insights for model development (targeting Weakness 2).

Strengths, Weaknesses, Questions fields: See Summary.

EditLens Prediction: Moderately AI-edited
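For readers unfamiliar with the progress metric discussed in Review 3, the following is a minimal sketch of checkpoint detection via perceptual hashing, assuming (as the reviews describe) that checkpoint frames come from walkthrough videos and are matched to live frames by Hamming distance. It uses the `imagehash` and `Pillow` libraries and is illustrative only, not the authors' actual pipeline.

```python
import imagehash            # pip install ImageHash Pillow
from PIL import Image

def load_checkpoint_hashes(checkpoint_paths):
    """Precompute perceptual hashes for checkpoint frames scraped from a walkthrough video."""
    return [imagehash.phash(Image.open(p)) for p in checkpoint_paths]

def completion_fraction(live_frames, checkpoint_hashes, threshold=8):
    """Fraction of (ordered) checkpoints reached during the agent's playthrough.

    `live_frames` are PIL images captured while the agent plays; `threshold` is the
    Hamming-distance cutoff, which the reviews say is tuned per game.
    """
    live_hashes = [imagehash.phash(f) for f in live_frames]
    reached = 0
    for target in checkpoint_hashes:                       # checkpoints assumed ordered
        if any(h - target <= threshold for h in live_hashes):
            reached += 1
        else:
            break                                          # stop at first unreached checkpoint
    return reached / len(checkpoint_hashes)
```

A sketch like this also makes Review 1's non-linearity question concrete: if checkpoints can be reached out of order (as in The Legend of Zelda), stopping at the first unreached checkpoint undercounts progress.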
Review 4

Title: VideoGameBench: Can Vision-Language Models complete popular video games?
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: The paper introduces VideoGameBench, a benchmark for evaluating vision-language models (VLMs) on 23 curated 1990s video games (split into dev/test sets, including 3 secret games) from Game Boy and MS-DOS platforms. Unlike existing benchmarks, it requires VLMs to complete full games in real time using only raw visual inputs and high-level objectives/controls (no game-specific scaffolding or auxiliary tools). The authors also propose VideoGameBench Lite, a variant that pauses the game during model inference to mitigate latency issues.

Strengths:
1. The paper addresses a critical, underexplored gap: evaluating VLMs on full, unmodified real-world tasks (1990s video games) that require integrated human-like abilities (perception, memory, real-time decision-making). Prior benchmarks rely on simplified grid worlds, text-only games, or game-specific tools (e.g., Gemini Plays Pokemon used pathfinding hints), making VideoGameBench a novel "no crutches" evaluation.
2. The benchmark construction is rigorous: it supports multiple emulators (PyBoy, DOSBox), standardizes action interfaces (keyboard/mouse/controller), and uses human-verified checkpoints scraped from walkthroughs. The automated progress tracking via perceptual hashing is well validated (e.g., tuning Hamming distance thresholds per game) and avoids subjective manual scoring.
3. The work highlights critical limitations of state-of-the-art VLMs in real-world, dynamic environments, limitations that matter for downstream applications (e.g., robotics, autonomous systems). For example, the "knowing-doing gap" and poor memory management are not captured by math/coding benchmarks but are essential for embodied AI.
4. VideoGameBench provides a standardized, scalable testbed to drive progress in VLM generalization and real-time reasoning. By focusing on 1990s games (with well-documented mechanics and walkthroughs), it avoids the copyright and accessibility barriers of modern games while retaining complexity.

Weaknesses:
1. The benchmark focuses exclusively on 1990s 2D/3D games (Game Boy, MS-DOS), excluding modern game mechanics (e.g., open-world exploration, multiplayer, touch controls) and other classic platforms (e.g., NES, Sega Genesis). This narrows the generalizability of the results; VLMs may fail differently on games with distinct interaction paradigms (e.g., point-and-click adventures vs. real-time strategy).
2. There is no human baseline for the full benchmark, only confirmation that humans can complete the practice games and first checkpoints. A human performance metric (e.g., average completion percentage, time to checkpoint) would better contextualize VLM failures.

Questions:
1. The paper mentions that a "0% score does not imply no progress", but how much incremental progress (e.g., moving 1 tile, interacting with 1 object) do models typically make? Providing finer-grained scores for all games (not just Tables 10-11) would better illustrate model capabilities.
2. Did you test whether VLMs perform better with game-specific prompts beyond basic controls/objectives (e.g., hints about common mechanics such as "jump to avoid enemies")? If not, could prompt engineering reduce the "knowing-doing gap"?
3. More model results should be included; the paper evaluates only 7 VLMs.

EditLens Prediction: Fully AI-generated
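The per-game Hamming-threshold tuning mentioned in Review 4's Strength 2 can also be pictured with a small calibration sketch. The procedure below (pick the largest cutoff that produces no false matches on known non-checkpoint frames) is an assumed heuristic for illustration, not a detail taken from the paper.

```python
import imagehash

def tune_threshold(checkpoint_imgs, noncheckpoint_imgs, max_threshold=20):
    """Pick the largest Hamming-distance cutoff that yields zero false positives on
    PIL-image frames known not to be checkpoints (illustrative heuristic only)."""
    cp_hashes = [imagehash.phash(img) for img in checkpoint_imgs]
    nc_hashes = [imagehash.phash(img) for img in noncheckpoint_imgs]

    best = 0
    for t in range(max_threshold + 1):
        if any(cp - nc <= t for cp in cp_hashes for nc in nc_hashes):
            break        # this cutoff would already match a non-checkpoint frame
        best = t
    return best
```

In practice one would also check that checkpoint frames from a second playthrough still match under the chosen cutoff; the sketch omits that step.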