ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 1 (25%)  | 4.00       | 3.00           | 6904               |
| Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited | 1 (25%)  | 4.00       | 4.00           | 2236               |
| Lightly AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Fully human-written  | 2 (50%)  | 3.00       | 4.00           | 5582               |
| Total                | 4 (100%) | 3.50       | 3.75           | 5076               |
Each review below lists: Title, Ratings, Review Text, and EditLens Prediction.
Through BabyAI Steps: Understanding and Evaluating Grounded Intelligence in LLMs

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper uses the BabyAI gridworld to evaluate the prediction, planning, and decomposition abilities of LLMs. Prediction is the ability to map a current state and a sequence of actions to the future state, planning is the ability to map a state and subgoal to a sequence of actions, and decomposition is the ability to map instructions into sequences of subgoals. The results show that LLMs are fairly good at prediction but bad at planning and decomposition.

Strengths:
1. Originality: As the paper acknowledges, there is a large body of work studying the planning/decomposition abilities of LLMs and a large body of work trying to make pretrained models better dynamics models. To me, the main contribution of this work is that it studies the different aspects of sequential decision making--forward models (prediction), higher-level planning (decomposition), and lower-level planning (planning).
2. Quality: The work seems comprehensive, with open-sourced code and environments within BabyAI. There are some interesting experiments in the appendix.
3. Clarity: The work is easy to read and understand.
4. Significance: This paper can conceivably form part of a benchmark to better understand and iterate on LLM sequential decision making capabilities along the three aspects of predict-plan-decompose.

Weaknesses:
1. While the work proposes three axes along which to evaluate LLM sequential decision making capabilities, it does not propose any insights about how to potentially improve their capabilities along these fronts.
2. Most of the core findings seem too specific to BabyAI gridworlds and not as applicable to LLMs on sequential decision making problems in general (Sections 5.1 - 5.3).
3. Table 1 proposes metrics like Manhattan distance that are not reported in subsequent results sections. (If a metric only shows up in the appendix, it should not be in the main text.)
4. Lines 275-281 mention four different prompting strategies, but Line 283 says all experiments use Tree-of-Thought (ToT). I would recommend the authors simplify this paragraph and just note that ToT is used, without mentioning the other prompting strategies, to improve clarity.
5. In general, in my opinion, there are too many metrics (7) in the main text of the paper, and I would suggest the authors focus on a smaller group of metrics, with more analysis on each. Some metrics like comprehension rate are confusingly named, and most of them are not adequately explained or motivated.
6. Lines 412-414: "Models that perform well on [Predict] implement a shallow arithmetic routine rather than reason about the environment." Was this conclusion based on an analysis of LLM outputs? Please provide some concrete metrics. Also, arithmetic on the (x, y) positions could be an important part of environment reasoning. How generalizable are these lessons outside of gridworld?
7. Lines 415-416: "LLMs that failed on Predict tended to over-engage with the spatial context, for example by validating each action, checking the prompt, or questioning feasibility." Why is it bad to "over-engage," "validate actions," "check the prompt," and "question feasibility"?
8. Lines 428-433: Instead of listing the failure modes, count their frequency, categorize them, and show them as a pie chart.
9. Section 5.3 seems weak due to the lack of concrete results. The authors say there are no experimental results because "they yielded zero outcomes." However, is this true even in the easy 4x4 grids? And it was zero success rate over how many trials?

Questions:
1. One of the main questions that comes to mind after reading the results: can LLMs do better planning by leveraging their strength on the prediction problem? Basically, have the prompt encourage an intermediate prediction step before it outputs actions for planning (a minimal sketch of this idea follows this review). If the authors can demonstrate some performance gains by doing this, it would make this a much more exciting and strong paper, in my opinion.
2. What does "assistance curve integral" intuitively mean? The word "assistance" is confusing. What is assisting what? What module is proposing additional LLM subgoals, and how does this proposing occur? Why is this metric important?
3. Section 5.3: What happens if the image is provided alongside the same textual state description? Is the success rate still zero? I think the authors would benefit from reporting results for three modality scenarios: text-only state description, text + image state, and image-only. One could also try a slightly modified image state where the rows and columns are explicitly numbered, to facilitate coordinate understanding.
4. The authors say in Section 5.3 that the VLM is much less precise than LLMs; the VLM uses words like "make a short south adjustment" while the LLM says things like "the red ball is at (4, 14)." How much of this is due to the fact that the LLM was given a list of precise coordinates for all objects in the scene, whereas the VLM was presumably not given any coordinates?
5. What are some findings that you discussed in Section 5 that are applicable to an LLM in a continuous state-space setting?
6. Section 4.3: Why should we expect that an agent with partial observability can perform decent planning? What kind of -- and how much -- interactive execution are they allowed?

EditLens Prediction: Fully human-written

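A minimal sketch of the predict-then-plan prompting idea raised in Question 1 above. This is an illustration only, not the paper's method: `call_llm` is a hypothetical helper standing in for whatever chat API is used, and the state/action text formats are assumed to match BabyBench's textual encoding.

```python
# Sketch of "predict before planning": before committing to each action, ask the
# model to simulate the action's effect, leaning on the stronger Predict ability.
# `call_llm` is a hypothetical stand-in for an actual chat-completion call.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM chat endpoint; returns the raw completion."""
    raise NotImplementedError("plug in the chat API used for the other experiments")


def predict_then_plan(state_text: str, subgoal: str, max_steps: int = 20) -> list[str]:
    """Propose actions one at a time, updating a model-predicted state after each."""
    actions: list[str] = []
    current_state = state_text
    for _ in range(max_steps):
        # Planning step, conditioned on the model-maintained state description.
        action = call_llm(
            f"State:\n{current_state}\n\nSubgoal: {subgoal}\n"
            "Output only the single next action (e.g. forward, left, right, pickup)."
        ).strip()
        actions.append(action)
        # Prediction step: reuse the Predict skill to simulate the action's effect.
        current_state = call_llm(
            f"State:\n{current_state}\n\nAction taken: {action}\n"
            "Describe the resulting state in the same textual format as above."
        ).strip()
        # Stop once the model judges the subgoal satisfied.
        done = call_llm(
            f"State:\n{current_state}\n\nSubgoal: {subgoal}\n"
            "Is the subgoal satisfied? Answer yes or no."
        )
        if done.strip().lower().startswith("yes"):
            break
    return actions
```

Comparing the success rate of plans produced this way against the paper's direct Plan setup would test whether prediction strength can be converted into planning gains.
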
Through BabyAI Steps: Understanding and Evaluating Grounded Intelligence in LLMs

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: The paper introduces BABYBENCH, a text-based benchmark built from the BabyAI gridworld to test LLMs' grounded reasoning and planning. It defines three tasks: predicting action outcomes, planning action sequences to reach goals, and decomposing high-level instructions into subgoals. Experiments show that while top models (GPT-5, Claude 4, LLaMA 3.1, Qwen 3, etc.) perform well on short-term prediction (>80% accuracy), they fail on long-horizon planning (<20% success), with performance dropping further (~10-12%) under partial observability and interactive settings. Also, multimodal models perform even worse when given visual inputs. Overall, BABYBENCH offers a clean, procedurally generated testbed revealing that current LLMs can simulate short-term effects but lack robust spatial and grounded reasoning for long-term planning.

Strengths:
- Problem is important, benchmark is helpful: The paper tackles a core limitation in current LLM research, the lack of true grounded reasoning and planning abilities. By introducing BABYBENCH, a controlled text-based gridworld benchmark, the authors isolate reasoning from the visual and linguistic noise seen in prior embodied tasks. This lets them probe core spatial and planning competence directly. The benchmark sits neatly between abstract text-based planning datasets (e.g., NATURAL PLAN) and complex embodied simulators (e.g., ALFRED, ALFWorld), filling an important methodological gap.
- Comprehensive evaluation: The study tests multiple leading LLMs across architectures and scales under standardized settings, ensuring generality of the findings. Chain-of-Thought and Tree-of-Thought prompting are applied to give models the best chance to succeed, and an expert BabyAI OmniBot agent supplies gold-standard plans and automated execution. The graded difficulty levels (easy to very hard) help characterize how performance deteriorates with increasing task complexity, a crucial insight into long-horizon reasoning.
- Reproducibility and open science: BABYBENCH is fully open-source and procedurally generated, ensuring virtually infinite data diversity and eliminating contamination. The authors release the codebase, dataset variants, and environment agents, making replication and extension straightforward. This is potentially a major contribution to the field and supports long-term work on spatial grounded reasoning across models and approaches.

Weaknesses:
- The visual-degradation finding is preliminary: The finding that visual input worsens performance is intriguing but based on only one multimodal model (GPT-5). It is unclear whether the failure stems from inherent limitations of vision-language reasoning or from that model's specific training and representation biases. Broader testing with GPT-4 Vision, Gemini, or models with additional vision-representation training would help validate it. The vision result should be treated as an early observation rather than a general conclusion. Including even one more baseline or a controlled text-vs-image comparison would strengthen this part of the study. Research such as Jin, Emily, et al., "MARPLE: A Benchmark for Long-Horizon Inference," shows the opposite trend, with multimodality improving long-horizon reasoning.
- Synthetic setting and unclear transferability: The BabyBench environment's minimalist "Baby Language" and gridworld setup enable clean analysis but raise questions about generalization. While the benchmark isolates core reasoning, it diverges from natural, noisy, or visually rich environments. The authors acknowledge this and relate it to prior work (e.g., NATURAL PLAN's similarly low success rates), but it remains unclear how much BabyBench results predict real-world embodied reasoning. The benchmark's artificial precision may also encourage strategies that would not transfer, such as coordinate arithmetic. A short discussion of how the findings might extrapolate to messier inputs or language-based control would provide useful context.
- Incomplete partial-observability experiments: The FPI (interactive) evaluation is an excellent addition but limited in scope: only two models were tested, and only on a small subset of missions. This constrains what we can infer about architecture-specific behaviors under feedback and memory constraints. The results (10-12% success) convincingly show difficulty, but more detailed analysis, such as whether failures stem from looping, goal forgetting, or perceptual confusion, would add depth. The paper briefly mentions efficiency metrics but does not interpret them, missing a chance to reveal whether models are inefficiently correct or simply fail altogether. These are minor but notable omissions in an otherwise rigorous evaluation.
- Minor clarity issues: The writing is clear overall, but a few details could improve readability. The custom metrics for the Decompose task are only defined in the appendix and would benefit from a short intuitive description in the main text. Similarly, quantitative comparisons among prompt formats (structured vs. narrative or JSON) would help contextualize the chosen setup. Lastly, the label "GPT-5" for the multimodal model may confuse some readers; clarifying it as a future or prototype GPT variant would avoid ambiguity. These are small, easily fixable issues that do not detract from the paper's overall quality.

Questions:
- Potential for training/fine-tuning: Did you try fine-tuning an LLM (or training a smaller model) on BabyBench tasks to see if planning improves? The current study focuses on zero/few-shot results, but it would be helpful to know whether the poor planning is due to lack of exposure or deeper architectural limits. Even a brief discussion or intuition on whether fine-tuning could bridge the gap would add valuable perspective.
- Vision input and failures: How exactly was the visual input provided (as a rendered grid image, a textual description, or something else), and how did that choice impact the outcome? What kinds of mistakes did the model make: did it misread spatial layouts or simply ignore the visual cue? Since only GPT-5 was tested, it is unclear whether this reflects a general weakness of VLMs or that specific model. Clarifying whether the issue is coarse spatial representation or domain mismatch (natural vs. schematic images) would strengthen this point.
- Long-horizon planning behavior: In the Tree-of-Thought setup, did models actually branch and explore multiple plan paths, or were they mostly greedy? Examples of near-misses, where the model almost solved a task but made a wrong move, would help illustrate whether the failure is due to poor search, limited memory, or state-space complexity.
- Scaling with environment size: As grid size and obstacle count increased, did you observe any behavioral trends, e.g., overly long, nonsensical plans or premature halts? Insights into whether errors stem from sequence length (depth) or environment clutter (breadth) could clarify what truly limits scaling.

EditLens Prediction: Fully AI-generated

Through BabyAI Steps: Understanding and Evaluating Grounded Intelligence in LLMs

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: The paper introduces BABYBENCH, a textual adaptation of the BabyAI environment, to evaluate grounded reasoning in large language models through three components: predict, plan, and decompose. This benchmark aims to isolate core aspects of spatial reasoning and planning. The results show a clear gap between state prediction and goal-directed planning. The authors also test visual variants and find that visual grounding further degrades performance. This work provides a clean and reproducible benchmark for grounded reasoning. It is well executed but relatively narrow in scope and insight.

Strengths:
- The benchmark is well-structured and reproducible, with clear task definitions and metrics.
- The Predict-Plan-Decompose framework provides a simple and interpretable way to dissect different aspects of spatial reasoning.

Weaknesses:
- The paper's contribution is mainly infrastructural; it does not introduce new modeling ideas or theoretical insights.
- The benchmark focuses on a very simplified, symbolic environment. The more interesting aspects of spatial reasoning lie in visual information such as shapes, sizes, and directions, which is difficult to capture through text alone. The contribution of this paper would be more significant if the authors extended their study to visually described environments that involve more spatial structure.
- The evaluation lacks precise version information for the API-based LLMs.

Questions: In the current design, the "Decompose" task is evaluated through the OmniBot's subgoal execution structure, where metrics depend on how well the bot executes or assists with the LLM-generated subgoals. However, this setup seems tightly coupled with the OmniBot's own subgoal semantics and execution logic, which might limit the generalization of this metric to other tasks or environments. Have the authors considered a more "reference-free" approach for evaluating decomposition ability, one that relies on the LLM itself rather than an external executor? For example, could decomposition be assessed based on whether the generated subgoals lead to successful action sequences that fulfill them, and whether chaining these subgoals improves success on the original mission compared to direct planning without decomposition?

EditLens Prediction: Moderately AI-edited

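A minimal sketch of the reference-free decomposition evaluation suggested in the question above, assuming a Gymnasium-style BabyAI environment and two hypothetical wrappers around the model under test (`llm_decompose`, `llm_plan`); none of this is part of the paper's released setup.

```python
# Score decomposition by whether chaining LLM-planned subgoal segments solves the
# original mission, without relying on OmniBot's internal subgoal semantics.
# `llm_decompose(mission, obs) -> list[str]` and `llm_plan(goal, obs) -> list[int]`
# are hypothetical wrappers around the LLM being evaluated.

def chained_success(env, mission, llm_decompose, llm_plan, max_actions_per_subgoal=50):
    """Return True if executing the LLM's subgoal chain completes the mission."""
    obs, _ = env.reset()
    for subgoal in llm_decompose(mission, obs):
        for action in llm_plan(subgoal, obs)[:max_actions_per_subgoal]:
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                return reward > 0  # episode ended: success iff positive reward
    return False  # exhausted all subgoals without finishing the mission


def direct_success(env, mission, llm_plan, max_actions=200):
    """Baseline: plan directly for the full mission, with no decomposition."""
    obs, _ = env.reset()
    for action in llm_plan(mission, obs)[:max_actions]:
        obs, reward, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            return reward > 0
    return False
```

Averaging `chained_success` against `direct_success` over a set of missions would attribute any gain (or loss) to the decomposition step itself, independent of an external executor.
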
Through BabyAI Steps: Understanding and Evaluating Grounded Intelligence in LLMs

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: The contributions of the paper are twofold. First, the authors introduce BabyBench, a benchmarking environment and suite focused on evaluating the predictive reasoning and planning capabilities of LLMs within a randomly generated BabyAI grid world. Second, they provide the Predict-Plan-Decompose (PPD) framework, an evaluation framework whose goal is to isolate and evaluate atomic grounded reasoning abilities of LLMs. Using this framework, they run a series of evaluations across multiple LLM models in order to uncover the relationship between the predictive capabilities of these LLMs and their long-term planning capabilities. In doing so, they uncover an asymmetry between the predictive and planning capabilities of LLMs.

Strengths: The paper provides a new environment and evaluation suite for analyzing the predictive and planning capabilities of LLMs. Assuming that the code is open sourced, this artifact may be a valuable contribution to the field at large. Furthermore, many of the baselines and configurations in the paper appear to be available through the anonymized GitHub link, amplifying the potential for future work to build on the results here. For planning tasks in particular, it can be difficult to find benchmarks which clearly and accurately assess the performance of these models.

Weaknesses: The main concern is that the predict-plan asymmetry derived as the main result of the analysis seems fairly obvious and already covered in prior work. Multiple prior works have explored the fact that planning is a particularly difficult task for LLMs across multiple domains [1, 2, 3]. Furthermore, the idea that an LLM can act as a form of world model to predict the future has also been covered in prior work [3, 4]. So the main "notable dissociation" derived in this work seems to be well covered in prior work. Abstractly, if a model achieves per-step prediction accuracy $p \in (0,1)$, the success probability for $n$-step planning is bounded by $p^n$, which decreases exponentially, since $p^n < p$ for all $n \geq 2$ (see the worked example after this review). Furthermore, this kind of degradation could also be due to longer context, since the Predict task only requires the model to predict the final state, whereas the Plan and Decompose tasks require the model to synthesize multiple steps.

In addition, as mentioned in Section 5.1: "Predict does not require the LLM to form a faithful spatial internal representation. In practice, models that perform well on this task implement a shallow arithmetic routine rather than reasoning about the environment. As a result, the model can ignore the spatial context and rely solely on integer additions." [Lines 412-415] It seems, then, that the high performance of these models on the Predict task can be partially attributed to the grid-like structure of the benchmark, where the model can do "a shallow arithmetic routine" to predict the next step. This may explain why the Predict score is so high. For a non-gridworld task, one might expect the models to do more poorly on a predict task, where they cannot exploit this. Furthermore, the experiments are not extensive enough to derive this result robustly.

The reasoning for choosing Tree-of-Thoughts seems fairly barebones. Appendix E and Table 7 show that the ToT strategy does not perform much stronger than Chain-of-Thought (CoT) or Few-Shot, and further experiments using these prompting strategies would strengthen the results shown in the paper. Similarly, the justification for using the Structured format in Section 3.2 is also lacking. Overall, showing that these results hold across models, prompting strategies, and output formats would lend stronger insight into the abilities of these different LLMs.

Presentation-wise, the figures in the paper are extremely difficult to read and analyze. Figures 2 and 3 and Table 3 have extremely small fonts and axes, which make it hard to interpret the results. There are also several typos across the paper, for example, the title of Section 5.1 (Assymetry -> Asymmetry).

[1] Valmeekam, Karthik, et al. "PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change." Advances in Neural Information Processing Systems 36 (2023): 38975-38987.
[2] Webb, Taylor, Shanka Subhra Mondal, and Ida Momennejad. "Improving planning with large language models: A modular agentic architecture." arXiv preprint arXiv:2310.00194 (2023).
[3] Hao, Shibo, et al. "Reasoning with language model is planning with world model." arXiv preprint arXiv:2305.14992 (2023).
[4] Gu, Yu, et al. "Is your LLM secretly a world model of the internet? Model-based planning for web agents." arXiv preprint arXiv:2411.06559 (2024).

Questions: The main suggestion to the authors would be to provide evidence that this predict-plan asymmetry exists independently of the quirks of the benchmark and of long-context effects. For instance, what are the average prompt and generation lengths for the Predict, Plan, and Decompose parts of the benchmark? If the number of tokens used is similar across the board, then this concern can be at least partially mitigated. How deep is the ToT search on average? An experiment using something simpler like CoT or Few-Shot would make this kind of analysis significantly simpler and provide stronger evidence for the main result.

Another question is about Section 5.3, which states that the vision model yielded worse results than its LLM counterpart. This section is incredibly barebones, as it does not have a single reference, table, or figure attributed to it. Did the GPT-5 vision model achieve 0% accuracy across the board? Was the same prompt used for the LLM GPT-5 and the VLM GPT-5? This seems more like a prompting-strategy error than a fundamental flaw in the VLM.

EditLens Prediction: Fully human-written

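As a worked illustration of the compounding-error bound mentioned in the weaknesses above (illustrative numbers only; the per-step accuracy value and the independence assumption are the reviewer's simplification, not results from the paper):

```latex
% Per-step prediction accuracy p, plan length n; assuming errors compound
% independently, the n-step planning success probability is at most p^n.
\[
  p = 0.85:\qquad p^{5} \approx 0.44, \qquad p^{10} \approx 0.20, \qquad p^{15} \approx 0.087 .
\]
% So a model with >80% single-step prediction accuracy can still fall below
% 20% success once plans reach roughly ten or more steps, without any deeper
% dissociation between prediction and planning being required.
```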