ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (33%) | 4.00 | 4.00 | 6034 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 2 (67%) | 5.00 | 3.50 | 5478 |
| Total | 3 (100%) | 4.67 | 3.67 | 5663 |
Adapting Vision-Language Models for Evaluating World Models

Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes UNIVERSE, a new VLM-based method for evaluating "action recognition" and "character recognition" tasks over world model generations in a visually rich game (Bleeding Edge). Performance on these tasks is treated as a proxy metric / indicator of world model generation quality. The paper conducts an extensive empirical case study on this game. First, the authors investigate performance on ground-truth data, validating alignment on valid frame and action sequences and comparing to simple VLM baselines. Then, the authors investigate design choices, such as data composition, for tuning the performance of the approach. Lastly, the authors study the evaluation quality of the proposed method on the two tasks on data generated by world models via a human study in which humans rate the outputs of the algorithm.

Strengths:
- The idea of using VLMs for semantic task evaluation as a proxy for generation quality evaluation is new and interesting.
- The method presents multiple options for training / fine-tuning different subsets of parameters. Notably, updating only the projection head (0.07% of parameters, according to the paper) proves highly effective. I believe this piece of evidence could be important for the community.
- The paper is overall well written, easy to follow, and thorough. The appendix is detailed and includes further results and analysis to support the method and the design choices.
- For the included experiments, the methodology is generally solid.
- Open-source code, evaluation data, and human annotation data.

Weaknesses:
`W1`: The paper only presents VLM baselines for world model evaluation, which, according to the authors, have never been used before for that purpose. How does the method compare to simple world model evaluation approaches such as measuring mean squared error or FVD on a held-out test set (between generations and ground truth, using a prefix as initial context), or training control policies under various reward signals (goals) through world model interaction and measuring success via cumulative returns? The paper should include a comparison to existing evaluation baselines. (A minimal sketch of such a baseline is given after this review.)

`W2`: The results in Fig. 9 suggest poor accuracy of the method on setting 8, which involves data generated by a smaller model. The authors attribute this to a "resolution mismatch". However, since smaller models tend to generate trajectories that deviate more significantly from the ground truth, I suspect the results could indicate that the method is more indicative for correct or higher-quality generations, while it may fail to identify cases where the world model generations diverge from the ground truth, which are the more important subset of cases to capture. This is also related to the rollout filtering concern below.

`W3`: Rollout filtering: I am concerned that the filtering procedure in the paper may filter out poor generation cases, leaving mostly high-quality, simpler (and easier to classify) rollouts. Importantly, in cases where the world model fails to generate the correct dynamics given the sequence of actions (conditioning), the output could be among the filtered sequences. Thus, the results may be indicative of successful generations, but not of poor ones. Given the significant effort put into this paper, I suggest collecting a reasonable test set of cases where the generated content (dynamics) deviates from *the ground truth*, i.e., the actual dynamics under the same "raw" actions (for conditioning) starting from the same context, in a deterministic environment, and then demonstrating that the proposed approach reliably captures these cases. Preferably, also consider inaccurate world models such as the small model used in the paper. Aim to generate both (1) examples of poor reconstructions (blur, artifacts, etc.) and (2) incorrect but convincing dynamics that look valid but deviate from the ground truth.

`W4`: The proposed evaluator's outputs are given in natural language. Given only the description in the paper, it is unclear how this method would be applied in a practical world model evaluation setting (e.g., in a new environment). Specifically:
1. Does the method require collecting a test set of trajectories from the real environment to serve as reference trajectories?
2. How would the questions and answers automatically adapt to a new environment?
3. Suppose we have generated a set of trajectories (at inference time) with a world model, and the proposed evaluator emitted a corresponding set of answers / strings. How does the approach determine the quality of the generations (noting that each generated trajectory could be long and span multiple 14-frame segments)? How is performance aggregated over a single trajectory that spans multiple segments?
4. Is the approach valid only for evaluating very short trajectory segments?
5. What exactly is required to apply this method in an independent application, and how does one obtain a single aggregated score that reflects the generation quality?
The paper should answer these questions clearly.

`W5`: The method seems to rely heavily on the "Description Generation" step (line 188) for converting sequences of frames and raw actions to language form. It is non-trivial to assume that such an approach would work reliably in general domains; inferring character movement direction in a third-person view, for example, could be much easier than inferring intricate finger movements in a first-person view. It is unclear how reliable this step is, i.e., how reliably it captures cases where the actions are not aligned with the generated frames (which typically occurs when the world model is given an action sequence that leads to novel dynamics behavior that was not sufficiently explored / observed and thus is not properly represented in the training data). I suggest carrying out a specific study on this aspect, similar to the one I propose in `W3` above, and showing that (hopefully) the proposed approach is indeed reliable.

`W6`: Claiming an effective world model quality evaluation method, as the paper title and abstract suggest, requires several diverse benchmarks (environments / games) with diverse dynamics (between benchmarks), studies that show evaluation quality over longer horizons (which could be aggregated scores over 14-frame segments in your case), and more extensive studies, as suggested above. Demonstrating that a method works on a single game (although with multiple "stages" / "maps") does not sufficiently support such a general claim.

`W7`: While the motivation and claims suggest a very general method for world model evaluation, e.g., "a method for adapting VLMs to rollout evaluation under data and compute constraints" (abstract) and "establishing UNIVERSE as a lightweight, adaptable, and semantics-aware evaluator for world models." (abstract), in practice the approach is only valid for video environments, and only evaluates action and character recognition over very short horizons. I expect the claims to represent the method more accurately, i.e., to limit their scope.

`W8`: "Assesses whether generated sequences accurately reflect the effects of agent actions ***at each timestep***" (line 173). This claim is inaccurate, as the method was only validated at a coarser action granularity (language descriptions).

`W9`: Line 1454: the variable `R` is overloaded, being redefined for both recall and the set of response n-grams.

`W10`: Line 1768: are the large-scale and small-scale models swapped?

`W11`: "The task distribution favors Action Recognition (αAR = 0.8) ***due to its stronger causal grounding***" (line 255). Please support this claim; it is non-trivial.

`W12`: The binary vs. multiple-choice vs. open-ended comparisons are presented throughout the results. First, it is not clear enough that these are complementary sets of QA (the comparison could also suggest that different formats of presenting the same Q/A are being compared). Second, while presenting the results for all QA types provides empirical insight, I do not think this is the most important aspect of the work, and it is somewhat confusing that it takes up very significant real estate in the main paper. I believe the results would be clearer if aggregated scores were presented first, to indicate overall performance clearly, in the same way that the final generation quality evaluation (score) is given, with the results for the separate QA types presented later (or in the appendix). To be clear, I do not consider this a major weakness.

Questions:
`Q1`: In "Rollouts Generation" (line 1774), how exactly is the action sequence used for conditioning determined? How was the 1-second context chosen?
`Q2`: "d = Describe(c(1:L), m)" (line 1223): how exactly does `Describe` work? Is it trained specifically on data from this game? Is it necessary to train such a model in order to use the method to evaluate world models in practice?
`Q3`: How do UNIVERSE's character recognition results in Figure 1 align with those in Figure 2 (right), i.e., 99%+ vs. 84%?

EditLens Prediction: Fully human-written
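To make the baseline comparison requested in `W1` concrete, below is a minimal sketch, assuming world-model generations and their ground-truth continuations (sharing the same prefix/context) are available as aligned frame arrays. The function and variable names are hypothetical illustrations, not the paper's interface; FVD would follow the same protocol but compare distributions of video features rather than pixels.

```python
# Hedged sketch of the kind of low-level baseline meant in W1: per-rollout
# mean squared error between a generation and the ground-truth continuation
# that shares the same prefix/context. Array names and shapes are assumptions.
import numpy as np

def rollout_mse(generated: np.ndarray, ground_truth: np.ndarray) -> float:
    """Mean per-pixel squared error over one rollout.

    Both arrays are assumed to have shape (T, H, W, C) with values in [0, 1].
    """
    assert generated.shape == ground_truth.shape
    return float(np.mean((generated - ground_truth) ** 2))

def heldout_set_mse(pairs) -> float:
    """Average the per-rollout MSE over a held-out set of (generation, ground-truth) pairs."""
    return float(np.mean([rollout_mse(gen, gt) for gen, gt in pairs]))
```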
Adapting Vision-Language Models for Evaluating World Models

Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper presents a method for training a VLM to evaluate the action and character consistency of a world model. The main contributions are: 1) a large-scale gameplay dataset for training and evaluating world models; 2) a pipeline for generating inconsistent actions and characters from the ground-truth data, along with question-answer pairs; 3) a parameter-efficient method for training a VLM to evaluate the action and character consistency of a world model.

Strengths:
Overall this problem is very timely given recent interest in world models. While this work focuses on a single game, the authors extensively ablate the different components of their method and show that their approach is effective and likely generalizable to other environments. In particular, the authors thoroughly ablate the training data composition and the VLM training method. Additionally, they provide a thorough evaluation of the model's generalization to out-of-distribution data by evaluating on held-out environments and on rollouts from eight different world models.

Weaknesses:
The main weakness, as the authors mention in the limitations section, is whether this method can be applied beyond video games to other environments. The training-data construction method relies on action logs from the game, which are not available for other environments and might be costly to acquire for a large set of environments. The problem is likely compounded for real-world simulators used for robotics and other embodied agents.

Questions:
1) How do you scale UNIVERSE to a wider set of gaming environments? Would this require partnerships with more game studios (likely costly and limiting)?
2) You mention that applying UNIVERSE to long-horizon trajectories is an open problem. Is this simply due to the training samples being 14 frames long, or are the models less capable of reasoning across longer image sequences?

EditLens Prediction: Fully human-written
Adapting Vision-Language Models for Evaluating World Models

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper presents UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a framework for evaluating the quality of video rollouts generated by world models. The authors propose a structured evaluation protocol comprising two identification tasks: Action Recognition (AR), which assesses whether generated videos align with input action sequences at the timestep level, and Character Recognition (CR), which evaluates entity identity consistency over time. Each task is evaluated through three question-answering formats (binary, multiple-choice, and open-ended). UNIVERSE is built upon the PaliGemma 2 3B architecture and adapts the model by fine-tuning only the multimodal projection head (0.07% of parameters). The authors conduct extensive experiments totaling 5,154 GPU-days and validate their approach through human evaluation across 8 environment settings.

Strengths:
1. **Well-motivated problem with practical importance:** The paper addresses a genuine need in the world model community for fine-grained, semantically-aware, and temporally-grounded evaluation methods. The limitations of low-level metrics like FID/FVD are well articulated.
2. **Structured and actionable evaluation protocol:** The AR/CR task decomposition with multiple QA formats provides a clear, operationalizable framework that other researchers can adopt and extend. This structured approach is more systematic than ad-hoc human evaluation.
3. **Comprehensive experimental validation:** I appreciate the thoroughness of the ablation studies examining fine-tuning configurations, frame sampling strategies, supervision amounts, and data mixing ratios. These provide valuable empirical insights for practitioners.
4. **Rigorous human evaluation:** The human study design (240 rollouts, 1440 QA instances, multiple annotators with adjudication, Cohen's κ reporting) demonstrates careful attention to validation methodology.
5. **Resource efficiency:** Achieving competitive performance while fine-tuning only 0.07% of parameters is practically valuable for researchers with limited computational resources.

Weaknesses:
1. **Missing comparison with existing evaluation benchmarks:** My primary concern is that the paper motivates UNIVERSE by suggesting existing benchmarks lack certain capabilities, but I could not find any direct experimental comparison with methods like VBench or EvalCrafter on the same data. I would strongly encourage the authors to provide such comparisons, specifically measuring how different evaluation protocols correlate with human judgments on identical rollouts. This would substantiate the claim that UNIVERSE offers necessary improvements over existing tools.
2. **Unclear positioning of methodological contribution:** The core technical approach (fine-tuning only the projection head) appears to be a well-established practice in the VLM training literature (often called the "alignment stage"). While applying this effectively to a new domain is valuable, I suggest the authors clarify whether this represents a novel discovery or an effective application of known techniques. This would help readers better understand the nature and scope of the contribution.
3. **Incomplete coverage of foundational related work:** I notice that the paper does not cite some foundational work in the "model-as-judge" paradigm, particularly Zheng et al. (2023), "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," which pioneered using models as automatic evaluators. Since UNIVERSE directly extends this paradigm to multimodal evaluation, I recommend including this reference to provide clearer intellectual lineage.
4. **Limited scope of generalization validation:** While the paper evaluates 8 settings, all data comes from a single game environment (Bleeding Edge). I observed that performance degrades noticeably at lower resolutions (Setting 8: 56.55% on AR), suggesting sensitivity to distribution shifts. I would encourage testing on more diverse world model types (e.g., autonomous driving, robotics) to better understand generalization capabilities.
5. **Potential circular reasoning in baseline comparisons:** The "task-specific baselines" mentioned in the paper appear to be other variants trained by the authors rather than established external methods. While these comparisons are informative for understanding different adaptation strategies, I suggest clearly distinguishing between internal architectural ablations and comparisons with established community benchmarks.

Questions:
1. **Comparison with existing benchmarks:** Could you provide experimental results comparing UNIVERSE's scores with those from VBench and EvalCrafter on the same set of rollouts? Specifically, I would be interested in seeing correlation coefficients (Spearman/Pearson) between each method's scores and your collected human judgments. This would help clarify how UNIVERSE's evaluation protocol relates to and potentially improves upon existing approaches. (A minimal sketch of this analysis is given after the review.)
2. **Methodological positioning:** Could you clarify whether the projection-head-only fine-tuning approach was adopted as a known technique from the VLM training literature, or whether you discovered it independently through your experiments? If it builds on known practices, how does your application differ from or extend them?
3. **Generalization to other domains:** Have you conducted any preliminary experiments on world models from domains other than gaming (e.g., robotics, autonomous driving)? What challenges do you anticipate in adapting UNIVERSE to these domains?
4. **Data efficiency analysis:** Given the large computational investment (5,154 GPU-days), could you provide more insight into which experiments contributed most to your key findings? This would help the community understand resource allocation for similar adaptation tasks.
5. **Baseline clarification:** When you mention achieving "comparable performance to task-specific baselines," could you clarify which specific external methods or benchmarks these refer to, versus internal architectural variants?

EditLens Prediction: Fully AI-generated
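To make the correlation analysis requested in question 1 above concrete, here is a minimal sketch, assuming each evaluator produces one scalar score per rollout aligned with a human quality judgment on the same rollout. The function and variable names are placeholders, not the paper's data or interface.

```python
# Hedged sketch for question 1: linear (Pearson) and rank (Spearman) correlation
# between an automatic evaluator's per-rollout scores and human judgments on the
# same rollouts. Inputs are assumed to be equal-length, aligned score sequences.
from scipy.stats import pearsonr, spearmanr

def agreement_with_humans(evaluator_scores, human_scores):
    """Return (Pearson r, Spearman rho) for two aligned per-rollout score lists."""
    r, _ = pearsonr(evaluator_scores, human_scores)
    rho, _ = spearmanr(evaluator_scores, human_scores)
    return r, rho

# Hypothetical usage: reuse the same human_scores for UNIVERSE, VBench, and
# EvalCrafter scores, then compare the resulting correlation values.
# r, rho = agreement_with_humans(universe_scores, human_scores)
```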