ICLR 2026 - Reviews

Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 2 (50%) | 2.00 | 4.50 | 5216 |
| Fully human-written | 2 (50%) | 3.00 | 4.00 | 2468 |
| Total | 4 (100%) | 2.50 | 4.25 | 3842 |
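
The Total row can be cross-checked as the count-weighted mean of the two non-empty categories. The sketch below recomputes it from the figures printed in the table. The variable and function names are illustrative, and the rounding (two decimals for rating and confidence, whole characters for length) is an assumption based on how the table is displayed.

```python
# Rows with at least one review, copied from the summary table:
# (label, count, avg_rating, avg_confidence, avg_length_chars)
rows = [
    ("Lightly AI-edited", 2, 2.00, 4.50, 5216),
    ("Fully human-written", 2, 3.00, 4.00, 2468),
]


def weighted_mean(values, weights):
    """Count-weighted mean of per-category averages."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


counts = [r[1] for r in rows]
total_count = sum(counts)
total_rating = weighted_mean([r[2] for r in rows], counts)
total_confidence = weighted_mean([r[3] for r in rows], counts)
total_length = weighted_mean([r[4] for r in rows], counts)

# prints: 4 2.5 4.25 3842
print(total_count, round(total_rating, 2), round(total_confidence, 2), round(total_length))
```

The recomputed values agree with the Total row (4 reviews, rating 2.50, confidence 4.25, 3842 characters).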
Title Ratings Review Text EditLens Prediction
Title: H2IL-MBOM: A Hierarchical World Model Integrating Intent and Latent Strategy for Opponent Modeling in Multi-UAV Game

Ratings:
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: This paper studies decision-making problems in mixed-motive scenarios where cooperation and defection coexist. It proposes a method that takes into account the nested interaction between the intents and strategies of agents (both opponents and allies). Without relying on other agents’ private information, the method hierarchically infers opponents’ intents and intent-based latent strategies, and predicts their influence on the behaviors of allies. The experiments on Gym-JSBSim, SMAC, and GRF validate the superior effectiveness of the proposed method over existing model-free and model-based approaches.

Strengths: The paper focuses on both intent-based strategies and the interactions among agents’ intents and strategies. It proposes a transformer-based hierarchical opponent inference and decision-making method within the reinforcement learning framework, and extensive experiments across three environments verify its effectiveness. Overall, the paper presents a substantial amount of work with comprehensive experiments and detailed methodological development, and contributes valuable insights to the study of intent modeling and opponent inference in mixed-motive multi-agent systems.

Weaknesses:
1. It is not easy to follow this paper because of the inconsistent use of symbols and technical terms. See details in **Questions**.
2. In mixed-motive games, agents should consider both allies' and opponents' intents and strategies, while the proposed method insufficiently addresses allies' intents and strategies. This may lead to a failure of coordination within the team.
3. The introduction does not include any citations. From the introduction, it is unclear how this project is related to mixed-motive games. Please further refine and reorganize the introduction section.

Questions:
1. At the high level, the observation model $p_{\theta_i}$ predicts observations $O_{opp,i,t}$ based on intents $z_{I,i,t}$, while the observations are in turn used by $q_{\phi_I}$ to infer intents. With such coupling, model errors may gradually accumulate. How do the authors address this issue? The same question applies to the low level.
2. In line 269, why does the policy take into account allies' intents and latent strategies? In mixed-motive games, individuals need to model not only their opponents but also consider the behaviors of their allies in order to achieve better coordination.
3. Do $p_{\theta_I}$ in line 257 and $p_{\theta_L}$ in line 266 predict observations rather than trajectories?
4. In Section A.3, opponents' relative position, relative velocity, angles, and distance relative to others are included in the observation, which is inconsistent with the statement in line 224 that opponents' actions are not observable.
5. There seems to be no fundamental difference between stage 2 and stage 3 in subsection 3.1.
6. The notation used in the paper is somewhat confusing. For example, in line 186, $a_{i,t}\sim \pi(\cdot|o_{i,t}, z_{I,i,t}, z_{L,i,t})$ shows that agent $i$'s action depends only on cooperative agents' $o_{i,t}$, $z_{I,i,t}$, and $z_{L,i,t}$, where $N$ is the number of cooperative agents. This is inconsistent with "cooperative agents update their policies based on trajectory and observations and inferred opponent intentions and strategies" given in Section 3. Please modify the problem formulation and unify the notation.

EditLens Prediction: Fully human-written

Title: H2IL-MBOM: A Hierarchical World Model Integrating Intent and Latent Strategy for Opponent Modeling in Multi-UAV Game

Ratings:
Soundness: 2: fair
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary: The paper proposes an opponent modeling method, H2IL-MBOM, that integrates multi-intention and latent strategy inference into a world model. H2IL-MBOM combines high-level intention inference with low-level strategy prediction to deal with the non-stationary dynamics of multi-agent environments. H2IL-MBOM is combined with PPO, which results in MSOAR-PPO. The effectiveness of the method is evaluated on several multi-UAV games.

Strengths:
- The idea of applying a world model to opponent modeling is interesting.

Weaknesses:
- The paper is poorly written. For instance, many concepts (e.g., intentions, mental state, strategies) lack clear definitions. Many figures in the experimental section are hard to interpret, and the captions are not informative (e.g., Figure 4). The many abbreviations make the paper very hard to follow.
- The proposed method lacks novelty. Similar ideas (e.g., reasoning about latent strategies based on teammates’ historical responses) have been extensively explored in previous opponent modeling methods, such as [1-3], which are missing from the Related Work section.
- Most of the baselines (e.g., MADDPG, MAPPO) are not opponent modeling methods. It is necessary to compare with SOTA opponent modeling methods, such as [1-3].

[1] Greedy when sure and conservative when uncertain about the opponents, ICML 2022
[2] Conservative offline policy adaptation in multi-agent games, NeurIPS 2023
[3] Opponent modeling with in-context search, NeurIPS 2024

Questions:
- What exactly do you mean by intentions, strategies, and mental state? Could you provide clear definitions of these concepts?

EditLens Prediction: Fully human-written

Title: H2IL-MBOM: A Hierarchical World Model Integrating Intent and Latent Strategy for Opponent Modeling in Multi-UAV Game

Ratings:
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary: This paper introduces H2IL-MBOM, a framework for opponent modeling designed to address non-stationarity in multi-agent adversarial environments. The method's core is a hierarchical world model that mimics human cognitive processes by decomposing the complex task of reasoning about an opponent into two levels. At a high level, the model infers an opponent's macro "intention" by analyzing their historical trajectories. Subsequently, at a low level, it uses this inferred intention as a condition to deduce the specific "latent strategy" the opponent is employing, taking into account the reactions and movements of allied agents. This framework is implemented through a complex neural architecture based on Transformers and Hypernetworks (HyperHD2TSSM) and is used to guide a PPO-based reinforcement learning agent. The authors report that their method demonstrates superior performance compared to various baselines in several testbeds, including multi-UAV combat, the StarCraft Multi-Agent Challenge, and Google Research Football.

Strengths:
- The paper introduces a novel approach to opponent modeling inspired by human cognition, decomposing the complex problem into a two-level hierarchy of high-level "intentions" and low-level "latent strategies". This provides a structured and theoretically-grounded new perspective for the field.

Weaknesses:
1. This paper, in its current state, is difficult to accept. The core issue is not just a matter of style, but a fundamental lack of clarity in its presentation that prevents a proper scientific review. The manuscript is plagued by a host of minor yet cumulative errors that suggest a lack of care in its preparation. For instance, citations are not properly formatted (lacking \citep or \citet), leading to overlaps with the text. There are basic punctuation errors (e.g., missing periods on lines 199 and 218), inconsistent formatting (the acronym HJLGT is sometimes italicized, sometimes not), and poor typesetting (some words are hyphenated across lines between 249-269). Furthermore, the figures are poorly executed; the text in Figure 3 is very small, while the architectural diagrams in Figures 5 and 6 are so cluttered they seem designed more to showcase the model's complexity than to explain it. The overall writing quality, with its convoluted sentence structures and excessive jargon, resembles unedited text generated by a large language model, a practice that should be acknowledged if used.
2. This poor presentation directly obscures the methodology. The main body of the paper has been effectively "hollowed out," with critical information relegated to the appendix. For example, MSOAR-PPO is listed as a key contribution, but its mechanics are entirely absent from the main methodology section. Similarly, the dimensionality of the core latent variables for "intention" and "strategy", a crucial implementation detail, is only found in a table in the appendix. The reader should not have to be a detective, piecing together the core method from scattered fragments. This forces a reviewer to question the paper's central claims. The entire framework rests on a rigid two-level hierarchy where "intention" dictates "strategy," a strong cognitive assumption presented without justification. The paper also fails to provide evidence that the learned latents, $z_I$ and $z_L$, are actually disentangled. The t-SNE plots are insufficient as they only visualize clustering, not semantic meaning. A more rigorous analysis, such as performing interventional experiments (e.g., fixing the "intention" latent while varying the "strategy" latent and observing the impact on generated trajectories), is needed to validate that these variables meaningfully represent their claimed concepts.
3. The experimental evaluation is similarly unconvincing. The decision to place the results on standard benchmarks (SMAC and GRF) in the appendix is a major flaw that undermines the paper's claims of generalizability; these should be in the main paper. The primary UAV experiment relies on comparisons against opponent modeling baselines (e.g., ROMMEO, PR2) that perform catastrophically. Their complete failure strongly suggests a lack of proper hyperparameter tuning for this complex, continuous-control environment. For a fair comparison, the authors must either provide evidence of a thorough tuning process for these baselines or include stronger, more suitable ones. The ablation study, while showing that components are useful, does not justify the model's immense complexity. The fact that simpler variants (e.g., the only_intentions model from Fig. 4a and especially the transformer_shareadd model from Fig. 4f, which performs nearly identically to the full model) are still effective raises a critical question about the cost-benefit trade-off. The authors should provide a discussion justifying why the marginal performance gain of their full architecture warrants its significant complexity over these simpler, yet competent, alternatives.

Questions:
1. The prior distribution for an agent's latent state (e.g., $p(z_{I,i,t}|...)$ on page 5) explicitly conditions on the latent states and actions of its neighbors ($z_{I,n_i,t-1}, a_{n_i,t-1}$). How is this neighbor information accessed or communicated between agents during execution, especially within what is described as a decentralized execution paradigm?
2. Appendix A.9 states that the value function for MSOAR-PPO "does not incorporate respective opponents' observations," distinguishing it from MAPPO. However, the policy is conditioned on these observations ($O_{opp,i,t}$). In a centralized training paradigm, why would the critic be deprived of information that is available to the actor, as this typically undermines the core benefit of CTDE?
3. In the scalability tests (Appendix A.11), a 4 vs. 6 engagement resulting in a 3:4 survival rate is described as a success where the smaller team "destroys one more aircraft than the blue team". Could you clarify this interpretation, as a 3:4 result (Red:Blue) means the 4-agent team lost one agent while the 6-agent team lost two, which is not an equal or better kill-death ratio per capita (0.25 vs 0.33 losses per agent)?
4. The reward functions in Appendix A.3 are highly complex. Specifically, the formulas for "reward regarding position of planes" (Eq. 2) and "reward regarding velocity of planes" (Eq. 8) appear to use a very similar calculation for the variable `dd` based on antenna and aspect angles (ATA, AA). Could you explain the rationale for using this angular metric to modulate both a position-based and a velocity-based reward?
5. In the H2TE module (Appendix A.5.1, Eq. 13), the weight $w_{H,i,j,t}$ for agent `i`'s view of opponent `j` is generated from $H_{j,t}$, which is defined as the observation history of opponent `j` relative to *all* N cooperative agents. How does an individual agent `i` get access to the opponent's historical observations relative to its teammates during decentralized execution?
6. The problem is defined as a partially observable one where agents only have local observations. However, the key transition model `HJLGT` (Appendix A.6) and the prior distributions both explicitly use neighbor states and actions as inputs. Does this imply that agents are assumed to have perfect, noise-free observation of their immediate neighbors' states and actions, and if so, shouldn't this be stated as a key assumption in the problem formulation?

EditLens Prediction: Lightly AI-edited

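
The arithmetic behind question 3 in the review above can be laid out explicitly. This is a minimal sketch that assumes, as the question does, that the red team starts with 4 aircraft, the blue team with 6, and that "3:4" means 3 red and 4 blue aircraft survive. It further assumes every loss is a kill by the other team; the function name is illustrative.

```python
def engagement_stats(start, survivors):
    """Return (losses, losses per starting agent) for one team."""
    losses = start - survivors
    return losses, losses / start


# Assumed reading of the 4 vs. 6 scenario with a 3:4 (Red:Blue) survival result.
red_losses, red_rate = engagement_stats(start=4, survivors=3)
blue_losses, blue_rate = engagement_stats(start=6, survivors=4)

# Assumption: each team's kills equal the other team's losses.
red_kills, blue_kills = blue_losses, red_losses

print(f"red:  lost {red_losses}, destroyed {red_kills}, {red_rate:.2f} losses per agent")
print(f"blue: lost {blue_losses}, destroyed {blue_kills}, {blue_rate:.2f} losses per agent")
```

Under these assumptions the red team destroys one more aircraft in absolute terms, and the per-agent loss rates are 0.25 for red and 0.33 for blue, which are the figures quoted in the question.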
Title: H2IL-MBOM: A Hierarchical World Model Integrating Intent and Latent Strategy for Opponent Modeling in Multi-UAV Game

Ratings:
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: The paper studies cooperative-competitive MARL. It proposes to learn both intention and strategy representations of opponents and uses this information to update the beliefs and policies/strategies of the agents involved in the game. The paper conducts experiments on several MARL benchmarks, including Gym-JSBSim, SMAC, and GRF, in comparison with multiple baselines including both opponent-model-free and opponent-model-based MARL algorithms.

Strengths:
1. The paper introduces a new approach to modeling opponents' decision making (with the goal of separating intentions from strategies) and integrates it into the strategy/policy learning of the agents in the game.
2. Experiments show promising results, as the proposed method is shown to perform better than other strong baselines on various benchmarks.

Weaknesses:
1. Writing and Clarity: The paper is not well written. In particular, Section 3, the main technical section, requires substantial revision. The section consists of long, dense paragraphs that lack a clear and structured explanation of the proposed model. The heavy use of acronyms and lengthy equations further obscures the main ideas rather than clarifying them. More importantly, the intuitions and justifications behind the model’s components, as well as how they connect, are not clearly explained. A concise, intuitive description of the model and its motivation is necessary to make the paper accessible and convincing.
2. Separation Between Intention and Strategy: The paper needs stronger justification for the proposed separation between intention and strategy. The key question is how the proposed components actually capture intention and distinguish it from strategy prediction. The manuscript does not clearly explain what specific mechanisms or characteristics of H2TE-MITD and LHTE-MLTD enable this distinction. The authors should provide clearer explanations to support the claim that these modules can meaningfully separate intentions from strategies.
3. Cooperative–Competitive Setting: Although the paper discusses a mixed cooperative–competitive setting, the proposed approach appears primarily designed to address competitive interactions. It remains unclear how the model effectively handles both cooperation and competition within the same framework.
4. Baseline Performance and Reliability of Results: The reported performance of baseline methods, such as MAPPO on SMAC environments, is significantly lower than in established works (e.g., the recent HAPPO paper). This discrepancy raises concerns about the experimental setup and the reliability of the reported results. The authors should verify their implementations, hyperparameters, and training conditions to ensure a fair and credible comparison.
5. Supplemental Material: The supplemental zip file could not be opened, preventing further examination of the additional materials. Please ensure that the supplementary files are correctly packaged and accessible.

Questions: Please address my concerns raised in Weaknesses.

EditLens Prediction: Lightly AI-edited

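
The separation concern raised in these reviews (whether the intent and strategy latents capture different things) is the kind of claim an interventional check can probe: hold the intent latent fixed, vary the strategy latent, and see which decoded trajectory features move. The sketch below only illustrates that procedure. `toy_decoder` is a hypothetical stand-in whose behavior is hard-coded, not the paper's model, and the feature names and coefficients are invented for the example.

```python
import random
import statistics


def toy_decoder(z_intent, z_strategy):
    """Hypothetical stand-in for a learned trajectory decoder.

    Returns two trajectory-level features: a coarse goal heading that
    (by construction here) depends only on the intent latent, and a
    maneuver intensity that depends on both latents.
    """
    goal_heading = 90.0 * z_intent
    maneuver_intensity = 0.5 * z_intent + 2.0 * z_strategy
    return goal_heading, maneuver_intensity


def intervene(decoder, fixed_intent, strategy_samples):
    """Hold the intent latent fixed, sweep the strategy latent, and
    return the spread (population std) of each decoded feature."""
    outputs = [decoder(fixed_intent, z_l) for z_l in strategy_samples]
    headings = [o[0] for o in outputs]
    intensities = [o[1] for o in outputs]
    return statistics.pstdev(headings), statistics.pstdev(intensities)


rng = random.Random(0)  # fixed seed so the sweep is reproducible
strategy_samples = [rng.gauss(0.0, 1.0) for _ in range(200)]
heading_spread, intensity_spread = intervene(toy_decoder, 1.0, strategy_samples)

# If the two latents are separated as claimed, the intent-linked feature should
# stay (nearly) constant under this intervention while the strategy-linked one varies.
print(f"heading spread: {heading_spread:.3f}, intensity spread: {intensity_spread:.3f}")
```

With a real model, the decoder would be replaced by the learned generative model and the features by measurable trajectory statistics; a large spread in an intent-linked feature under this intervention would suggest the latents are entangled.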