ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 0 (0%)   | N/A        | N/A            | N/A                |
| Heavily AI-edited    | 1 (20%)  | 4.00       | 5.00           | 3945               |
| Moderately AI-edited | 1 (20%)  | 4.00       | 3.00           | 1400               |
| Lightly AI-edited    | 3 (60%)  | 4.00       | 4.00           | 3572               |
| Fully human-written  | 0 (0%)   | N/A        | N/A            | N/A                |
| Total                | 5 (100%) | 4.00       | 4.00           | 3212               |
MASTARS: Multi-Agent Sequential Trajectory Augmentation with Return-Conditioned Subgoals

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
The authors present MASTARS, a diffusion-driven framework for data augmentation in offline multi-agent reinforcement learning. MASTARS introduces an agent-wise sequential generation mechanism, where each agent's trajectory is conditioned on those already generated. To further refine the synthetic data, the method incorporates return-conditioned subgoals to selectively reconstruct low-quality segments and applies a value-based ordering scheme that determines which agents are generated first based on their expected returns. By combining these ideas with a diffusion inpainting procedure inspired by image generation, MASTARS achieves better inter-agent coordination without the burden of high-dimensional joint modeling. Empirical results on MPE and SMAC benchmarks indicate that the approach leads to consistent and notable gains over existing offline MARL augmentation baselines.

Strengths:
1. The paper is clearly written and logically organized, making it easy to follow the motivation, methodology, and experimental setup. The visualizations effectively support the explanations.
2. The research problem tackled is highly relevant to offline multi-agent reinforcement learning, where coordinating multiple agents from limited or suboptimal data remains an open challenge. The motivation for addressing this issue is well-grounded and convincing.
3. The proposed approach is methodically designed, building upon established principles of diffusion modeling and inpainting while tailoring them to the multi-agent context.

Weaknesses:
1. While the paper presents a well-motivated framework, the claimed novelty around the use of diffusion-based inpainting is somewhat limited. Although the authors are the first to explicitly adapt a RePaint-style mechanism to multi-agent data augmentation, similar ideas of trajectory refinement and selective regeneration have already appeared in prior diffusion-based MARL works [1]. As a result, the contribution feels more like a thoughtful integration of existing components, rather than a fundamentally new methodological direction.
2. The paper demonstrates robustness of MASTARS to small random perturbations in value estimates, which is reassuring. That said, offline value estimation is often affected by non-random biases (e.g., limited coverage, distributional shift, or partial observability), not merely random noise. Discussing whether such biases could influence the stability of agent ordering or subgoal selection would strengthen the paper's practical relevance.

[1] Yuan et al. Efficient Multi-agent Offline Coordination via Diffusion-based Trajectory Stitching. ICLR 2025.

Questions:
1. Could the authors clarify in what ways MASTARS differs from other offline MARL data augmentation methods, and highlight the relative strengths and weaknesses of these approaches?
2. The paper emphasizes addressing data scarcity in offline MARL via trajectory augmentation. However, all reported experiments combine the augmented trajectories with the original dataset, leaving it unclear what performance can be achieved using the augmented data alone. Could the authors clarify the intended role of the generated trajectories? Are they meant purely as a distributional supplement to the original data, or could they potentially replace sub-optimal original trajectories? Additionally, why is it still necessary to rely on the original sub-optimal dataset if augmentation alone could provide sufficient coverage?
3. In the introduction, the authors note that naively concatenating all agents' data into a single model for joint generation often leads to sample inefficiency due to high dimensionality. However, in MASTARS, regardless of whether agent-wise or more sophisticated modeling is used, the diffusion model still needs to fit the joint distribution of all agents. Could the authors clarify how their approach actually addresses the high-dimensionality and sample inefficiency issue raised in the introduction?

EditLens Prediction: Heavily AI-edited
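To make the generation scheme discussed in this review concrete, the following is a minimal sketch of agent-wise sequential sampling with value-based ordering as the summary describes it; the function names (`estimate_value`, `diffusion_inpaint`) and the conditioning interface are assumptions for illustration, not the authors' implementation.

```python
def sequential_augment(dataset_trajs, estimate_value, diffusion_inpaint):
    """Illustrative sketch: generate agents one at a time, higher estimated
    value first, conditioning each agent on the trajectories produced so far.

    dataset_trajs:     dict agent_id -> original trajectory segment
    estimate_value:    callable(traj) -> scalar expected-return estimate
    diffusion_inpaint: callable(traj, context) -> regenerated trajectory,
                       assumed to wrap a RePaint-style inpainting sampler
    """
    # Value-based ordering: agents with higher estimated returns go first.
    order = sorted(dataset_trajs,
                   key=lambda a: estimate_value(dataset_trajs[a]),
                   reverse=True)

    generated = {}
    for agent in order:
        # Each agent is conditioned on all previously generated trajectories.
        context = [generated[a] for a in generated]
        generated[agent] = diffusion_inpaint(dataset_trajs[agent], context)
    return generated
```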
MASTARS: Multi-Agent Sequential Trajectory Augmentation with Return-Conditioned Subgoals

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes MASTARS, a diffusion-based framework designed to augment MARL trajectories. The method seeks to enhance offline MARL performance by addressing issues related to data quality and coordination among agents. It achieves this by leveraging a sequential agent-wise generation strategy with return-conditioned subgoals. The authors demonstrate that their method provides coordinated, diverse, and realistic multi-agent trajectories that improve the performance of offline RL methods when applied to augmented datasets.

Strengths:
1. The sequential generation with subgoals provides a new way to generate coordinated behaviors across agents.
2. MASTARS shows consistent improvements over several baseline approaches such as DoF, MADiff, and INS.
3. The paper provides a thorough explanation of the proposed method, making it easy to follow.

Weaknesses:
1. The environments (both MPE and SMAC) are relatively simple and may not fully showcase the potential of MASTARS in more complex, real-world, or large-scale settings. Although existing offline MARL methods typically choose these two environments (perhaps for data convenience), this undoubtedly causes the community to stagnate. I believe that tasks in SMAC and MPE do not sufficiently support the claim of "combinatorial complexity in joint modeling." If the authors could provide experiments in more modern or challenging environments, it would significantly enhance the quality of the paper. I am open to revising my rating should such additions be included.
2. The sequential paradigm is a common approach in MARL, yet the related work section lacks coverage of key works in this area, such as MAT [1], PMAT [2], HARL [3], SeqComm [4], DIMA [5], and SeqWM [6].

[1] Multi-Agent Reinforcement Learning is a Sequence Modeling Problem. NeurIPS, 2022.
[2] PMAT: Optimizing Action Generation Order in Multi-Agent Reinforcement Learning. AAMAS, 2025.
[3] Heterogeneous-Agent Reinforcement Learning. JMLR, 2024.
[4] Multi-Agent Coordination via Multi-Level Communication. NeurIPS, 2024.
[5] Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective. arXiv, 2025.
[6] Empowering Multi-Robot Cooperation via Sequential World Models. arXiv, 2025.

Questions:
1. In Table 1, the performance of the "medium" setting is sometimes worse than the "random" setting. Similarly, Table 2 shows cases where "good" performs worse than "medium". Could the authors clarify the reasons behind these results?
2. Would the concept of subgoals be applicable in environments that are not fully cooperative, or even in competitive settings?

EditLens Prediction: Lightly AI-edited
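As a rough illustration of what the return-conditioned subgoal mechanism mentioned in this review could look like in practice, the sketch below flags low-return segments of a trajectory for regeneration while keeping high-return segments; the segmentation granularity and threshold rule are assumptions for illustration, not taken from the paper.

```python
def select_regeneration_mask(segment_returns, threshold):
    """Illustrative sketch: mark segments whose estimated return falls below
    `threshold` for diffusion-based regeneration; keep the rest as-is."""
    return [r < threshold for r in segment_returns]

# Example: only the second and fourth segments would be regenerated.
mask = select_regeneration_mask([0.9, 0.2, 0.8, 0.1], threshold=0.5)
assert mask == [False, True, False, True]
```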
MASTARS: Multi-Agent Sequential Trajectory Augmentation with Return-Conditioned Subgoals

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes a diffusion-based inpainting mechanism for data augmentation in multi-agent offline reinforcement learning. The method aims to enhance both the quantity and quality of offline data to facilitate more effective policy learning.

Strengths:
1. The paper is well-organized and logically structured, making it easy to follow.
2. The experimental results are promising.

Weaknesses:
1. As the authors mention, there have already been several studies on data augmentation in multi-agent reinforcement learning. However, the Introduction mainly highlights the advantages of MADiff itself, without clearly articulating what specific advantages it has compared to prior works.
2. Although the authors provide extensive experimental validation of the method's effectiveness, the evaluation is not conducted in more challenging environments such as SMAC2 or settings involving a larger number of agents.
3. The Method section introduces three key components, yet the ablation study does not include an analysis of the third component, Value-based Agent Ordering.

Questions:
1. What are the main differences and advantages of this work compared to other data augmentation approaches in multi-agent reinforcement learning?
2. Could the authors provide results on more challenging scenarios to better validate the robustness of the proposed method?
3. Why is there no ablation study for Value-based Agent Ordering in Figure 5?

EditLens Prediction: Moderately AI-edited
MASTARS: Multi-Agent Sequential Trajectory Augmentation with Return-Conditioned Subgoals

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The authors propose a novel diffusion-based framework, MASTARS, for generating coordinated multi-agent trajectories through agent-level sequential generation. MASTARS employs a diffusion inpainting mechanism, where each agent's trajectory is conditioned on the previously sampled agents' trajectories. This design enables fine-grained inter-agent coordination while avoiding the complexity of high-dimensional joint modeling. To further enhance sample quality, MASTARS integrates reward-conditioned subgoals, allowing the model to exploit valuable data that might otherwise be discarded. By combining agent-level generation with goal-conditioned modeling, MASTARS produces realistic and coherent multi-agent rollout trajectories, thereby facilitating more effective offline multi-agent reinforcement learning (MARL) training. Experimental results on benchmark environments demonstrate that MASTARS significantly improves the performance of existing offline MARL algorithms, validating its effectiveness and generality in collaborative multi-agent scenarios.

Strengths:
1. This paper introduces a return-conditioned subgoal mechanism that allows the model to selectively regenerate only low-quality segments while preserving high-quality behavior segments from the original dataset. This mechanism improves data quality and generation efficiency, reducing unnecessary or ineffective reconstructions.
2. The method employs a diffusion inpainting mechanism to achieve fine-grained coordination. Experiments (Fig. 3 and Fig. 6) demonstrate that this strategy effectively generates more coherent and collision-free coordinated behaviors. The experimental validation is comprehensive, showing consistent improvements across algorithms. On the MPE and SMAC multi-agent benchmarks, the approach achieves over 30% improvement compared to the original datasets.

Weaknesses:
1. MADiff is fundamentally an offline multi-agent RL method. Therefore, the characterization on page 2, lines 69–70, describing it as "an existing diffusion model-based data augmentation method" is inaccurate and potentially misleading.
2. The trajectory acceptance criterion relies on a threshold $\epsilon$ (Eq. 6); however, the paper does not provide a theoretical justification for the choice of $\epsilon$, nor does it present any sensitivity analysis to assess how varying $\epsilon$ affects the results.
3. Both sub-goal identification and generation ordering rely on the value function. If the value estimates are unstable or corrupted by noise (as shown in Table 4, where performance drops sharply when $\sigma > 0.1$), the quality of the generated trajectories deteriorates significantly.
4. In lines 71–77 on page 2, the authors state that "The independent approach frequently …" and that "the joint method ….". However, in Fig. 1, the results of these two generation methods appear to be the opposite.

Questions:
1. If trajectory generation is always performed according to descending value estimates, could this potentially over-constrain the behavior of lower-value agents, thereby reducing policy diversity?
2. Is DIRE able to accurately reflect differences in RL trajectory distributions, and is it therefore appropriate to use this metric to assess the novelty of generated samples?
3. MASTARS employs multiple RePaint calls and joint training of multiple models (value, transition, inverse, and reward). Could this approach incur additional inference costs?

EditLens Prediction: Lightly AI-edited
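The sensitivity concern raised in Weaknesses 2 and 3 of this review could be probed empirically; below is a minimal sketch of one such check, measuring how often Gaussian noise of scale $\sigma$ in the value estimates changes the descending-value agent ordering. The setup (i.i.d. Gaussian noise added to fixed value estimates) is an assumption for illustration, not an experiment from the paper.

```python
import numpy as np

def ordering_stability(value_estimates, sigma, trials=1000, seed=0):
    """Fraction of noisy draws that preserve the descending-value agent order."""
    rng = np.random.default_rng(seed)
    values = np.asarray(value_estimates, dtype=float)
    base_order = tuple(np.argsort(-values))
    unchanged = sum(
        tuple(np.argsort(-(values + rng.normal(0.0, sigma, values.size)))) == base_order
        for _ in range(trials)
    )
    return unchanged / trials

# Closely spaced value estimates make the ordering fragile to even small noise.
print(ordering_stability([1.00, 0.95, 0.90], sigma=0.1))
```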
MASTARS: Multi-Agent Sequential Trajectory Augmentation with Return-Conditioned Subgoals

Soundness: 4: excellent
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper extends data augmentation to multi-agent offline reinforcement learning, identifying and addressing the combinatorial growth of joint state–action spaces and the necessity for coordinated behavior by introducing an inpainting-based diffusion approach. Experimental results on several widely used multi-agent reinforcement learning benchmarks demonstrate that the proposed method achieves a reasonable performance improvement.

Strengths:
1. The proposed method is novel. It addresses the combinatorial growth of joint state–action spaces and the necessity for coordinated behavior through the concept of image inpainting, which is a major challenge in data augmentation for multi-agent reinforcement learning.
2. The paper is well-written, logically organized, and methodologically rigorous despite its complexity. It conducts extensive experiments on several popular multi-agent offline benchmarks (such as MPE and SMAC). The results show that MASTARS significantly outperforms existing baselines across various tasks and difficulty levels, demonstrating the effectiveness of the proposed approach.
3. The overall structure of the paper is clear, with well-organized sections introducing each component of the model and the training process.
4. MASTARS effectively tackles the difficulties of data augmentation in multi-agent reinforcement learning caused by joint distribution and cooperation issues. Its performance on challenging tasks and innovative solutions to complex problems provide valuable insights for the multi-agent reinforcement learning community.

Weaknesses:
1. The authors' explanation of Figure 1 seems problematic. They claim that "The independent approach frequently results in collisions as agents ignore each other's behavior. In contrast, the joint method improves collision avoidance." However, based on my observation, in Figure 1(c) the trajectories of the green and red agents also collide, and even intersect more severely than in (b). Doesn't this contradict the statement that "the joint method improves collision avoidance"?
2. Some symbols are not clearly defined. For example, according to my understanding, in lines 247–248 the statement "The generating reverse sampling inner loop starts with index k = K" refers to K as the number of diffusion steps. However, the "diffusion step K" is only mentioned in the Preliminaries section, and it is not explicitly stated that this notation is carried over into the Method section. It is recommended to restate the definition of a symbol if it has not been used for a long time.
3. As far as I understand, the value function used to determine the augmentation order in this paper is trained independently for each agent. However, in practice, cooperation among multiple agents is common (e.g., in SMAC). Therefore, using an independently trained value function seems somewhat crude. Since this value function plays a key role in critical stages of the sampling process, such as subgoal selection and generation-order decisions, I am concerned that its simplicity may become a bottleneck for the proposed method.
4. During the sampling process, the trajectories of agents are generated sequentially. This implies that the generation of the $l$-th agent's trajectory is conditioned on all previously generated trajectories (those with $k < l$). I am concerned that as more agents are generated, the decision space for later agents may become increasingly restricted, potentially limiting diversity or coordination flexibility.
5. The early stage of the sampling process depends on subgoals determined by the independently trained value function. As mentioned in Weakness 3, this value function does not consider inter-agent relationships (such as cooperation or competition). This limitation might negatively affect subgoal generation and, consequently, the overall sampling process.

Questions:
1. Could the authors provide further clarification regarding the discussion of Figure 1? (Please refer to Weakness 1.)
2. I am concerned that the value function, which does not account for inter-agent relationships (such as cooperation or competition), may affect the decision of the generation order (Section 3.3). (Please refer to Weaknesses 3 and 5.)
3. During the sampling process, each agent's trajectory is generated sequentially, meaning that the generation of the $l$-th agent's trajectory depends on all previously generated trajectories for $k < l$. Is there a similar mechanism during training to ensure that the diffusion model learns this dependency?
4. I am curious about how this method performs in visual environments.
5. Why does the method generate trajectories sequentially according to the decided order, rather than sampling directly from the joint distribution of all N agents?

EditLens Prediction: Lightly AI-edited
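For context on the "reverse sampling inner loop" quoted in Weakness 2 of this review, here is a generic RePaint-style inpainting loop in sketch form; the helpers `q_sample` (forward noising) and `p_sample` (one reverse denoising step) are assumed interfaces, and this is not the authors' implementation.

```python
import torch

@torch.no_grad()
def repaint_style_inpaint(model, x_known, mask, K, q_sample, p_sample):
    """Generic RePaint-style loop: at each reverse step k = K..1, entries with
    mask == 1 are replaced by a noised copy of the known data, while entries
    with mask == 0 are filled in by the model's reverse diffusion step."""
    x = torch.randn_like(x_known)            # start from pure noise at step K
    for k in range(K, 0, -1):
        x_data = q_sample(x_known, k)         # noise the known part to level k
        x = mask * x_data + (1 - mask) * x    # keep known entries, inpaint the rest
        x = p_sample(model, x, k)             # one reverse denoising step
    return x
```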