|
Protein as a Second Language for LLMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes Protein-as-Second-Language (PSL), a training-free framework that enables large language models to interpret protein sequences as a “second language.” Instead of fine-tuning, PSL performs retrieval-based in-context learning by constructing bilingual contexts that pair amino-acid sequences with natural-language descriptions. The authors build a 79K protein–QA corpus via Gene Ontology–based functional grouping, MMseqs2 clustering with semantic deduplication, and automatic QA generation using DeepSeek-R1 across four question types. During inference, PSL selects relevant examples based on sequence homology and semantic similarity, forming adaptive prompts for frozen LLMs (GPT-4o, Qwen, Mistral). Across three benchmarks (ProtDescribe, Protein2Text-QA, Mol-Instructions), PSL achieves up to 17.2% ROUGE-L improvement, outperforming domain-specific models like ProLLaMA-7B and BioT5+, and reframes protein understanding as retrieval-driven bilingual reasoning rather than supervised fine-tuning.
This paper introduces a conceptually novel and computationally efficient framework that enables large language models to understand protein sequences through bilingual contextual reasoning without any fine-tuning. In addition to the framework, the authors construct a large-scale bilingual protein–text corpus containing 79,926 sequence–question–answer pairs, which serves as the foundation for retrieval-based in-context learning and systematic evaluation.
1. The bilingual corpus is constructed using Swiss-Prot as the primary data source, while the evaluation datasets are also derived from or highly overlap with Swiss-Prot. The paper does not provide sufficient details on how potential data leakage or overlap was prevented, which raises concerns about the fairness of evaluation.
2. Each inference involves a retrieval step to construct query-specific contexts, but the computational overhead and latency introduced by this process are not analyzed. The practical efficiency of the framework therefore remains unclear.
3. The method assumes that proteins with high MMseqs2 similarity share similar functional or semantic contexts. However, this assumption may not always hold, especially for multi-domain proteins. A more critical discussion or ablation on this assumption would strengthen the justification.
4. The experimental comparison includes only two domain-specific baselines, ProLLaMA-7B and BioT5+, which may not be sufficient to establish broad effectiveness. Including more diverse or fine-tuned protein LLMs could improve the reliability of the conclusion.
5. The framework appears to treat protein sequences and text jointly as input without a dedicated modality projector or alignment module. While this simplifies the design, it may not fully exploit cross-modal complementarities, and more structured feature integration could further enhance performance.
1. Could the authors clarify whether the constructed bilingual corpus overlaps with the evaluation datasets? Since both the corpus and the benchmarks seem to originate from Swiss-Prot or related sources, it would be important to specify how potential data leakage was prevented to ensure fair evaluation.
2. The paper transforms Swiss-Prot annotations into multiple QA formats rather than using the full annotations directly. What is the motivation for this choice, and would incorporating broader and more complete biological knowledge lead to more stable contextual enhancement?
3. The proposed framework involves a retrieval step for each query to build adaptive bilingual contexts. Could the authors discuss the computational overhead introduced by this process and its impact on inference time and scalability compared with fine-tuned models?
4. The exemplar selection process is described as combining both sequence homology (via MMseqs2) and semantic similarity between QA pairs. Could the authors elaborate on how these two signals are integrated into the final retrieval score or ranking criterion? (A hypothetical example of what I mean is sketched after these questions.)
5. The method assumes that proteins with high MMseqs2 similarity share similar contexts. Have the authors considered using alternative similarity measures, such as embedding-based similarity from ESM or structure-based similarity from ProTrek, and could they provide related ablation results?
6. In the comparison with ProLLaMA and BioT5+, were these models fine-tuned on any part of the proposed corpus (like RAFT), or were their publicly released parameters used directly for inference? Please clarify whether additional training or adaptation was performed to ensure a fair comparison. |
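To make question 4 concrete, here is one hypothetical way the two signals could be combined; this is purely an illustrative assumption on my part, not necessarily what the paper does. Let $\mathrm{sim}_{\mathrm{seq}}(q, c)$ denote the MMseqs2 sequence-similarity score between the query protein $q$ and a candidate exemplar $c$, $\mathrm{sim}_{\mathrm{sem}}(q, c)$ the cosine similarity of their question/description embeddings, and $\lambda \in [0, 1]$ a mixing weight:

$$\mathrm{score}(c \mid q) = \lambda \cdot \mathrm{sim}_{\mathrm{seq}}(q, c) + (1 - \lambda) \cdot \mathrm{sim}_{\mathrm{sem}}(q, c)$$

Please clarify whether the integration is a weighted sum of this form, a two-stage filter (homology retrieval followed by semantic re-ranking), or something else entirely.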
Fully AI-generated |
|
Protein as a Second Language for LLMs |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The Protein-as-Second-Language (PSL) framework is a method for protein function understanding using large language models (LLMs) without fine-tuning. The approach reformulates amino acid sequences as symbolic language and uses adaptive, bilingual context construction (sequence + natural language) to enable zero-shot reasoning. A curated dataset of ~80k protein–QA pairs spanning functional, descriptive, and reasoning tasks supports the method.
- Introduces the idea of treating protein sequences as a "second language" for LLMs, bridging symbolic biological and natural language reasoning.
- The bilingual corpus is diverse (79k QA pairs across 4 types), functionally rich, and biologically balanced.
- Works across frozen LLMs (3B–14B) and improves both open-source and proprietary models in zero-shot settings.
- No wet-lab or structure-level validation is presented; success is only measured by text-based QA metrics (e.g., ROUGE-L).
- How does the method perform on out-of-distribution or rare protein families, especially those absent from the QA corpus? |
Heavily AI-edited |
|
GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces GenCape, a novel generative framework for Category-Agnostic Pose Estimation (CAPE) that learns structural relationships directly from support images. The key innovation lies in automatically inferring keypoint connectivity patterns (soft adjacency matrices) without requiring predefined skeleton graphs, keypoint identifiers, or text descriptions. The framework comprises two main components: The Iterative Structure-aware Variational Autoencoder (i-SVAE) learns instance-specific graph structures from support features using variational inference, with iterative refinement across decoder layers. The Compositional Graph Transfer (CGT) module then dynamically combines multiple graph hypotheses through Bayesian fusion and query-guided attention mechanisms.
This is the first CAPE method to achieve fully automatic learning of structural relationships from image support sets, removing the need for predefined skeletons, keypoint IDs, or text descriptions, which enhances both generality and practical deployment. The i-SVAE approach models structural uncertainty through variational inference, demonstrating superior robustness compared to discriminative methods like SDPNet, particularly when handling support-query mismatches or occlusion scenarios. The method consistently outperforms various baselines on MP-100, with particularly notable advantages under strict thresholds (e.g., PCK@0.05).
1. While the paper claims to evaluate cross-supercategory generalization on MP-100, the definition of supercategories appears inconsistent with the original MP-100 benchmark and prior CAPE literature (e.g., CapeFormer).
The original MP-100 dataset is widely understood to group categories into four high-level semantic domains: human body, human/animal face, vehicle, and furniture. However, this work instead uses a finer-grained 8-supercategory split (e.g., separating Felidae, Canidae, Ursidae as distinct supercategories), which blurs the line between "cross-category" and "cross-subcategory" generalization. For instance, transferring from Felidae to Ursidae involves structurally similar quadruped animals with comparable keypoint layouts—this is arguably intra-domain transfer, not the more challenging cross-domain shift (e.g., chair → human) that truly tests category-agnostic capability. Worse, the paper does not include any cross-domain transfer between the canonical four domains (e.g., furniture → human body). This omission is critical: if the method cannot generalize from chair to person, its claim of "structure-inductive" modeling is significantly weakened.
2. The paper does not provide any comparison of computational efficiency, such as inference time, FLOPs, model size, or throughput, against baseline methods like GraphCape or CapeFormer. While it introduces additional modules (i-SVAE and CGT) that likely increase computational cost, no quantitative analysis or efficiency trade-offs are reported.
Figures 4 and 5 reveal some localization errors. What is the primary cause of these errors: structural inference failures or visual feature ambiguity?
Could the authors provide quantitative analysis of these failure modes?
What is the method's robustness to scale variations, cropping, and other common transformations?
Have additional evaluation metrics beyond PCK been considered, such as AUC or other standard pose estimation metrics? |
Fully AI-generated |
|
GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The submission focuses on the task of category-agnostic pose estimation with few annotated example images. Specifically, the authors propose a novel generative framework named GenCape to estimate keypoints without additional textual descriptions or predefined skeletons. The authors propose a Structure-aware Variational Autoencoder to infer instance-specific adjacency matrices from support features, and also propose a Graph Transformer Decoder to progressively refine the estimated results. The experiments are conducted on a large-scale benchmark dataset, indicating the effectiveness of the proposed novel framework.
1. The task of category-agnostic pose estimation is interesting and fundamental for extending pose estimation to a larger number of categories.
2. The idea of using a generative framework is reasonable and makes sense.
3. The proposed Structure-aware Variational Autoencoder and Compositional Graph Transfer are novel and effective for modeling pose structure information.
4. The performance of the proposed framework is demonstrated on a large-scale benchmark dataset and outperforms the SOTA dramatically.
5. The experimental analyses are comprehensive and clear.
1. As discussed in Lines 69-74, the support images may contain severe occlusions or incomplete annotations; how does the proposed method address this issue? E.g., if the query image has 2 occluded keypoints while the support image has another 3 occluded keypoints, can the proposed method estimate all the visible keypoints?
2. What is the complexity of the proposed framework? It seems to be about O(M^2). Is the proposed method cost-effective?
3. Can the proposed method produce diverse results based on VAE sampling? How should we understand the tension between diversity and consistency in the proposed VAE-based method?
4. Closely related works are missing from the related work section.
> 1. @inproceedings{chen2025weakshot, title={Weak-shot Keypoint Estimation via Keyness and Correspondence Transfer}, author={Chen, Junjie and Luo, Zeyu and Liu, Zezheng and Jiang, Wenhui and Li, Niu and Fang, Yuming}, booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems}, year={2025} }
>
> 2. @inproceedings{lu2024openkd, title={Openkd: Opening prompt diversity for zero-and few-shot keypoint detection}, author={Lu, Changsheng and Liu, Zheyuan and Koniusz, Piotr}, booktitle={European Conference on Computer Vision}, year={2024} }
See Weakness. |
Fully human-written |
|
GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work suggests solving CAPE by progressively inferring instance-specific keypoint relationships from the support, instead of using predefined annotated adjacency matrices. The authors also introduce the Compositional Graph Transfer module, which helps incorporate the query features, thus allowing for less reliance on keypoint relationships inferred from the support. This makes the model more robust to occlusions and discrepancies between the support and query. The new GenCape approach is tested on the well-known MP-100 benchmark, achieving SOTA results.
1. The paper is written in a clear language that was easy to follow.
2. While predicting keypoint relations from the data is not new, the novel i-SVAE and CGT components offer some interesting insights that may be of value to the CAPE community.
3. The suggested approach achieves SOTA while dropping the need for predefined annotated data (keypoint connectivity) that was used by previous methods.
4. Other than the main experiment in Table 1, the design choices are justified in the ablations conducted (Tables 4, 5, and 6).
My main issues are with the presentation, not with the method. After resolving these issues, I would positively consider increasing my rating.
1. The technical text (mostly) in the Methods section:
Line 157: M_C is not defined in the right place. Move it to this sentence.
Line 179: remove 1 between.
Lines 190-195: i-SVAE also infers graphs from the support. And as you mention, there is sometimes a discrepancy between the support and the query. So I'm not sure that i-SVAE alone will solve the issue mentioned in these lines. However, i-SVAE combined with CGT will.
Line 208: F_s^(l-1) is not defined properly - what is its value where l=1?
Line 212: in the second row of Equation 1, should this be F_s^(l-1) or F_s^(l)?
Equations 3 and 4: A^~(l) is defined twice?
Equation 6: F_s^(l) is in the input and output
Line 244: A^~(l) - different notation compared to Equation 3. Should be/not be in bold?
Line 248: This is not clear. Are keypoint locations predicted in each layer? Each layer of what? The Graph Transformer Decoder? Please clarify what the output of each layer in the Graph Transformer Decoder is.
Line 275: missing ')' in mu^(l
Line 377: "More detailed comparisons.": should this be a separate paragraph?
2. Figures:
Figure 2: Consider adding CGT to Figure 2 (A^l_fused is not enough to easily follow).
Figure 3: It is challenging to interpret the adjacency matrices. Consider showing the “best” links from the adjacency matrix as colored edges in your prediction.
Figure 4: last column is AutoCape.
You mentioned Text Graph Support as an approach for CAPE. Might fusing text with your approach increase performance? Could you hint at how you would incorporate text into your approach as future work (maybe also infer it from the support)? |
Fully human-written |
|
GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper suggests a novel CAPE method, which utilizes predicted graph structure for enhanced keypoint localization accuracy.
The method uses a graph VAE formulation to predict the graph, and further implements it iteratively within each decoder layer.
Using CGT, several sampled graphs are combined into a query-aware graph structure that aids in localization.
The authors show competitive results on the MP100 dataset.
- The paper suggests a novel method that deals with a limitation of recent graph-based methods.
- The paper is well written, and the proposed solution looks solid and practical.
- SOTA results compared to other CAPE methods on the MP100 dataset.
- Using only Fs to predict the adjacency matrix suggests that the structure information is embedded in Fs in the first place.
As self-attention can be seen as an all-to-all information-sharing mechanism, an explanation (or even a proof) of why self-attention cannot learn the relevant connections between keypoints should be added; see the one-line formula at the end of my comments for what I mean.
Specifically, the authors should explain how the current i-SVAE design adds to the self-attention already in the decoder.
- Iterative Graph Prediction - The suggested method works iteratively, predicting a different adjacency matrix for each decoder layer.
An ablation study showing the benefit of predicting a different adjacency matrix in each decoder layer, compared to using a single predicted adjacency matrix (computed from the encoder's output features, for example), should be presented to support the claim that the iterative design is superior.
- Qualitative skeleton visualization - Figure 3 is hard to understand.
It would be helpful to add the skeleton visualizations on top of the images, and not only show the adjacency matrix.
Maybe the width or opacity can correspond to the weight. It is crucial to make it easier to understand what structure is actually learned.
Small Note:
- Figures 4 and 5 label your method as AutoCape instead of GenCape.
- CGT - the adjacency matrices are sampled using the predicted mean and variance. Thus, it's not clear to me why each sample has its own mean and variance values (line 263), given that they are sampled from the same distribution.
This is further shown in equation 9, where alpha_n is not dependent on n at all.
- See weaknesses for other questions.
I'm willing to raise my score if the authors address my concerns. |
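To make the self-attention point in my first weakness concrete: the following is just the standard scaled dot-product attention formula, not anything specific to this paper, where $q_i$ and $k_j$ are the query/key projections of keypoint tokens $i$ and $j$ and $d$ is their dimension. The attention weights

$$A^{\mathrm{att}}_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i^{\top} k_j}{\sqrt{d}}\right)$$

already form a dense, data-dependent soft adjacency over keypoints, so the requested explanation or ablation should clarify what the explicitly predicted adjacency matrix adds beyond this implicit graph.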
Fully human-written |
|
PD$^{2}$GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces PD²GS, a novel framework for reconstructing and modeling articulated objects from multi-view images without manual supervision. Its core idea is to represent an object's various interaction states as continuous deformations of a single, shared canonical 3D Gaussian field, enabling smooth control and interpolation. Key contributions include a coarse-to-fine segmentation that automatically discovers rigid parts by clustering motion trajectories and refining boundaries with SAM, and the release of the RS-Art dataset for real-world evaluation.
The core idea of modeling all interaction states as continuous deformations of a single, shared canonical 3D Gaussian field is both simple and powerful. This elegantly sidesteps the "representational fragmentation" of prior two-state methods, enabling smooth, continuous control and interpolation of articulated poses, which is a major step towards high-fidelity digital twins. The framework automatically infers the number and boundaries of rigid parts without manual supervision. It achieves this through a clever coarse-to-fine process that first clusters Gaussians by their motion trajectories (using a VLM for part counting) and then refines boundaries using SAM, making it highly applicable to real-world objects with unknown kinematics.
The empirical evaluation lacks comparison to foundational dynamic scene representation works like D²NeRF or Gao et al.'s deformable 3DGS, which also model scenes via a canonical field and latent-code-driven deformation. This omission makes it difficult to assess the true novelty and contribution of the deformation modeling component beyond the specific task of articulation.
The method is explicitly noted to assume "accurate camera poses," and its robustness to pose estimation noise, a common issue in real-world applications, remains entirely unvalidated. This is a significant practical limitation that is not addressed through ablations or sensitivity analysis, casting doubt on the method's real-world readiness. While tested on objects with up to three parts, there is no evidence provided for the method's performance on objects with a higher number of articulated parts (e.g., >5). The clustering and segmentation pipeline may face challenges with increasing complexity, and its scalability remains an open and significant question.
1. Dynamic Scene Baselines: Why were foundational dynamic scene representations like D²NeRF or other deformable 3DGS methods not included as baselines? A comparison would help clarify whether the performance gains are specific to the articulated object modeling pipeline or also represent a general advance in deformation field modeling.
2. Camera Pose Robustness: The paper states an assumption of accurate camera poses. Could you provide an ablation or sensitivity analysis on the robustness of PD²GS to noisy camera poses, which are common in real-world SfM pipelines? This would significantly strengthen the claim of real-world applicability.
3. Handling Occlusion: The limitation of being unable to reconstruct unobserved geometry is acknowledged. Have the authors considered or experimented with incorporating learned or data-driven priors (e.g., diffusion models, symmetry) to plausibly complete the occluded parts of an object, especially around joints? |
Fully AI-generated |
|
PD$^{2}$GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work presents PD$^2$GS, a framework for modeling articulated objects that overcomes the fragmentation and drift issues in existing self-supervised methods. It learns a shared canonical Gaussian field and represents arbitrary states as continuous deformations, jointly encoding geometry and kinematics. By associating each state with a latent code and using vision priors for part boundary refinement, PD$^2$GS enables accurate part-level decoupling while maintaining coherence. The method supports part-aware reconstruction, continuous control, and kinematic modeling without manual supervision.
1. The paper introduces a unified framework that models articulated objects through continuous deformations of a shared canonical Gaussian field, effectively addressing the fragmentation and drift issues inherent in previous discrete-state reconstruction methods.
2. The method achieves part-level decoupling without manual supervision by leveraging generic vision priors and latent code associations, enabling fine-grained continuous control over articulated configurations.
3. The paper contributes RS-Art, a valuable real-to-sim RGB-D dataset with reverse-engineered 3D models, facilitating rigorous evaluation on real-world data.
1. The reconstruction results exhibit excessive noise, particularly evident in the real-world examples shown in Figure 13, which raises concerns about the method's robustness in practical scenarios.
2. In Section 3.2 on deformable Gaussian splatting, the methodology bears strong similarity to existing 4DGS works such as [a], yet these related approaches are not cited or discussed.
3. The paper does not provide information about inference time per sample, which would be valuable for understanding the practical applicability of the method.
4. There are some related works that are missing in the paper: [b][c][d][e]
[a] 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering;
[b] SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects;
[c] Part2GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting;
[d] REACTO: Reconstructing Articulated Objects from a Single Video;
[e] NAP: Neural 3D Articulation Prior.
Please see the weaknesses. I am hesitant about the rating primarily due to the reconstruction quality. Since this is fundamentally a reconstruction task, the results appear too coarse and do not meet the expected level of fidelity for such work. |
Moderately AI-edited |
|
PD$^{2}$GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper addresses the problem of reconstructing articulated objects from multi-view, multi-state observations. The approach first learns a smooth deformation of a shared canonical field for each interaction state, and then uses the resulting deformation trajectories to guide a progressive coarse-to-fine part segmentation. The segmentation is further refined using SAM-based cues and boundary-aware Gaussian splitting. The method then estimates per-part meshes as well as joint types and parameters. In addition, the paper introduces a new dataset, RS-Art, containing a large number of real-world captures of articulated objects.
1. The newly proposed dataset RS-Art should be useful for further research work if made public, especially those real-world captures.
2. The paper seems to achieve SOTA performance compared with baselines using multi-state multi-view images in most cases.
3. The authors conducted extensive experiments on different datasets.
1. The whole system seems to be composed of numerous parts, which may make it a little complicated and hard to extend.
2. Some visualizations of the newly proposed dataset, including the data itself and the reconstructed results in videos, would help readers grasp the new dataset.
3. The proposed method seems a little incremental, though it achieves the best performance in most cases. It does not deal with physical plausibility, such as 3D penetration. Its setting is also not unique, as the main difference from previous methods is the change from two states to multiple states. The authors may elaborate on what new insight we gain about building articulated objects.
Though I still have the mentioned concerns, I currently vote for borderline accept due to the extensive experiments and the SOTA performance.
See above. |
Fully human-written |
|
PD$^{2}$GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes PD²GS, a self-supervised framework for articulated object modeling using 3D Gaussian Splatting. It learns a shared canonical Gaussian field and represents each interaction state as a continuous deformation via latent codes. A coarse-to-fine segmentation clusters Gaussian primitives by deformation trajectories and refines part boundaries using SAM-guided splitting, enabling part-level reconstruction and motion estimation. The authors also introduce RS-Art, a real-to-sim RGB-D dataset for evaluating generalization. Experiments show strong improvements over prior work on both synthetic and real objects.
- Technical contribution: the paper proposes a conceptually elegant unification of geometry and kinematics via continuous deformation of a canonical Gaussian field. Coarse-to-fine segmentation combining motion trajectories with SAM-driven boundary refinement is both novel and effective.
- RS-Art dataset is a meaningful contribution, bridging synthetic–real gaps with paired RGB-D data and 3D models.
- Comprehensive experiments on an expanded PartNet-Mobility split and the new dataset demonstrate strong performance and generalization.
1. The pipeline is complex and involves many heuristic components, which limits the scalability of the method.
2. The proposed method seems to require multiple states, which places additional demands on data curation. Furthermore, ensuring that the camera coordinate systems of all states are aligned is a challenge. Outside the laboratory, such as in simple home scenarios, it is difficult to capture multiple states with aligned coordinate systems, and errors caused by coordinate misalignment are very likely to lead to failure.
see weakness |
Lightly AI-edited |
|
Building Data Framework and Shifting Perspectives for First-Person Exploration of Social Intelligence in LLMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents EgoSocialArena, a first-person evaluation framework for assessing social intelligence in large language models. The framework spans three dimensions — Cognitive, Situational, and Behavioral intelligence — covering cognitive intelligence evaluation (static and dynamic), counterfactual and parallel-world adaptation, and goal-driven human–LLM dialogues. Evaluations across 14 models and human baselines reveal that advanced models (e.g., GPT-5, Claude-sonnet-4) approach human-level reasoning in structured settings but still underperform in real-world and emotionally grounded contexts. The work offers a novel and scalable benchmark for studying socially adaptive behavior in LLMs.
(1) Novel evaluation perspective:
The paper proposes EgoSocialArena, a first-person evaluation framework that shifts LLM social intelligence testing from observer-based settings to first-person, ego-centric ones. This conceptual move is both original and timely, as it aligns model evaluation more closely with real-world human–AI interaction.
(2) Explicit framework design:
The three-layer taxonomy (Cognitive, Situational, and Behavioral intelligence) provides a clear and systematic structure for assessing different components of social reasoning.
(3) Comprehensive and scalable evaluation:
The study systematically evaluates both legacy and newly released LLMs across a wide parameter spectrum, from small chat-oriented models to frontier systems, as well as human benchmarks. Moreover, the authors construct a scalable evaluation dataset that integrates a sufficiently diverse set of scenarios, enabling broad comparison across models and human participants.
(1) Lack of robustness checks and reproducibility details:
Reproducibility is hindered by missing information about the evaluation setup. The paper does not specify decoding hyperparameters (e.g., temperature, top-p, top-k), nor whether results (e.g., scores in Table 2) were deterministic or averaged over multiple runs. Without these details, it is difficult to assess the stability and robustness of the reported accuracy scores.
(2) Inconsistent human-model setup in the Dynamic Cognition Evolution tasks:
The human baseline appears to be a static, one-shot questionnaire (as described in Section 4.1 and 4.2 ), whereas LLMs participate in multi-round interactive gameplays. Because the human group receives no iterative feedback or contextual memory, the two conditions differ substantially in input dimensionality and adaptation depth. This asymmetry undermines the statistical validity of the claimed human-level comparison and makes it unclear what aspect of “dynamic cognition” is actually being measured.
World-consistency in the Parallel World scenarios:
The paper claims to evaluate situational adaptation under parallel-world settings, yet some examples (e.g., Figure 2: “a robot living in an underwater city flaps its arms like a bird”) appear to mix incompatible environmental logics. It is unclear what the “correct” answer is in such cases. For instance, if the expected choice (e.g., option A) assumes that an underwater robot lacks the concept of flying creatures, then the very mention of “bird wings” in the question itself introduces a world-consistency error, as it relies on real-world ecological knowledge absent from that world. Could the authors clarify how world consistency was maintained during dataset construction? Were annotators instructed to avoid cross-domain references such as “birds” in non-terrestrial settings? If not, how might such inconsistencies affect the interpretation that these tasks test situational reasoning rather than surface-level linguistic analogy? |
Moderately AI-edited |
|
Building Data Framework and Shifting Perspectives for First-Person Exploration of Social Intelligence in LLMs |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces an innovative framework and methodology for evaluating the Social Intelligence (SI) of Large Language Models (LLMs), structured around three pillars: Cognitive, Situational, and Behavioral intelligence. The core innovation is the shift to a First-Person Exploration (FPE) paradigm, positioning the LLM as an active participant in complex social scenarios. A new benchmark, SocialNav, is built on the high-stakes Chinese Dark-Lords' Game (CDLG) to test real-time decision-making, deception, and adaptive strategy. Experimental results reveal that even state-of-the-art LLMs show low absolute performance, underscoring significant current limitations in achieving true social intelligence in dynamic, interactive tasks.
1. Perspective switching is a very important ability when performing Theory of Mind. "perspective shift can elicit social capabilities similar to Chain-of-Thought elicit math capabilities" is a novel and nice idea.
2. The authors contribute a new dataset SocialNav, which is valuable to the community.
3. The authors conducted lots of experiments and provide benchmark results and in-depth analyses of the results.
1. The hallucination problem in LLM-generated datasets is still there. With no human verification, the dataset quality is questionable, which reduces the dataset's value.
2. I think this paper is a "benchmark and dataset" paper. Maybe you should submit it to a dataset track?
3. There are many related papers on similar games, such as Avalon, Hanabi. What are the differences between yours and those papers?
4. I feel there is an "overclaiming" problem with the paper's main arguments. Perspective switching is simple and such designs already exist in previous papers. The three types of intelligence seem weird. Cognitive intelligence includes situational intelligence. In fact, most cognitive reasoning is situational. The names are not reasonable.
5. The performance gap between GPT and humans is not big. What does this mean? Has GPT already developed human-level social intelligence? Is the task no longer challenging? Or are there potential problems with the evaluation methods?
6. Why not build an agent architecture model for the task?
7. Did you try different seeds for your tests? I see no error bars.
See above.
Overall, I feel the paper is overclaiming. The contributions and novelty are not as big as claimed. |
Fully human-written |
|
Building Data Framework and Shifting Perspectives for First-Person Exploration of Social Intelligence in LLMs |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces EgoSocialArena, a framework designed to evaluate the social intelligence of LLMs from an egocentric perspective across three key dimensions of social intelligence: cognitive, situational, and behavioral.
The framework is built from converted existing social intelligence datasets to ensure coverage and scalability, with an emphasis on egocentric reasoning, which the authors argue better reflects real-world LLM agent scenarios.
- Cognitive intelligence is divided into two components:
1. Static cognition, derived from existing ToM benchmarks by replacing characters with "you".
2. Dynamic cognition, evaluated through simulated games in G0.8A and Limit Texas Hold'em. Both adversarial reasoning games are played against either rule-based or reinforcement-learning agents of varying skill levels.
- Situational intelligence is tested by converting subsets of existing benchmarks into egocentric question formats, supplemented with counterfactual and parallel-world scenarios (e.g., an underwater city).
- Behavioral intelligence is assessed through human–AI interaction in SOTOPIA, a social simulation sandbox.
The authors evaluate 15 LLMs and find that OpenAI-O3 performs best on cognitive intelligence, approaching human-level performance. For situational intelligence, Claude 4 and GPT-5 perform comparably but remain below human performance.
I enjoy how the paper proposes three core aspects of social intelligence: cognitive, situational, and behavioral, and highlights the limitation that current benchmarks address only one of them. This provides a clear taxonomy and conceptual framework for studying social intelligence in LLMs, offering valuable guidance for future research directions in the community.
The paper also reports interesting findings, such as that replacing the third-person view with the first-person view improves performance, especially for weaker models, which reveals an asymmetry in LLMs' abilities between first- and third-person viewpoints.
The paper does a good job of covering all three aspects by adapting and extending existing benchmarks and datasets. It also evaluates both leading closed models and popular open-source ones, giving a useful snapshot of current capabilities.
Finally, the paper is clearly organized and supported by helpful figures that make the framework and results easy to follow.
The framework’s task design and organization feel somewhat inconsistent across the three categories of social intelligence. Below are my main concerns for each dimension:
- Cognitive intelligence: despite interesting findings on LLMs' asymmetric ability between first- and third-person viewpoints, the paper's claim that such a first-person view would benefit real-world agent use cases is somewhat unsupported. I would suggest adding a specific case or experiment to support this claim.
- Situational intelligence: the definition of situational intelligence is unclearly motivated. In my perception such intelligence, especially as measured by Social IQA, tests a model's ability to perform commonsense reasoning under different social situations, while the counterfactual and parallel-world modifications seem somewhat irrelevant. It would be more sound to modify social rules instead of the rules of rock-paper-scissors, and to modify social characters, such as demographic information, rather than changing the setting to a moon colony.
- Behavioral intelligence: the way of measuring behavioral intelligence somewhat contradicts the scalability design principle that the authors emphasize at the outset. It also offers little novelty, as it crowdsources human–AI interactions. Moreover, I see that GPT-5 outperforms humans on SOTOPIA, which is not discussed.
Overall, the paper lacks definitions and supporting literature for each type of intelligence it categorizes. The logical structure could be improved, as the current one contains some inconsistencies and unclear focuses.
Here's a mix of questions and suggestions:
1. The discussion of the improvements from switching from the third-person to the first-person view is quite interesting and is probably worth exploring a little deeper, e.g., by analyzing the reasoning chains.
2. The process of converting data with human verification should be disclosed in more detail to support validity.
3. The rock-paper-scissors examples look rather irrelevant to social intelligence and more relevant to general reasoning and common sense. Can you provide some more relevant ones?
4. Why do you think using parallel worlds such as an underwater city would affect social intelligence? I can see it adds some adversity, but the reasoning remains unclear to me.
5. I suggest applying counterfactual modifications and environment diversification with a focus on social situations.
6. I highly recommend discussing the result that a few closed-source models like GPT-5 outperform humans on SOTOPIA, either through semantic analysis of GPT-5's strategies or an examination of the LLM judges' imperfections.
7. I suggest adding more discussion of each of the cognitive, situational, and behavioral intelligence dimensions to provide clear definitions. |
Fully human-written |
|
Building Data Framework and Shifting Perspectives for First-Person Exploration of Social Intelligence in LLMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses three major limitations in current evaluations of social intelligence in large language models (LLMs): (1) the reliance on single-dimensional assessment, (2) the dominance of a third-person “observer” perspective, and (3) the lack of intuitive comparison with human behavior. To tackle these issues, the authors propose a new framework called EgoSocialArena, which adopts a first-person (ego-centric) perspective and conducts systematic evaluations along three dimensions — cognitive, situational, and behavioral intelligence.
For cognitive intelligence, the paper converts classic Theory of Mind (ToM) datasets based on the Sally-Anne test into a first-person format to assess static cognition, and employs two multi-round games — Number Guessing (G0.8A) and LIMIT TEXAS HOLD’EM — to evaluate dynamic cognitive evolution during interactions.
For situational intelligence, the authors transform datasets such as SocialIQA, EmoBench, and ToMBench into first-person narratives to test real-world contextual understanding. They also manually construct Counterfactual Situations and Parallel World Situations to evaluate adaptability to non-standard social rules.
For behavioral intelligence, the framework builds upon the SOTOPIA dataset to assess goal achievement in human–machine interaction scenarios.
The evaluation based on EgoSocialArena yields several key findings:
1. In cognitive intelligence, OpenAI-O3 performs almost comparably with humans, while in behavioral intelligence, GPT-5 and Claude-sonnet-4 even surpass human participants in task completion.
2. The first-person perspective serves as a “performance catalyst” for most models, yet paradoxically causes performance drops in top-tier models such as GPT-5 and OpenAI-O3, suggesting that existing third-person ToM benchmarks may overestimate model capabilities.
3. Strong reasoning ability contributes to cognitive intelligence but remains insufficient for situational intelligence without rich social knowledge and contextual grounding.
4. Advanced models exhibit distinct behaviors. For example, GPT-5’s conversational expressions are somewhat rigid and repetitive, giving humans the distinct impression of conversing with a machine, whereas Claude-sonnet-4 frequently produces emotionally-laden expressions.
- This paper identifies key limitations in current evaluations of social intelligence in LLMs, such as the narrow focus on single-dimensional metrics and the lack of assessment settings involving human–machine interaction.
- It introduces novel evaluation setups, including human–machine interaction dialogues for assessing behavioral intelligence and Parallel World and Counterfactual tasks for evaluating situational intelligence, which have strong potential to become an important contribution to the field.
- The paper does not provide sufficient evidence showing what distinct model behaviors or findings the proposed framework reveals compared with prior social intelligence benchmarks (e.g., the agent–agent interaction patterns in SOTOPIA). The evaluation results lack an in-depth comparative analysis with previous studies, which to some extent weakens the justification for the necessity and uniqueness of the new framework.
- While the paper divides social intelligence into three pillars — cognitive, situational, and behavioral intelligence — it does not thoroughly discuss the theoretical relationships among these dimensions. For instance, are these three dimensions complete and orthogonal? Do they exhibit potential overlap (e.g., does behavioral intelligence inherently encompass cognitive and situational intelligence)? Moreover, when assigning certain datasets (such as ToMBench) to a specific dimension, the paper provides insufficient justification for why a given task is treated as assessing only that single dimension, rather than reflecting a composite of multiple abilities.
- The evaluation tasks within the framework are primarily adaptations or integrations of existing benchmarks, showing relatively limited originality or methodological innovation in task design itself.
- The study recruits 50 graduate students as baselines for multiple-choice evaluation and selects 10 participants for interactive dialogue assessment. This small sample size — especially the 10-person subset for interactive testing — may lack statistical representativeness, potentially limiting the reliability and generalizability of the human–model comparison.
- Given that the datasets used to evaluate “static cognition” and “real-world situational understanding” are adapted from earlier publicly available static benchmarks, have the authors considered the potential risk of data leakage or training set contamination when assessing the latest LLMs? Since many modern models may have been trained on these benchmarks or their variants, it would be important to clarify what measures were taken to ensure the validity and fairness of the evaluation results.
- The paper mentions that the “Parallel World Situation” and “Counterfactual Situation” datasets were manually constructed, yet it provides no detailed description of their construction process. Could the authors share more information about how these datasets were developed? For example, who created the data (e.g., domain experts, research assistants, or crowdworkers)? What specific guidelines or instructions were given to the annotators during data creation? Additional transparency in these aspects would help readers better assess the data quality, annotation consistency, and potential biases inherent in the newly introduced datasets. |
Moderately AI-edited |
|
Building Data Framework and Shifting Perspectives for First-Person Exploration of Social Intelligence in LLMs |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces EgoSocialArena, a novel and comprehensive framework for evaluating the social intelligence of Large Language Models (LLMs). The authors argue that existing benchmarks are fragmented, focusing on single pillars of social intelligence (cognitive, situational, behavioral) and primarily use a third-person, passive-observer perspective that misaligns with real-world agent applications. To address this, EgoSocialArena systematically evaluates LLMs from a first-person, egocentric perspective. Its key contributions include: (1) a method for converting third-person Theory of Mind (ToM) benchmarks to a first-person format; (2) the use of rule-based and reinforcement learning agents in interactive games (G0.8A, Texas Hold'em) to assess dynamic cognition evolution; and (3) the inclusion of non-standard situations (counterfactual and parallel worlds) to test situational adaptation. The paper presents a substantial evaluation of 14 foundation models against a human baseline, revealing that while frontier models are closing the gap in cognitive intelligence, significant room for improvement remains in situational intelligence and behavioral nuance.
Novel and Well-Motivated Conceptual Framework: The core idea—shifting social intelligence evaluation from a third-person to a first-person perspective—is timely, well-justified, and addresses a genuine gap in the literature. The three-pillar structure (cognitive, situational, behavioral) provides a holistic and systematic approach to a complex construct.
Methodological Rigor and Innovation: The proposed methods are inventive. The perspective conversion workflow is a practical contribution. The design of the dynamic cognition scenarios (G0.8A with multi-level rule-based agents and Texas Hold'em with RL agents) is sophisticated and provides a more authentic test of an LLM's ability to model an opponent's strategy over time.
Comprehensive and Scalable Evaluation: The evaluation is extensive, covering 14 models, including the most recent frontier models (GPT-5, Claude-sonnet-4, o3). The inclusion of a carefully collected human performance baseline is a significant strength, allowing for a meaningful interpretation of model scores. The authors correctly emphasize the framework's extensibility.
Valuable and Actionable Insights: The results go beyond mere leaderboards and offer insightful analysis. Key findings—such as the "performance catalyst" effect of the first-person perspective for most models, the limitations of pure reasoning models (e.g., DeepSeek-R1) in social situations, and the need for new behavioral metrics beyond "believability"—are valuable for the research community.
High-Quality Presentation: The paper is generally well-written, logically structured, and professionally formatted. The use of tables and figures is effective, and the inclusion of ethics and reproducibility statements is commendable.
Limited Scale of Behavioral Intelligence Data: The most significant weakness is the relatively small scale of the behavioral evaluation. With only 40 dialogue scenarios and a subset of 10 human participants, the findings in this critical dimension, while insightful, are based on a limited sample. This makes the strong claims about models surpassing humans in "goal completion" less statistically robust than the results from the larger-scale cognitive and situational evaluations (~1200 samples each).
Ambiguity in Baseline and Opponent Design:
Human Baseline: The description of the human baseline, while a strength, could be more detailed. Were the 50 graduate students compensated? Were they screened for specific backgrounds? A more detailed protocol would bolster the credibility of this crucial benchmark.
Rule-based Agents: The rationale for the specific cognitive levels (e.g., why an arithmetic sequence for Level 2?) is somewhat under-explained in the main text. A stronger justification for why these specific rule sets effectively represent increasing cognitive complexity would strengthen the dynamic cognition evaluation.
Writing and Statistical Minor Issues:
Figure Referencing: The text frequently references figures (e.g., Figure 1(A-C), Figure 3, Figure 4) that are not included in the provided excerpt. A reviewer would need these to fully assess the claims.
Metric Explanation: The scoring ranges for behavioral metrics (e.g., secret [-10, 0], relationship [-5, 5]) are mentioned but not explained. A brief sentence or citation on how these scores are determined by the GPT-4 evaluator would be helpful.
Statistical Testing: The paper reports score differences but does not appear to employ statistical significance tests. Stating whether the observed gaps (e.g., the 2.3 point difference between o3 and humans) are statistically significant would add weight to the conclusions.
Repetition: The main findings are repeated in the abstract, introduction, and experiment sections. While common, some tightening could improve conciseness.
Behavioral Data Scale: Given that behavioral intelligence is a core pillar of your framework, why was the dataset limited to 40 dialogues? Was this a constraint of human evaluation resources? Do you plan to scale this up in future work?
Generalization of Perspective Conversion: Your method for converting third-person to first-person benchmarks is a key contribution. How generalizable is this prompt-based method? Did you encounter any systematic failure cases or scenarios where the conversion was ambiguous or altered the fundamental reasoning required?
Baseline Agent Selection: For the dynamic cognition tasks, why were the specific rules for Level 2 (arithmetic sequence) and Level 3 (copying the gold value) chosen? Were other, potentially more human-like strategies considered and rejected?
Defining "Social Intelligence": The framework excellently decomposes social intelligence into three pillars. However, the behavioral results show GPT-5 achieving high goal completion but with "rigid" dialogue, while Claude-sonnet-4 uses more emotional expressions. How should the field weigh task efficiency against social authenticity when defining "socially intelligent" behavior?
Evaluation of Evaluation: You rightly point out that metrics like "believability" are saturating. Could you elaborate on how you would operationalize your proposed new dimensions, such as "sophisticated conversational strategies" and "emotionally expressive communication," in an automated or semi-automated way? |
Fully AI-generated |
|
Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces **Scene-R1**, a video-grounded vision-language model (VLM) for 3D scene understanding that operates **without point-wise 3D annotations**. The core innovation lies in a **two-stage reinforcement learning** pipeline: temporal grounding followed by image grounding, both guided by lightweight rewards such as IoU and format compliance. Additionally, the paper extends the **GRPO** approach to 3D understanding by introducing an *exact-match reward*, achieving performance comparable to 3D LLMs that rely on point cloud inputs. Qualitative results further demonstrate the effectiveness of the model’s reasoning process.
1. The method removes the need for 3D point-wise instance labels while maintaining competitive performance under weak supervision.
2. By explicitly outputting chain-of-thought (CoT) reasoning, Scene-R1 improves interpretability compared with previous 3D LLMs, aligning with the growing emphasis on model transparency and explainability.
3. The two-stage RL structure (temporal then spatial grounding) provides flexibility and task generality across different 3D understanding tasks.
1. **The performance against other 3D LLMs remains limited**. The comparison with **VLM-Grounder** is not entirely fair, as it is a training-free agent and the reported results are based on only a 250-sample subset. For a more rigorous evaluation, performance should be assessed on the same benchmark samples used by VLM-Grounder. Although the paper claims that the proposed method does not require instance masks, the distinction between bounding-box-based and segmentation-based supervision is largely mitigated by the use of pretrained **SAM**. Moreover, the baseline **LLaVA-3D** similarly does not depend on pre-extracted 3D bounding boxes or per-instance segmentation, and should therefore be regarded as a **direct and comparable baseline** to the proposed approach.
2. **Similar grounding method:** The concept of back-projecting SAM masks to obtain 3D bounding boxes is not novel. The authors do not clearly distinguish their method from prior approaches such as **VLM-Grounder**.
3. **Limited benchmarks:** The RL framework is introduced not only for transparency but also for generalization. However, the evaluation is restricted to in-domain datasets. Cross-dataset evaluations on **Nr3D [1]**, **Multi3DRefer [2]**, or **Video-MME [3]** are encouraged to validate generalization.
4. **3DVQA implementation:** The paper claims that Scene-R1 can be fine-tuned for 3D-VQA tasks (L272). However, neither the training data nor the evaluation includes 3DVQA datasets such as **ScanQA [4]**, **SQA [5]**, or **MMScan [6]**. Since **VSI-Bench** does not provide a training set, it is unclear what data were used.
5. **Efficiency concerns:** The proposed multi-stage grounding combined with a DeepSeek-R1-style reasoning process substantially reduces efficiency. Ablation results show that the thinking process yields only marginal performance gains, casting doubt on the overall effectiveness of the proposed method.
[1] Nr3D (ECCV 2020). https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123460409.pdf
[2] Multi3DRefer: Grounding Text Description to Multiple 3D Objects. arXiv:2309.05251. https://arxiv.org/abs/2309.05251
[3] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. arXiv:2405.21075. https://arxiv.org/abs/2405.21075
[4] ScanQA: 3D Question Answering for Spatial Scene Understanding. arXiv:2112.10482. https://arxiv.org/abs/2112.10482
[5] SQA3D: Situated Question Answering in 3D Scenes. arXiv:2210.07474. https://arxiv.org/abs/2210.07474
[6] MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations. arXiv:2406.09401. https://arxiv.org/abs/2406.09401
1. How does RL fine-tuning on grounding tasks improve performance on **VSI-Bench**? What prompts are used during VSI-Bench evaluation?
2. What is the **ablation setting**? The reported ablation results seem inconsistent with the main table. Additionally, what supervised fine-tuning (SFT) configuration is used in these ablations?
3. In L141, the authors state:
*“We exploit this prior to teach it to understand the 3D world and minimize the amount of task-specific reinforcement learning (RL). The same architecture is used in all tasks optimized with GRPO.”*
How exactly is the amount of task-specific RL minimized, given that the method introduces several task-specific rules, such as temporal and image grounding?
4. What do the **failure cases** look like? The paper presents only successful examples. A detailed failure mode analysis would provide deeper insight into the limitations of the proposed approach. |
Lightly AI-edited |
|
Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Scene-R1, a framework for 3D scene reasoning that operates directly on RGB-D video streams and, critically, requires no 3D point-wise annotations for training. The method uses a two-stage, VLM-based pipeline: (1) temporal grounding to select relevant video snippets and (2) image grounding to predict 2D bounding boxes. These 2D predictions are then lifted to 3D using SAM2 and a refinement module. The entire pipeline is optimized using reinforcement learning (GRPO), which both trains the model using lightweight 2D/textual rewards and encourages the generation of explicit chain-of-thought rationales for interpretability. The model is evaluated on 3D visual grounding, affordance grounding, and VQA, demonstrating competitive performance against other annotation-free baselines.
1. The most significant strength is the "annotation-free" nature of the 3D instance labeling. By learning from 2D bounding boxes and textual labels, the method drastically lowers the supervision requirements for 3D scene understanding, making it more scalable.
2. The integration of R1-style reinforcement learning to produce explicit chain-of-thought rationales adds a strong interpretability component, which is lacking in most 3D-aware LLMs.
3. The quantitative results are solid, showing that Scene-R1 outperforms other annotation-free baselines on several benchmarks (ScanRefer, SceneFun3D, VSI-Bench), validating the effectiveness of the proposed approach.
1. The system's design is a complex pipeline of multiple, powerful, pre-trained models (Qwen2.5-VL, SAM2, and a module inspired by SAI3D). This makes it difficult to ascertain how much of the strong performance is attributable to the novel RL framework versus the inherent power of these individual components.
2. The method's reliance on ground-truth depth ($D_t$) and camera poses ($T_t$) is a significant assumption. This data is not available in general "in-the-wild" videos and is the same data required to create the point clouds for detector-based methods. This weakens the claim of "bypassing 3D scene reconstruction" and limits the method's applicability to settings where a full 3D capture setup is already available.
3. The 2D-to-3D lifting process has several stages (2D box prediction, SAM2 segmentation, depth-based back-projection, refinement). This multi-step process seems susceptible to cascading errors, where a poor 2D box from the VLM could lead to an irrecoverably bad 3D localization.
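To make the cascade concrete, here is a minimal sketch of the depth-based lifting step the pipeline relies on, under standard pinhole assumptions; the intrinsics `K`, depth map `depth` ($D_t$), and camera-to-world pose `T_wc` ($T_t$) are assumed inputs, and this is not the paper's code.

```python
import numpy as np

def lift_mask_to_3d_box(mask, depth, K, T_wc):
    """Back-project masked pixels to world space and return an axis-aligned 3D box.

    mask:  (H, W) boolean 2D segmentation (e.g., from SAM2)
    depth: (H, W) metric depth map D_t
    K:     (3, 3) camera intrinsics
    T_wc:  (4, 4) camera-to-world pose T_t
    """
    v, u = np.nonzero(mask & (depth > 0))            # masked pixels with valid depth
    z = depth[v, u]
    pix = np.stack([u, v, np.ones_like(u)]).astype(np.float64)
    cam = np.linalg.inv(K) @ pix * z                 # 3 x N points in the camera frame
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    world = (T_wc @ cam_h)[:3]                       # 3 x N points in the world frame
    return world.min(axis=1), world.max(axis=1)      # axis-aligned 3D box corners
```

A single poor 2D box or mask feeds directly into this step, which is exactly the cascading-error concern raised in point 3.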
1. How critical is the explicit depth channel ($D_t$) and ground-truth pose ($T_t$)? What is the performance degradation if the model is run on RGB-only video and must rely on estimated depth/pose, or if it must operate without them? This seems to be the key bottleneck for real-world application. |
Fully AI-generated |
|
Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces a video-grounded large vision-language model (VLM) that performs 3D scene reasoning and grounding without any point-wise 3D annotations. Instead of relying on pretrained 3D detectors, Scene-R1 integrates reinforcement-learning-driven reasoning (R1-style) with a two-stage grounding pipeline, enabling transparent, interpretable, and annotation-efficient 3D reasoning.
The proposed method, Scene-R1, builds on Qwen2.5-VL-7B and is fine-tuned using GRPO. In Stage 1 (Temporal Grounding), the model reasons over video sequences to identify the most relevant temporal segment corresponding to a textual query. In Stage 2 (Image Grounding), it localizes the target object in selected frames by predicting 2D bounding boxes, accompanied by explicit chain-of-thought explanations. These 2D predictions are then lifted to 3D using depth maps and refined via a zero-shot segmentation module, producing accurate 3D localizations without any 3D supervision.
1. Annotation Efficiency: Scene-R1 achieves competitive 3D reasoning and grounding performance without relying on dense point-wise 3D annotations or pretrained 3D detectors, greatly reducing the data and labeling cost.
2. The authors conducted comprehensive experiments with various existing works, and shows good performance.
1. The method rewards properly formatted CoT and task success (IoU/EM), but does not verify that the CoT is faithful to the internal decision path [1,2].
2. While the proposed pipeline has not been widely applied in existing 3D LLMs, its design does not represent a substantial conceptual departure from established video-grounding or multi-stage reasoning frameworks. The contribution feels more like an adaptation of existing ideas to a new input modality rather than a fundamentally novel approach.
[1] Sarkar, Advait. "Large language models cannot explain themselves." arXiv preprint arXiv:2405.04382 (2024).
[2] Kambhampati, Subbarao, et al. "Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!." arXiv preprint arXiv:2504.09762 (2025).
Please address the weakness mentioned above. |
Lightly AI-edited |
|
Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a video-grounded LLM which does not use 3D instance annotations for training. Specifically, the input to the model is a video: the VLM is asked to predict the relevant frames and then ground the relevant object in this relevant portion of the video. For training these modules, GRPO losses are used. Next, these 2D predictions are lifted to 3D — for this, each predicted mask is tracked across frames using SAM-2 and then the resulting masks from all frames are fused in 3D. Next, another merging strategy from a prior work (SAI3D) is used to obtain a sharper mask. This becomes the prediction of the model. The paper compares its method with prior methods that utilize 3D supervision as well as methods that do not use 3D supervision. The paper claims better performance than methods that do not use 3D supervision. The ablations show that RL training and thinking help improve performance over supervised fine-tuning.
- The paper is well-written and easy to follow
- The premise of training models without 3D supervision is interesting; additionally exploring RL training for these models is interesting as well.
- A big claim of the paper is that their method does not use 3D annotations. However, I think that is not entirely true — in the “image grounding” task, the proposed model is trained with supervision on the mask prediction of the relevant object for each image in the video. This requires two kinds of supervision: a) “grounding” supervision, which tells the model which object it should be grounding, and b) “mask supervision” of that object across ALL video frames. These labels in ScanNet come from projecting the GT 3D segmentation masks to 2D. I will further argue that 3D mask annotations and 2D video masks are equivalent supervision for a posed RGB-D video, i.e., either of these can be obtained from the other one via 2D projection or 3D unprojection (see the note after this list). Hence, either of these supervisions is equally costly or inefficient, and the claim that this method trains without 3D annotations appears wrong to me.
- In the same vein, the comparisons in table-1 are potentially unfair:
- In the “free from 3D instance or annotation supervision” section where the proposed method groups itself, the other baselines like VLM-Grounder, OpenScene, and LERF do not use ANY supervision — neither grounding supervision nor mask supervision. The current method uses both of these ground-truth supervisions, as I argue in the first point.
- In the fully supervised methods section, the baselines are significantly old. The current SOTA is UniVLG (https://arxiv.org/abs/2503.10745), and the authors can check Table 1 of UniVLG for additional recent baselines.
- “This architecture uniquely enables end-to-end reasoning directly on video streams, bypassing the need for offline 3D scene reconstruction”: this statement is made in the introduction; however, I think Section 4.3, which lifts the 2D masks to 3D, uses the reconstructed point clouds, and so do all the evaluations that follow in Table 1.
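To spell out the equivalence argument from the first weakness point, in illustrative notation (not the paper's): with intrinsics $K$, camera-to-world pose $T_t$, depth $D_t$, and perspective projection $\pi$, the per-frame 2D masks $m_t$ and the 3D instance point set $\mathcal{S}$ determine each other (up to occlusion handling and discretization, and using homogeneous coordinates where needed):

$$
m_t(u,v)=\mathbb{1}\big[\exists X\in\mathcal{S}:\ \pi\!\big(K\,T_t^{-1}X\big)=(u,v)\big],
\qquad
\mathcal{S}\approx\bigcup_t\big\{\,T_t\,D_t(u,v)\,K^{-1}(u,v,1)^{\top}\ :\ m_t(u,v)=1\,\big\}.
$$

Hence, for posed RGB-D video, per-frame 2D mask supervision and 3D mask supervision carry essentially the same labeling cost.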
The main question in my review, as I explain in the weaknesses section, is that the claim of not using 3D annotations seems false and the comparisons with zero-shot methods seem unfair. Any clarification would help here. |
Fully human-written |
|
HTS-Adapt: A Hybrid Training Strategy with Adaptive Search Region Adjustment for MILPs |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper improves the predict-and-search framework for solving mixed integer linear programs (MILPs). It mainly addresses two issues, namely prediction accuracy and static search regions that might miss feasible solutions or be unnecessarily large. To address the first issue, it uses a hybrid training strategy that separates the treatment of stable and unstable variables. Stable variables have identical values across different solutions and are trained with a cross-entropy loss for precision; unstable variables are trained with a contrastive loss. This is a hybrid combination of two previous works (PaS and ConPaS). Secondly, it employs adaptive search region adjustment, expanding the search space only when infeasibility arises, guided by irreducible infeasible subsystems (IIS). Experimental results on four MILP benchmarks show that the proposed method is better than the baselines in terms of primal solution quality and primal gap, and also converges faster to good solutions.
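As a rough illustration of the hybrid idea described above (cross-entropy on stable variables, contrastive loss on unstable ones); the variable split, embeddings, similarity, and temperature here are my assumptions for the sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, stable_mask, stable_targets,
                unstable_emb, pos_emb, neg_emb, tau=0.1):
    """logits:         (n_vars,) predicted logits for binary variables
    stable_mask:    (n_vars,) bool, True where values agree across high-quality solutions
    stable_targets: (n_stable,) 0/1 labels for the stable variables
    unstable_emb:   (d,) embedding of the predicted unstable-variable assignment
    pos_emb:        (P, d) embeddings of high-quality (positive) solutions
    neg_emb:        (N, d) embeddings of low-quality / infeasible (negative) solutions
    """
    # cross-entropy (BCE) on variables that are consistent across high-quality solutions
    ce = F.binary_cross_entropy_with_logits(logits[stable_mask], stable_targets.float())

    # InfoNCE-style contrastive term for the remaining (unstable) variables
    z = F.normalize(unstable_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1) @ z / tau     # (P,) similarities to positives
    neg = F.normalize(neg_emb, dim=-1) @ z / tau     # (N,) similarities to negatives
    contrastive = torch.logsumexp(torch.cat([pos, neg]), 0) - torch.logsumexp(pos, 0)

    return ce + contrastive
```

Reporting the prediction precision on the stable set alone (versus PaS/ConPaS) would make the benefit of this split easier to judge, which is what the first weakness below asks for.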
1. The paper has two main contributions, one is to use the hybrid loss based on variable types for training, and the other is using adaptive search region adjustment.
2. Experimental results show that the new method outperforms several strong baseline such as PaS and ConPaS in terms of objective values and primal gaps.
1. The contributions are incremental. It is fine to be incremental if the method works well and good insights are provided, either theoretically or empirically. However, the paper mainly describes the method and shows that it works, without much explanation or analysis. For example, you could report the precision of your predictions for the stable/unstable variables compared to PaS and ConPaS.
2. The added value of each of the two contributions is not clear. I was looking for a more thorough ablation study (on more MILP problems) to understand the value of each component.
3. The clarity of Section 4.2 is poor, which hurts the overall clarity of the paper since this part is one of the main contributions. I was not able to understand how C1 and C2 are computed or how r_max is chosen. Can you provide an algorithmic description of line 11 in Algorithm 1?
1. In Table 4, Adapt-only has smaller values for the best k0, k1, and delta than HTS-only; how do you explain this? How are these parameters determined?
2. How sensitive is your method to parameter r_max? How do you determine its value? |
Fully human-written |
|
HTS-Adapt: A Hybrid Training Strategy with Adaptive Search Region Adjustment for MILPs |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes HTS-Adapt, an approach that integrates two innovative techniques to enhance the predict-and-search pipeline. For the prediction stage, a hybrid loss combining BCE and contrastive loss is introduced to better identify high-quality solutions. For the search stage, an IIS-based procedure advances the PaS framework by dynamically adjusting the trust region. Experimental results demonstrate the superiority of HTS-Adapt over both the standard PaS and contrastive PaS methods.
1. The hybrid training strategy is insightful. It narrows the scope of "bad" variables via contrastive loss, while directly applying BCE loss to "good" variables—which are less likely to cause infeasibility—is both efficient and effective.
2. The experimental design is sound. The inclusion of solution trajectory visualizations and ablation studies enables a more thorough analysis of the results.
Baselines: Although Section 2 provides a detailed review of L2O literature, the experimental comparisons are limited to the "PaS → ConPaS → HTS-PaS" pipeline. The authors should consider a broader set of baselines, including learning-based branch-and-cut and learning-based LNS methods.
Novelty: While the HTS component is well-designed, it essentially combines two existing techniques. The IIS component, on the other hand, appears more rooted in operations research than machine learning, and also seems to be an application of existing approaches. Thus, the novelty of the paper may be limited.
Clarity on IIS: The description of the IIS procedure is somewhat unclear, and I do not fully understand it. For example, in Line 4 of Algorithm 1, what does "fix" entail when $\Delta\neq 0$—hard fixing (as in LNS) or soft fixing (as in local branching)? Additionally, does the second constraint set $C_2$ contain at most one element (i.e., the trust region constraint)?
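For reference, the two standard readings of "fix" that this question distinguishes can be written as follows for binary variables (illustrative notation; $\bar{x}$ is the predicted partial solution over index set $B$): hard fixing imposes $x_i=\bar{x}_i$ for all $i\in B$, whereas soft fixing in the local-branching / trust-region sense imposes

$$
\sum_{i\in B:\ \bar{x}_i=1}(1-x_i)\;+\;\sum_{i\in B:\ \bar{x}_i=0}x_i\;\le\;\Delta,
$$

which collapses to hard fixing when $\Delta=0$; the ambiguity in Algorithm 1 is which of these is meant when $\Delta\neq 0$.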
1. There is an assumption that "some variables consistently exhibit identical values across high-quality solution sets." This makes sense, but to be more critical, what is the underlying intuition for this phenomenon? Furthermore, what would it imply if this assumption does not hold—would the HTS method reduce to contrastive learning, thereby invalidating the first contribution?
2. Algorithm 1 involves repeatedly solving MILPs within a 1000-second time limit. Since the total time limit in experiments is also set to 1000 seconds, does this imply that the repetition occurs only once?
3. In Figure 2, for MIS and MVC instances, the blue curve (HTS-Adapt) is significantly lower than others from the beginning, suggesting better initial prediction. However, for CA and IP instances, all methods perform comparably in the first 200 seconds, after which HTS-Adapt gradually outperforms the rest. How can this difference be explained? Should it be attributed to better prediction or more effective search? |
Lightly AI-edited |
|
HTS-Adapt: A Hybrid Training Strategy with Adaptive Search Region Adjustment for MILPs |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper targets two difficulties of the “predict-then-search” paradigm for Mixed-Integer Linear Programs: (i) learned predictions are often infeasible, and (ii) the search radius is fixed in advance. The authors propose HTS-Adapt, which contains: 1) Hybrid Training Strategy (HTS): supervised cross-entropy loss for “stable” variables that rarely change in near-optimal solutions, and contrastive InfoNCE loss for “volatile” variables; 2) Adaptive Search Region Adjustment (Adapt): whenever the predicted partial assignment is infeasible, an Irreducible Infeasible Subsystem (IIS) is computed to identify the culprit variables and their domains are enlarged selectively instead of naïvely expanding the whole trust region.
Experiments on four classic combinatorial benchmarks (MIS, MVC, CA, IP) show that coupling HTS-Adapt with SCIP reduces the average primal gap by more than 50 % compared with SCIP alone and with previous PaS/ConPaS baselines, while predicting a larger fraction of variables without degrading feasibility.
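To make the IIS component less abstract, here is a generic deletion-filter sketch that isolates one irreducible infeasible subsystem of linear constraints, using an LP feasibility check via SciPy purely for illustration; it is not the paper's (MILP-level) procedure and assumes the infeasibility arises from the listed constraints rather than the bounds alone.

```python
import numpy as np
from scipy.optimize import linprog

def feasible(A_ub, b_ub, bounds):
    """LP feasibility check for A_ub @ x <= b_ub within the variable bounds."""
    res = linprog(c=np.zeros(A_ub.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=bounds, method="highs")
    return res.success

def deletion_filter_iis(A_ub, b_ub, bounds):
    """Return row indices of one irreducible infeasible subsystem (IIS)."""
    assert not feasible(A_ub, b_ub, bounds), "system must be infeasible"
    keep = list(range(A_ub.shape[0]))
    for i in list(keep):
        trial = [j for j in keep if j != i]
        if trial and not feasible(A_ub[trial], b_ub[trial], bounds):
            keep = trial                 # constraint i is not needed for infeasibility
    return keep                          # the remaining rows form an IIS

# toy example: x <= 1 and x >= 2 cannot both hold; x <= 5 is irrelevant
A = np.array([[1.0], [-1.0], [1.0]])
b = np.array([1.0, -2.0, 5.0])
print(deletion_filter_iis(A, b, bounds=[(0, None)]))   # -> [0, 1]
```

In the paper's setting, the variables appearing in such a subsystem would be the natural candidates for selective domain enlargement.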
1) Clearly identifies the feasibility and static-radius issues of prior PaS methods.
2) Novel combination of cross-entropy and contrastive learning tailored to variable behaviour, together with targeted expansion of the search region via IIS.
3) Comprehensive empirical evaluation across four datasets; code and hyper-parameters are promised to be released.
1) Only SCIP is used as the back-end solver; no comparison with Gurobi or CPLEX to demonstrate solver-agnostic benefits.
2) Contrastive component uses plain InfoNCE; more advanced graph-contrastive losses are not explored.
3) No runtime breakdown (prediction / IIS / solver) is given, so the computational overhead of Adapt is unclear.
4) Missing theoretical analysis, e.g., worst-case number of IIS calls needed to regain feasibility.
1) How much latency does the IIS computation introduce at each node, and is there a lightweight approximate IIS routine for large instances?
2) The threshold for “stable” variables is empirical—does it remain valid across problem types, and could it be learned automatically?
3) When scaling to 10^6 variables, can the GNN still fit in GPU memory, and does the IIS routine remain tractable? |
Fully AI-generated |
|
HTS-Adapt: A Hybrid Training Strategy with Adaptive Search Region Adjustment for MILPs |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
To efficiently solve mixed integer linear programming (MILP) problems, machine learning techniques are used to predict partial solutions, but these often suffer from inaccurate and infeasible predictions. Therefore, this work extends the Contrastive Predict-and-Search (ConPaS) framework in two ways. First, it introduces a Hybrid Training Strategy (HTS) to achieve more accurate predictions. Second, it proposes an Adaptive Search Region Adjustment mechanism (Adapt) to ensure feasibility and reduce computational overhead.
The experiments are sufficient, and the proposed method performs well.
1. The formatting needs improvement. For example, many citations are missing brackets, which makes the paper difficult to read. The plots in Figures 2 and 3 are too small, while the legends are too large. In addition, the tables should be resized, and the placement of Table 1 is odd—it is introduced in Section 5 but appears at the beginning of Section 4.
2. The preliminaries are insufficient. The work is largely based on ConPaS, but the paper only introduces PaS; the introduction and explanation of ConPaS are lacking.
3. There are several typos. For example, the transpose symbol in Equations (1) and (3). Also, should the c in Equation (2) be bolded? It is inconsistent with Equations (1) and (3).
1. Why were only MIS and MVC chosen to evaluate generalization?
2. The experiments use SCIP and Gurobi. Given that CPLEX is also a popular MILP solver, why wasn’t CPLEX included? |
Lightly AI-edited |
|
Tele-Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper presents a framework for dexterous hand teleoperation in dynamic object catching. The proposed system comprises four main stages: (1) training a reinforcement learning (RL) policy for skill acquisition and data collection, (2) collecting state–action–point cloud trajectories, (3) training a Diffusion Policy with Unsupervised 3D Representation (DP-U3R), and (4) teleoperation using a Dynamics-Aware Adaptive Integration Mechanism (DAIM). The DP-U3R leverages point-cloud observations to augment state representations for both action generation and geometric reconstruction, while the DAIM adaptively fuses teleoperation signals with policy-generated actions based on object dynamics and diffusion steps. The framework is evaluated on dynamic object-catching tasks, where a real-world teleoperation glove controls a simulated dexterous hand in Isaac Gym.
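A hypothetical sketch of the kind of adaptive integration the summary (and the sigmoid/cosine question below) refers to; the functional forms, thresholds, and coefficients are guesses for illustration, not the paper's DAIM.

```python
import numpy as np

def blend_coefficient(obj_speed, k, K, v0=1.0, beta=4.0):
    """Weight on the autonomous (diffusion-policy) action.

    obj_speed: estimated object speed (the dynamics term)
    k, K:      current denoising step and total number of denoising steps
    v0, beta:  assumed speed threshold and sharpness
    """
    dyn = 1.0 / (1.0 + np.exp(-beta * (obj_speed - v0)))  # sigmoid: faster object -> trust policy more
    sched = 0.5 * (1.0 - np.cos(np.pi * (k + 1) / K))     # cosine ramp over the denoising steps
    return float(np.clip(dyn * sched, 0.0, 1.0))

def integrate(a_policy, a_teleop, obj_speed, k, K):
    """Convex combination of policy-generated and teleoperated hand actions."""
    alpha = blend_coefficient(obj_speed, k, K)
    return alpha * np.asarray(a_policy) + (1.0 - alpha) * np.asarray(a_teleop)
```

Plotting `blend_coefficient` over a trajectory (as requested in the questions below) would make the hand-off between user and policy much easier to interpret.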
Although the paper considers a challenging and practically relevant problem in robotic teleoperation, its methodological novelty appears limited. The core contribution, DAIM, resembles a hand-crafted engineering heuristic rather than a principled formulation, and its design choices are neither thoroughly justified nor supported by sufficient ablation analysis. In particular, the blending mechanism between diffusion outputs and teleoperation signals seems overly simplistic. While the problem setting holds potential practical value, the current presentation does not meet the level of conceptual innovation typically expected for full acceptance.
- Problem Domain: The primary strength of this paper is its focus on a challenging and underexplored problem in robotics. (dynamic object interaction via teleoperation)
- Limited Methodological Novelty: The main concern with this paper is its limited novelty. The core framework is largely an integration of existing, well-established components (e.g., PPO for data collection, Diffusion Policy as the autonomous controller). The central contribution appears to be the DAIM, which arbitrates between human and policy actions. However, this mechanism itself feels more like a specific, hand-tuned engineering solution rather than a fundamental or generalizable new technique.
- Insufficient Rationale for Design Choices: The paper does not include a justification for the specific design of the DAIM (Section 4.6). There is no ablative study or theoretical reasoning provided to explain why the particular functions are optimal or even well-suited for this integration.
- Could you augment Figure 5 with a plot showing the value of the integration coefficient over time? This would provide a better understanding of how the system works between the user and the policy.
- Rationale for Functions: As mentioned in the weaknesses, the choice of sigmoid and cosine functions in Section 4.6 is not well-justified. Could you elaborate on the specific properties of these functions that led to their selection?
- The coefficients used in the dynamic factor seem hard-coded. It seems that the optimal values for these coefficients would depend on the properties of the object being caught (e.g., its shape, mass, or velocity). Could you provide an analysis of the system's sensitivity to these coefficients? |
Heavily AI-edited |
|
Tele-Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper presents a semi-autonomous approach for catching a flying object via teleoperation. The core idea is to train a catching policy using reinforcement learning in simulation, which is later guided by human gestures during deployment. The proposed method is validated using a static robotic hand in simulation.
The problem addressed in this work is quite challenging. Catching a flying object with a fixed dexterous hand requires extremely precise timing and coordination, and the paper’s attempt to tackle this problem is commendable.
The main weakness of this paper is that the proposed problem could likely be solved with a fully automated policy, without requiring human guidance. In fact, similar catching problems have been addressed in earlier works [1]. Furthermore, the absence of real-world experiments significantly limits the paper’s practical significance and overall impact.
[1] A Dynamical System Approach for Softly Catching a Flying Object: Theory and Experiment. TRO 2016.
N/A. I would like the authors to discuss their motivation in more depth.
I believe one way to strengthen this paper is to show real-world results on catching complex, irregular objects where prior methods fail. |
Lightly AI-edited |
|
Tele-Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a dynamic teleoperation framework that first trains an RL policy in simulation, collects data, and then trains a diffusion policy to assist dynamic teleoperation. An unsupervised 3D representation loss is incorporated during training. Experiments conducted in simulation demonstrate the effectiveness of the proposed learning approach.
1. The direction of dynamic catching is interesting and under-explored.
1. The main pipeline—training an RL policy in simulation, collecting data, and training a policy that takes glove signals as input—is largely similar to Dexterity Gen. Therefore, Dexterity Gen should be considered the baseline rather than simple teleoperation.
2. The MSE metric does not provide meaningful insight when training the diffusion policy; the success rate should be included in the ablation study to better evaluate performance.
3. The paper only presents simulation experiments without any real-world validation. Given the dynamic catching setting, the sim-to-real gap can be substantial. As a result, the reported improvements may not transfer effectively to real-world scenarios, which limits the demonstrated effectiveness of the proposed method.
4. The overall technical novelty of the paper is not very strong. It resembles a system paper that builds upon Dexterity Gen with additional designs for dynamic catching and unsupervised point cloud learning. The work may be more suitable for a robotics-focused venue rather than ICLR.
1. The current results show that combining teleoperation with a pretrained policy performs better than teleoperation alone. However, it is unclear how an autonomous policy—without any glove input—would perform. What would be the success rate of such a policy? If the autonomous policy achieves a higher success rate than teleoperation combined with the pretrained policy, what is the significance of integrating teleoperation?
2. Regarding DAIM, as I understand it, the lowercase k represents the environment step, while the uppercase K denotes the denoising horizon (e.g., denoising for 10 steps to obtain the final result). I am confused about how these two are combined and used in practice. |
Lightly AI-edited |
|
Tele-Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Tele-Catch, a framework for dynamic object catching using dexterous hands, a challenging and underexplored task compared to static manipulation. Its core contribution is a shared autonomy system that seamlessly fuses human teleoperation with an autonomous diffusion policy. This is enabled by a novel dynamics-aware mechanism (DAIM) that adaptively integrates human input, and a policy (DP-U3R) that leverages geometric point cloud representations for robustness. The key advantages are its successful tackling of the dynamic catching problem, the effective human-robot synergy it creates, and its demonstrated generalization across different hand embodiments and unseen objects.
1. The topic investigated in this paper is highly interesting and challenging.
2. The proposed method demonstrates significant insight by effectively combining the strengths of teleoperation and existing learning-based approaches.
3. The paper provides comprehensive experimental validation to verify the effectiveness of the method and its design.
1. Why can h_t be directly incorporated into the diffusion policy's denoising process? Could this potentially disrupt denoising, as the training data might not have encountered such a conditioned input?
2. Some information, such as the object's linear and angular velocity, is difficult to obtain in real-world scenarios. Are there any practical solutions and corresponding experiments to address this limitation?
3. How were the dynamic objects configured in the simulation? Please specify the parameter ranges used for their linear and angular velocities.
4. Comparing action noise error is not very meaningful. A more critical metric is the task success rate, particularly in comparison to baselines like DP and DP3.
5. The cross-embodiment experiments are insufficient. More extensive validation is needed.
6. The number of test objects is too limited and should be expanded.
7. The supplementary material only shows results for the teapot. More visualizations are required to effectively demonstrate the validity of the actions.
I will consider raising the score if the authors' rebuttal addresses the above concerns.
please see the weaknesses |
Lightly AI-edited |
|
MuEdit: A Lightweight yet Effective Multi-task Model Editing Method |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes MuEdit, a lightweight and effective method for multi-task model editing. The authors argue that existing model editing approaches suffer from strong interference when updating multiple tasks simultaneously. To address this, they introduce a novel metric called the Conflict Index to quantify conflicts between task-specific editing objectives.
Based on this metric, they design two strategies, Optimal Editing Order Selection and Conflict-Guided Low-Rank Matrix Approximation, to address this problem.
Extensive experiments on multiple benchmarks and two models (Llama3-8B and GPT2-XL) demonstrate that MuEdit outperforms state-of-the-art methods such as ROME, MEMIT, AlphaEdit, and AnyEdit, while maintaining strong general-domain capabilities.
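To ground the discussion (and the scalability question below), here is one plausible way to operationalize a null-space conflict measure and a greedy edit order; this is my reconstruction for illustration, not the paper's definition.

```python
import numpy as np
from scipy.linalg import null_space, subspace_angles

def conflict_index(K_a, K_b):
    """Conflict between two tasks' key matrices, measured on their left null spaces.

    K_a, K_b: (d, n_a) and (d, n_b) key/knowledge matrices in the same hidden space.
    Returns a value in [0, 1]; larger means less shared null space for interference-free edits.
    """
    Na, Nb = null_space(K_a.T), null_space(K_b.T)   # bases of the left null spaces
    if Na.size == 0 or Nb.size == 0:
        return 1.0
    angles = subspace_angles(Na, Nb)
    return 1.0 - float(np.mean(np.cos(angles)))     # 0 if the null spaces coincide

def greedy_order(K_list):
    """Greedy alternative to the O(N!) search: repeatedly pick the task with the
    smallest total conflict against the tasks that are not yet edited."""
    remaining, order = list(range(len(K_list))), []
    while remaining:
        nxt = min(remaining, key=lambda i: sum(conflict_index(K_list[i], K_list[j])
                                               for j in remaining if j != i))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Precomputing the pairwise conflicts once costs O(N^2) null-space computations, which is exactly where the scalability question for more than 10 tasks becomes concrete.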
1. Novel problem formulation – The paper is the first to explicitly define and analyze multi-task model editing from a null-space conflict perspective.
2. The theoretical foundation based on linear algebra (null-space and rank analysis) is sound and logically consistent, which is an interpretable approach.
3. The paper covers five heterogeneous tasks and includes ablation studies, sensitivity analysis, and significance testing (p < 0.05). MuEdit achieves substantial improvements in multi-task editing and maintains general-domain abilities better than all baselines.
1. Although the Conflict Index is an interesting idea, it is heuristic and lacks a rigorous theoretical connection to optimization conflicts (e.g., gradient interference or Fisher information).
2. The “optimal editing order” involves a factorial search over tasks (O(N!)); the paper does not clarify how this is handled in practice.
3. The method assumes all tasks are known beforehand; it is unclear how Mu-Edit performs when new tasks arrive incrementally.
4. Results are shown on GPT2-XL and Llama3-8B; it remains uncertain whether the conclusions hold for larger models like Llama3-70B.
1. How scalable is the Conflict Index computation and order search when the number of tasks exceeds 10?
2. Does low-rank approximation reduce the model’s knowledge capacity, potentially leading to long-term forgetting? |
Fully AI-generated |
|
MuEdit: A Lightweight yet Effective Multi-task Model Editing Method |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose a novel concept termed the Conflict Index, which quantifies the degree of conflict between the editing objectives of two tasks. Building on this idea, they introduce a method that integrates two key strategies: 1) optimal edit path identification; 2) a low-rank matrix approximation method based on the conflict index to expand the null-space dimension.
- The author provides a clear formulation of the multi-task editing problem and introduces the Conflict Index, an insightful and valuable concept.
- The idea of leveraging the common null space and employing low-rank matrix decomposition to mitigate task conflicts is both inspiring and technically interesting.
- The paper lacks the analysis and experiments needed to support its main idea.
(1) Beyond the teaser figure, no further experiments support the key observation of this paper (Sec. 3.2), namely that during sequential multi-task editing the new knowledge matrix $K_n$ compresses the null space of $K_{n-1}$.
(2) In Sec. 4.1 and the appendix, the main experiments are still conducted on Llama3-8B and GPT2-XL, which is a fairly old combination. The authors should add experiments on more recent LLMs such as Qwen2.5.
- The proposed method mainly addresses the sequential editing scenario, which corresponds to lifelong model editing in practice. However, the Best Order concept introduced in Sec. 3.3.1 is not realistic in real-world applications, as the future knowledge to be edited is inherently unpredictable. If multiple pieces of knowledge are already available as a batch, conventional fine-tuning would be a more appropriate choice. This, however, contradicts the core motivation of knowledge editing, which is to enable efficient and localized updates for small pieces of knowledge at a time.
See above |
Lightly AI-edited |
|
MuEdit: A Lightweight yet Effective Multi-task Model Editing Method |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper points out that existing knowledge editing methods cannot effectively handle multi-task editing, and conflicts exist between different tasks, which affects editing performance. This paper proposes a multi-task editing framework that uses two complementary strategies to resolve multi-task conflicts.
1. This paper is well-written, with clear logic and concise readability.
2. It conducts extensive experiments, including comparative experiments on models of different types and scales.
3. The authors provide a wide range of comprehensive evaluation metrics.
1. First, the paper points out that Mu-Edit relies on low-rank decomposition to achieve editing across multiple tasks, but lacks comparative results with full-model fine-tuning and LoRA.
2. Second, the motivation is insufficient. Knowledge editing is often used for low-cost knowledge updates. It remains unclear whether the setting of performing knowledge updates across multiple tasks is reasonable, and what advantages this setting offers over full-model fine-tuning or LoRA.
3. Finally, previous works (such as D4S [1] and AlphaEdit [2]) have addressed the issue of model performance degradation after editing multiple samples. Based on the existing experimental results, Mu-Edit fails to demonstrate this capability, which casts doubt on its practical application.
**References**
[1] Reasons and Solutions for the Decline in Model Performance after Editing
[2] Alphaedit: Null-space constrained knowledge editing for language models
See Weaknesses. |
Moderately AI-edited |
|
MuEdit: A Lightweight yet Effective Multi-task Model Editing Method |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper zeroes in on a pretty practical problem in model editing: how to update a model for multiple tasks at once without everything falling apart. The authors argue, pretty convincingly, that the interference comes from conflicting editing objectives. Their big idea is a "Conflict Index," a new metric to quantify how much two tasks' null-spaces clash. Based on this, they propose Mu-Edit, which is a two-part strategy. First, it figures out the best sequence to apply the edits to minimize total conflict. Second, if the clash is still too severe, it actively expands the common null-space by running a low-rank approximation (SVD) on the knowledge matrix of the most "conflicting" task. The experiments on a few multi-task benchmarks seem to back this up, showing it preserves performance better than existing methods.
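For intuition about the SVD step described in the summary, here is a tiny sketch of how truncating the rank of a task's knowledge matrix enlarges its (left) null space; the shapes and the chosen rank are arbitrary for illustration.

```python
import numpy as np

def truncate_rank(K, r):
    """Best rank-r approximation of K (d x n) via SVD."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

rng = np.random.default_rng(0)
K = rng.standard_normal((64, 64))        # d = 64, generically full rank
K_low = truncate_rank(K, r=40)

for name, M in [("original", K), ("rank-40", K_low)]:
    rank = np.linalg.matrix_rank(M, tol=1e-8)
    print(name, "rank:", rank, "left null-space dim:", M.shape[0] - rank)
# cutting the rank from 64 to 40 frees 24 extra null-space directions that a
# later, conflicting edit can use without disturbing this task
```

The open question (weakness 2 below) is what is lost by this truncation when it is applied repeatedly across many tasks.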
1. The paper addresses an important and under-explored problem of multi-task model editing, which is more realistic than sequential single-task editing.
2. The introduction of the Conflict Index provides a quantitative way to measure and analyze conflicts between different editing tasks.
3. The proposed optimization strategies (optimal editing path and low-rank approximation) are well-motivated and appear to effectively address the multi-task conflict problem.
4. The method demonstrates strong empirical performance across multiple tasks while maintaining general model capabilities.
1. The $O(N!)$ complexity for finding the best edit order is a major scalability problem. The practical greedy solution is hidden in the appendix.
2. SVD is a blunt tool. The long-term, cumulative impact of repeatedly cutting rank on multiple tasks isn't really explored.
3. The method seems fragile. The worst-case example (43.7% reduction) is dangerously close to the 45% failure point, suggesting it could easily break.
4. The reliance on a large, static K matrix for each task feels brittle and may not handle evolving tasks or unseen knowledge well.
1. The $O(N!)$ order search is impractical. Is the greedy algorithm from the appendix the intended method? What about other ordering heuristics?
2. Regarding the SVD, your worst-case (43.7% reduction) is right at the 45% performance cliff. What happens when a task pair requires a 50% reduction? Does the method just fail?
3. Also, why did performance get worse in Table 9 when optimizing over 4 or 5 tasks instead of 3? This seems counter-intuitive and suggests a potential unaddressed issue. |
Heavily AI-edited |
|
SortedRL: Accelerating RL Training for LLMs through Online Length-aware Scheduling |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a new RL post-training framework for language models, which introduces an online length-aware scheduling strategy that sorts rollout samples by length, prioritizing shorter generations. The method also includes a mechanism to manage partial trajectories, which can operate in a fully on-policy mode (partial trajectories are discarded) or a partial mode (partial trajectories are retained and resumed by the updated policy). The authors then conduct experiments on logical reasoning and math benchmarks and show improved training efficiency, with relatively neutral performance gains.
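Since the weaknesses below ask for pseudocode, here is my rough reconstruction of the scheduling loop from the summary above; `policy.generate` / `policy.update`, the `.tokens` / `.prompt` fields, the oversubscription factor, and the handling of leftovers are all assumed interfaces and details, not the paper's code.

```python
from collections import deque

def sorted_rl_round(policy, prompts, rollout_bs, update_bs, partial_mode=False):
    """One oversubscribed rollout batch, consumed shortest-first in update-sized chunks."""
    # 1) launch far more generations than a single update needs (rollout_bs >> update_bs)
    rollouts = [policy.generate(p) for p in prompts[:rollout_bs]]

    # 2) consume rollouts shortest-first, emulating "short sequences finish first"
    queue = deque(sorted(rollouts, key=lambda r: len(r.tokens)))

    # 3) update as soon as an update batch is full
    while len(queue) >= update_bs:
        batch = [queue.popleft() for _ in range(update_bs)]
        policy.update(batch)
        if not partial_mode:
            # fully on-policy: regenerate the still-pending prompts with the updated policy
            queue = deque(sorted((policy.generate(r.prompt) for r in queue),
                                 key=lambda r: len(r.tokens)))
        # partial mode would instead keep the cached prefixes/logits and let the
        # updated policy resume them, at the cost of some off-policyness
    return list(queue)   # leftovers (cached partial trajectories in partial mode)
```

Whether this matches the actual controller is exactly what a formal algorithm block in the paper should settle.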
- Significance: the paper addresses an important bottleneck in scaling RL training for reasoning models.
- Methodology: the proposed method, on a high level, is reasonable and easy-to-understand.
- Clarity: The method is described almost entirely through high-level qualitative descriptions. There are no formal algorithm blocks or pseudocode to define critical components like the "oversubscription strategy," "early termination" logic, or exactly how the "length-aware controller" manages the queue. This makes the mechanism ambiguous. Please provide some formal algorithm blocks in the paper to help people understand in details. Example: [[Phuong and Hutter, 2024](https://arxiv.org/pdf/2207.09238)].
- Reproducibility Issues: As noted, without precise algorithmic descriptions or provided code, it is unclear how practitioners can re-implement the proposed method. Please consider open-source the code or provide training details as clear as possible.
- Insufficient analysis of non-i.i.d. updates: The core idea relies on sorting data by length, which introduces a strong bias (non-i.i.d. batches) into the gradient updates (the "micro-curriculum"). The paper lacks a rigorous analysis or ablation study on how this specific bias affects the convergence, limiting the scalability of the proposed method.
Please see the weakness section. I am happy to raise my score if the **reproducibility** concerns are substantially addressed. A method that cannot be replicated by the community cannot provide the reliable knowledge necessary for conference acceptance. |
Fully human-written |
|
SortedRL: Accelerating RL Training for LLMs through Online Length-aware Scheduling |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a scheduling strategy in batch RL training systems to improve training efficiency. The core strategy is to start a large batch for generation and consume the batches according to the order of lengths. Experiments are taken on mathematical reasoning and logic domains. Experiment results show that the proposed strategy improves training speed of vanilla RL training.
1. The paper is well motivated as the bubble issue is a well-known issue in RLVR training.
1. The description of the central components of the proposed strategy is unclear, especially in Sec. 3.1.
2. There is a lack of direct comparison with existing speedup techniques in RLVR training, including the one-step-off RL training used in the DeepCoder project [1] and the fully asynchronous RL training in AReaL [2]. A comparison of the speedup against existing asynchronous training approaches is expected.
3. The proposed strategy does not seem robust to its design choices, as evidenced by the collapse of the partial mode in the logic RL experiment and the failure when using a large rollout batch size in the ablation study (Sec. 4.4.3).
[1] Luo, M., Tan, S., Huang, R., Patel, A., Ariyak, A., Wu, Q., ... & Stoica, I. (2025). Deepcoder: A fully open-source 14b coder at o3-mini level. Notion Blog.
[2] Fu, W., Gao, J., Shen, X., Zhu, C., Mei, Z., He, C., ... & Wu, Y. (2025). AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. arXiv preprint arXiv:2505.24298.
1. How does the proposed approach compare with asynchronous RL training approaches such as AReaL and one-step-off used in DeepCoder project in terms of training speed?
2. Why does the partial mode fail in the logic RL experiment in Fig. 3? Considering the failure of the partial mode, why not consider using the decoupled PPO objective in AReaL?
3. Could you provide experiment results showing the scaling property of the proposed strategy? That is, how would the training throughput change with different number of GPUs? |
Fully human-written |
|
SortedRL: Accelerating RL Training for LLMs through Online Length-aware Scheduling |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper develops an online length-aware scheduling policy called SortedRL to improve rollout efficiency and maintain stability during training. It improves RL efficiency by sorting rollout samples by length for policy updates, thus directly tackling the rollout bottleneck in traditional RL algorithms.
Key components of SortedRL:
1. Online length-aware scheduling: The controller sorts the incoming rollouts by length and updates the policy through early termination once the update batch size has been reached. This leads to shorter sequences being prioritized, and over the course of a full rollout batch a micro-curriculum is formed.
2. Controlled off-policyness: SortedRL supports two modes of operation: fully on-policy and partially on-policy. In the fully on-policy mode, fresh rollouts are generated for the cached prompts after each policy update. In the partial on-policy setting, it caches the tokens and logits of incomplete rollouts and continues them after the policy update.
SortedRL results in an improved bubble ratio (74% in baseline to 6% in SortedRL) and attains superior performance over the baselines.
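As a concrete reading of the partial mode in point 2 above, here is a minimal sketch of the state a rollout buffer would need to cache so that an interrupted generation can be resumed under the updated policy; all field names are my assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PartialRollout:
    prompt_id: int
    prompt_tokens: List[int]
    generated_tokens: List[int] = field(default_factory=list)    # tokens emitted so far
    behavior_logprobs: List[float] = field(default_factory=list) # log-probs (or logits) of the
                                                                 # policy versions that produced them
    policy_version: int = 0      # needed to reason about off-policyness at update time
    finished: bool = False

    def resume(self, new_tokens, new_logprobs, new_version):
        """Continue the trajectory under an updated policy (partial on-policy mode)."""
        self.generated_tokens += new_tokens
        self.behavior_logprobs += new_logprobs
        self.policy_version = new_version
```

Tracking `policy_version` per segment is also what would be needed to diagnose the partial-mode failure discussed in the weaknesses.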
- SortedRL is a novel framework designed to alleviate the significant rollout bottleneck in RL and address the instability introduced by off-policy updates that come with large rollout batches.
- This system of sorting rollouts by output length for updates is intuitive and improves both hardware efficiency (lower bubble ratio) and sample efficiency (improved performance at earlier steps) through a higher degree of on-policyness.
- The paper includes significant quantitative results to show the effectiveness of SortedRL on logic and math tasks and also provides an insight into the changes training dynamics. The ablations on grouped rollouts, fully on-policyness and groups size do a good job of furthering our understanding of this technique.
- The paper is for the most part clearly written and well-presented.
- A significant implicit assumption in the paper is that longer rollouts == harder prompts which is what enables the micro-curriculum. This largely holds true for math and reasoning tasks where longer rollouts mean longer, richer reasoning chains. However for other tasks like summarization, general instruction following, safety alignment etc. this is not necessarily true. The effectiveness of SortedRL on such tasks is unclear.
- The paper would benefit from a deeper analysis of why SortedRL in the partial on-policy setting catastrophically failed so early when training on LogicRL. This limits the usefulness of the partial on-policy mode and makes it difficult to use. Analyzing when the partial on-policy mode should or should not be used would help with understanding and usability.
- The new hyperparameter (group size) introduced in this work significantly impacts test time accuracy. The included ablation shows that larger group sizes result in imbalanced updates. However, it is not clear if or how the optimal group size changes with the training set. More analysis on group size and potentially some intuition or heuristic to determine an optimal group size would be welcome.
The AIME24 numbers in lines 367-368 don't match those in Table 1. I am assuming the table is the source of truth but it would be good to correct it in the text. |
Fully human-written |
|
SortedRL: Accelerating RL Training for LLMs through Online Length-aware Scheduling |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents SortedRL, an online length-aware scheduling strategy designed to address the efficiency bottleneck in reinforcement learning training for large language models. The core innovation is reordering rollout samples based on output lengths, prioritizing shorter samples for early updates while enabling large rollout batches and flexible update batches simultaneously. The method includes controllable off-policy training mechanisms and a co-designed RL infrastructure with a stateful controller and rollout buffer. Experiments on `LLaMA-3.1-8B` and `Qwen-2.5-32B` on logical reasoning and math reasoning tasks demonstrate significant reductions in bubble ratios (from 74% to below 6%).
- **Important problem:** Addresses a real bottleneck in RL training for LLMs, where rollout can consume up to 70% of training time for long sequences (16k tokens).
- **Comprehensive system design:** Goes beyond algorithmic contribution to include infrastructure components (length-aware controller, stateful buffer) that enable the approach
- **Thorough ablation studies:** Figure 6 provides useful ablations showing the importance of grouped rollout and on-policy vs. post-hoc sorting
### Major Concerns
1. **Missing wall-clock time comparisons:** While Figures 3 & 4 show results vs. update steps, and Figure 5 shows throughput, the paper lacks end-to-end wall-clock time comparisons for the full training runs. Given that SortedRL introduces additional complexity (sorting, buffer management), it's crucial to see actual training time on the x-axis to understand real-world gains. The throughput gains in Figure 5 don't directly translate to total training time savings.
2. **Model selection concerns:** The choice of `Qwen-2.5-32B` for math experiments is problematic given documented data contamination issues for this model on math benchmarks (see https://arxiv.org/abs/2506.10947, https://arxiv.org/abs/2507.10532v1).
The paper only tests 2 base models (`LLaMA-3.1-8B`, `Qwen-2.5-32B`). Given that prompt/rollout length distributions can vary significantly across different models and tasks, more diverse model testing is needed to validate generalizability.
For `Qwen-2.5-7B`, the authors note it shows limited test-time scaling (Fig 9b), but don't explore other 7B-scale alternatives.
3. **Tightly coupled design:** The ablation in Figure 6(a) shows that the baseline actually outperforms individual components (Post-hoc, w/o group), suggesting all components must work together. This raises concerns about (1) sensitivity to hyperparameters, (2) difficulty in adapting the method to different scenarios, and (3) whether the gains come from the scheduling strategy or from the specific combination of techniques.
4. **Limited scope of experiments:** Only two task types (logical puzzles, math problems) are evaluated. Both tasks have rule-based evaluation; it is unclear whether the benefits transfer to tasks where generating a verdict is much more time-consuming, such as LLM-as-a-judge.
### Technical Issues
5. **Grouped rollout details unclear:**
1. "How many prompts will a batch include?" is not explicitly stated for all settings
2. "How many batches might a single prompt's response be scattered into?" - this is critical for understanding off-policiness but not clearly explained
3. The cache-aware loading policy ("no new prompts are loaded from the dataloader until all cached prompts have been consumed") needs more detail: what is the maximum queuing time, or the maximum number of iterations a prompt can wait?
6. **Prompt starvation mitigation:** Section 3.2 mentions preventing "starvation of certain prompts" by scavenging, but what if a prompt consistently generates very long responses (e.g., hard problems)? Its response would be scattered across many segments from different policy versions. The paper doesn't analyze this worst-case scenario or provide empirical data on how often it occurs and how much the resulting extreme off-policiness hurts performance.
7. **Batch normalization impact:** The paper claims selective batching is "particularly important for algorithms such as Reinforce++, where batch-wise normalization can substantially impact training outcomes" (line 242-243), but doesn't quantify or demonstrate this impact. What specifically makes the normalization sensitive to batch composition? A toy numerical illustration of what I mean is given after this list.
8. **Off-policy partial mode concerns:** Figure 3(b) shows a drastic, unstable explosion in response length for partial mode that leads to "unrecoverable degradation". The paper doesn't adequately explain why this happens or whether re-calculating importance sampling ratios might mitigate this issue.
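To make point 7 concrete, here is a toy, purely illustrative sketch (the function name and reward values are mine, not from the paper) of how Reinforce++-style batch-wise normalization assigns very different advantages to the same sample depending on which other samples land in the update batch:

```python
import torch

def batch_normalized_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Batch-wise reward normalization in the style of Reinforce++:
    mean and std are computed over whatever ends up in the update batch."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# The same sample (reward 0.2) placed into two differently composed batches.
sample = torch.tensor([0.2])
high_reward_batch = torch.cat([torch.tensor([1.0, 0.9, 1.0, 0.8]), sample])
low_reward_batch = torch.cat([torch.tensor([0.3, 0.1, 0.4, 0.2]), sample])
print(batch_normalized_advantages(high_reward_batch)[-1])  # about -1.7: strongly penalized
print(batch_normalized_advantages(low_reward_batch)[-1])   # about -0.35: nearly neutral
```

If length-sorted scheduling systematically groups samples with correlated lengths (and therefore correlated rewards), this is the mechanism through which batch composition could shift training outcomes; quantifying it would substantiate the paper's claim.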
### Minor Issues
- **Notation and clarity issues:** Equations (2) and (3): Many symbols lack explanation. Readers unfamiliar with PPO/Reinforce++ will struggle to understand these. The relationship between rollout batch size, update batch size, and group size could be explained more clearly upfront.
- **Possible typo in Section 4.3:** The text states "baseline, on-policy and partial SortedRL achieved 23.33%, 20.83%, and 19.69% accuracy" but then claims this follows "decreasing order of off-policiness (on-policy - partial SortedRL - baseline)". This seems backwards: if on-policy is the least off-policy setting, it should have the best performance, but the numbers show baseline (23.33%) > on-policy (20.83%) > partial (19.69%). Please clarify.
Please address my concerns in the above section. I don't have other questions. |
Fully AI-generated |
|
SCOUT: Spatial-Aware Continual Scene Understanding and Switch Policy for Embodied Mobile Manipulation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The manuscript pursues improvements in 3D spatial awareness for mobile manipulation tasks in the ALFRED simulated environments. The manuscript also claims novelty for a proposed dual-planning approach, implemented as what is referred to as a 'Switch Policy', which enables a short-term planner to interrupt the task execution of a long-term planner if a more immediate goal becomes available.
- The paper is well-written and well-organized.
- The paper provides a good number of experiments, enabling meaningful discussion and allowing insights to be drawn.
- The paper considers a compelling task in embodied AI for mobile manipulation.
- L16-17: The manuscript lists, “Spatial-Aware Continual Scene Understanding with a Scene Modeling Module for effective scene modeling…”. This statement is inherently ambiguous without context and doesn't really add much. Please rephrase.
- Section 3.2: I have some concern that the proposed approach — in particular, the Switch Policy — is tailored to the ALFRED environment. I would feel much more confident if the benefits of the proposed approach could be also exemplified in another simulator/dataset or in the real world.
- Section 3.2: I want to explore why the Switch Policy is necessary. Alternative planner designs are emerging in which a reasoning agent leverages an adaptive contextual representation (map, local scene graph, keyframe history) and a balanced (re-)planning frequency; together, these may provide sufficiently adaptive behavior in a single planner, rather than the two-planner design.
- L119-124: ‘Neural policy’ is not the most satisfactory dimension of comparison between the proposed method and the related work; perhaps a more defining feature of the proposed method could be emphasized against the limitations of the prior art?
- Table 1: Why do DISCO results change much less dramatically when step-by-step instructions are no longer available, compared to the proposed method? |
Fully human-written |
|
SCOUT: Spatial-Aware Continual Scene Understanding and Switch Policy for Embodied Mobile Manipulation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents SCOUT, which addresses key challenges in autonomous robots performing navigation and manipulation in complex environments and achieves state-of-the-art performance on the ALFRED benchmark. SCOUT introduces two main components: Spatial-Aware Continual Scene Understanding with a Scene Modeling Module, and a Switch Policy, which together coordinate navigation and manipulation. The experiments demonstrate SCOUT's effectiveness in navigating and manipulating objects in complex, long-horizon tasks with varying degrees of guidance.
SCOUT combines Spatial-Aware Continual Scene Understanding with an adaptive Switch Policy, which allows real-time switching between long-term planning and immediate task handling. This flexibility improves both task success and efficiency. The experimental results on the ALFRED benchmark demonstrate SCOUT's superiority, surpassing previous methods such as DISCO. The experimental design is comprehensive, thoroughly evaluating SCOUT's performance and the effectiveness of each component.
1. There are many powerful vision foundation models (GroundingDINO, DINOv1-v3, SAM, Embodied-SAM, etc.) that could provide similar functionality to the scene understanding and mask query modules. Although the experiments demonstrate effectiveness, the motivation for training a dedicated semantic segmentation model is unclear.
2. The navigation functionality is too simple; the environment used in the experiments lacks obstacles (it is almost a clean floor), so the agent can move around easily without any path planning.
3. The ALFRED benchmark does not reflect the latest developments in embodied AI in terms of visual realism, task complexity, and interaction diversity. The authors should evaluate on a more recent, challenging benchmark or simulator.
Please refer to the weaknesses above. |
Fully human-written |
|
SCOUT: Spatial-Aware Continual Scene Understanding and Switch Policy for Embodied Mobile Manipulation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes SCOUT, a unified framework for navigation–manipulation coordination. SCOUT addresses loss of historical context, inconsistent scene representation, and rigid control strategies through (1) Spatial-Aware Continual Scene Understanding module that builds a temporally consistent and semantically rich 3D scene representation using cross-attention between current and historical observations, coupled with a Mask Query Module for precise interaction mask generation without relying on depth estimation and (2) a Switch Policy that dynamically alternates between long-term navigation planning and short-term reactive manipulation when interaction opportunities arise. The proposed method is evaluated on the ALFRED benchmark and achieves state-of-the-art success rates.
- The proposed method achieves strong performance over prior work with large margins.
- Updating the semantic spatial map without depth estimation is intriguing and sensible.
- The proposed semantic spatial map can be learned end-to-end, implying its applicability to other learning-based modules.
- The spatial-aware continual scene understanding module assumes perfect actuation of the robot, i.e., that the robot moves to the adjacent cell without error. How sensitive is the scene understanding module to actuation errors, and does the proposed method still work under imperfect actuation?
- The authors argue that previous depth-object-mask co-projection (L050) causes error accumulation from inaccurate depth-semantic lifting, but this is not supported by any evidence. Relevant quantitative analyses are missing.
- The Switch Policy is inspired by a specific failure mode in a downstream task, which raises a concern about its generalizability. What if the field of view were simply made larger? Would the Switch Policy still be needed in that case?
- In the Switch Policy, the short-term planner predicts a binary indicator of whether the current state is manipulable given an egocentric observation. Why not just use the semantic segmentation masks? If the agent can manipulate something, there should be manipulable objects in its view whose masks are detected accordingly.
- SCOUT is sensitive to the choice of grid size, as mentioned in Sec. 4.3, which is specifically tuned to the downstream task. It is nontrivial to determine this hyperparameter for new downstream tasks.
- The proposed approach is validated on a single benchmark, raising a generalizability concern. Can the proposed method be used for other types of embodied mobile manipulation, such as HomeRobot, TEACh, etc.?
- Can the proposed approach be extended to other datasets with unknown camera parameters?
- How much computational cost is needed for high resolution of the semantic map, given that space complexity is $\Theta(HW)$?
- In Table 3, what is "Image Semantic Seg."? Is it from a pretrained segmentation model? Its description is unclear. In addition, are both the "Map" and "Image" modules learned with the same training dataset? |
Fully human-written |
|
SCOUT: Spatial-Aware Continual Scene Understanding and Switch Policy for Embodied Mobile Manipulation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses embodied mobile manipulation on the ALFRED benchmark. The authors argue that prior agents suffer from three coupled issues: (1) historical information is lost when policies act only from the current egocentric view, (2) scene representations are inconsistent because 2D predictions are lifted to 3D through noisy depth, and (3) execution is rigid because the agent cannot interrupt a long navigation plan when a nearer manipulable target appears. The proposed method, SCOUT, has two main components. First, a Spatial-Aware Continual Scene Understanding module maintains a BEV-like 3D scene feature over time using spatial cross attention (projecting 3D points into the current image to fetch the right 2D features, thus avoiding depth estimation) and temporal cross attention (fusing the previous scene feature to preserve memory). It also has a mask-query module that pools object-specific 3D features and aligns them with 2D image features to produce pixel-level interaction masks. Second, a Switch Policy combines a rule-based long-term planner (BFS over the semantic scene map) with a learned short-term planner that can interrupt navigation whenever an object is both semantically correct and spatially reachable. On ALFRED, this yields higher success rates than prior work, including DISCO, in both seen and unseen settings and improves path-length–weighted metrics.
1. The paper is well motivated: it spells out three concrete failure modes in existing embodied agents (history loss, inconsistent 3D grounding, non-adaptive execution) and maps each to a specific component of the method, so the design is coherent.
2. The scene-understanding part is a sensible adaptation of BEV-style and deformable-attention ideas to single-view, temporally accumulated embodied data: instead of lifting 2D to 3D with predicted depth (which causes error accumulation), it pulls 2D features from projected 3D points, and then keeps temporal consistency with a dedicated temporal cross attention.
3. The Switch Policy directly targets a real ALFRED failure case: once the agent turns, a closer instance of the target may appear, and executing the long plan is wasteful. The proposed dual planner (long-term BFS + learned short-term classifier) is a simple but effective way to cut extra steps, which is supported by higher path-length–weighted metrics and the qualitative example.
4. Ablation studies are thorough. Removing temporal cross attention causes large drops; removing the switch policy reduces both success and efficiency; using ground-truth masks gives only small gains, which means the proposed mask-query module is already strong. This makes the main result credible.
5. The method achieves clearly better numbers than strong baselines under both step-by-step and goal-only instruction settings on the official test split, which is nontrivial for ALFRED.
1. On the perception side, the contribution is mostly integrative. The method reuses established ingredients (BEV-style scene feature, deformable attention, 3D→2D projection, 2D–3D feature alignment) and repackages them for ALFRED. The novelty is more in the way these are combined and supervised than in a fundamentally new learning component.
2. The switch policy is only partly learned. The long-horizon component is still a hand-designed BFS over the semantic map, and only the short-horizon “is this manipulable now?” part is trained. This makes the contribution feel somewhat engineered and raises questions about portability to other simulators or to real robots where the semantic map is noisy.
3. The evaluation is confined to ALFRED. Because the method is tuned to ALFRED’s discretization (25 m × 25 m, 25 cm grid, 100 × 100 best) and to its task structure, the generality of the approach is not fully demonstrated. Even a small experiment on a second embodied benchmark would strengthen the claim.
4. The model is relatively heavy (100 × 100 grid, spatial and temporal deformable attention, two segmentation losses), and training uses 8 GPUs for a day, but there is no careful runtime/latency comparison to prior work. For practical embodied deployment, this information would be useful.
5. The paper itself admits that it cannot handle open-vocabulary objects or reason about objects hidden/contained inside others, which are active directions in current embodied AI.
1. Can the switching decision itself be learned end-to-end (for example, via RL over the two planners) rather than partially hand-coded?
2. How sensitive is performance to the 100 × 100 grid if room sizes or step sizes change? You show that 80 and 120 are worse, but would a different dataset require re-tuning this resolution? |
Fully AI-generated |
|
Towards Universal Neural Inference |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper focuses on tabular data learning and proposes to 1) use a set-based transformer to model the arbitrary feature orderings in tabular data and 2) leverage meta-information across datasets via a semantic grounding module. The proposed ASPIRE first maps similar features into a shared semantic space and then builds "atoms" for each (feature, value) pair. A set-based transformer is then used to capture feature interactions while maintaining permutation equivariance. Extensive results on tabular classification and regression tasks show the improvements of the proposed model.
1) The idea of viewing tabular data as "atoms" and applying a set transformer to extract tabular features is novel in this community, and the results on tabular data show the effectiveness of this idea.
2) The proposed two-level aggregation is interesting, and this strategy helps capture different information from various levels.
3) The proposed model develops different value-embedding methods for categorical and continuous data types, which helps extract features more accurately.
1) One of the main concerns is the complex preprocessing required for semantic feature grounding and feature-value atom processing, which may limit the applicability of the proposed model in practice.
2) Additional results for TabLLM [1] and FeatLLM [2] need to be included to analyze the performance of the proposed model.
3) The choice of datasets is small and non-standard; using a large, established benchmark from the literature would lend more credibility to the findings [https://arxiv.org/abs/2305.02997].
[1] TabLLM: Few-shot classification of tabular data with large language models.
[2] Large language models can automatically engineer features for few-shot tabular learning.
Please see the Weakness section. |
Fully human-written |
|
Towards Universal Neural Inference |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes ASPIRE, a foundational model for classification, regression and imputation for tabular data. The paper pre-trains the model on real data and in particular takes column names, column types and dataset description into account. The authors show that ASPIRE outperforms TabPFN on few-shot prediction on 15 standard benchmark datasets.
- The paper adds to the recent work on tabular foundation models, which is a critically important and emerging field.
- The paper in particular is able to make use of column metadata and dataset metadata, features that are rare in foundational models for tabular data, i.e., they are not present in TabPFN, TabICL, or LimiX, though they are present to some degree in TabDPT and CARTE.
- The paper does not compare with any of the recent models that address the same problem, in particular CARTE and TabDPT. Both of these models are able to incorporate column-level meta-data and have been evaluated in the few-shot setting. The only state-of-the-art model evaluated in this paper is TabPFN (assuming this is TabPFN V2, though please clarify if this is the case).
- The presentation of the paper is extremely confusing, in that it emphasizes the set-transformer aspect and the ability to work with varying schemas. Both of these properties are common, and maybe the defining properties, of tabular foundation models, including TabPFN, TabPFN V2, TabICL, LimiX, TabDPT, CARTE, and others. This is assuming the goal is to transfer knowledge between datasets with different schemas (which all of these models do), not to handle varying schemas within the same dataset (which these models do not). If the latter is what is meant in this paper, it should be clarified and motivated. All of these methods are set transformers with respect to the samples, and some of them are also set transformers with respect to the features (TabPFN V1 is not, but TabPFN V2 is, as are LimiX and TabICL, although these use learned embeddings to distinguish columns and are therefore not invariant to column reordering; CARTE is invariant, IIRC). GAMFormer is completely equivariant with respect to both features and samples.
- The masking scheme and universal fill-in described in this paper are novel relative to TabPFN and TabICL, but very similar to the one used in LimiX, and a comparison should be made there.
- CARTE uses a column-name embedding somewhat similar to the one described in this paper, and a clearer comparison should be made.
- The selection of benchmark datasets is somewhat unclear. Not using a standard benchmark such as TabArena, TabZilla, Talent, or OpenML CC-18 makes comparison to other works harder and opens the possibility of dataset selection bias.
### Minor notes
- Line 203: the reference to the figure seems broken and it's unclear which figure is referenced.
- Definition 1: the definition of a permutation is unclear. Usually a permutation is a function $\pi: \{1, \dots, n\} \to \{1, \dots, n\}$; a common alternative is to write a permutation matrix $P_{\pi}$, but in neither case would $\pi(x)$ be defined. I think a definition such as $\pi(x)_i = x_{\pi(i)}$ should be added (I assume this is what is meant here, i.e., the coordinates are permuted); a possible phrasing is sketched below.
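A minimal statement of the convention I have in mind (standard notation, mine rather than the paper's):

```latex
Let $\pi : \{1,\dots,n\} \to \{1,\dots,n\}$ be a bijection and let
$P_{\pi} \in \{0,1\}^{n \times n}$ with $(P_{\pi})_{ij} = \mathbb{1}[\, j = \pi(i) \,]$.
For $x \in \mathbb{R}^{n}$, define the action of $\pi$ coordinate-wise:
\[
  \pi(x)_i = x_{\pi(i)}, \qquad \text{equivalently} \qquad \pi(x) = P_{\pi}\, x .
\]
A map $f$ is permutation-invariant if $f(\pi(x)) = f(x)$ for every $\pi$,
and permutation-equivariant if $f(\pi(x)) = \pi(f(x))$.
```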
- Did you use TabPFN V2 or V1 for the comparison?
- Is there a reason not to compare to TabDPT and CARTE that solve the semantic few-shot task as well?
- What is the difference in the masking scheme used in ASPIRE compared to LimiX?
- By "fixed schema" did you mean "fixed schema" across datasets or within a dataset? How is the generalization of ASPIRE different than the one in the other foundational models? |
Fully human-written |
|
Towards Universal Neural Inference |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces ASPIRE (Arbitrary Set-based Permutation-Invariant Reasoning Engine), which aims to work with heterogeneous tabular data. It tries to solve a few problems: 1) generalizing across feature names and meanings, 2) working on unseen datasets with either zero-shot or few-shot prediction, and 3) remaining permutation invariant. The core method is a set of different semantic encodings for categorical and continuous features together with the use of a set transformer. The paper conducts simple experiments showing that ASPIRE performs better than other models on a number of evaluation datasets.
1. The introduction of set modeling: the authors believe set modeling is the right framework and propose an architecture that realizes it. However, please see the first point of the weaknesses section as well.
2. The authors conduct experiments comparing against LLM, TabPFN, and CM2 baselines under the few-shot setting, which was not done in prior work.
3. The combined use of the Fourier transform and the set transformer in this setting is novel.
### Originality and Significance
1. The significance of set modeling is not fully justified.
* What is the main purpose of using set modeling and in what specific applications is non-set modeling undesirable?
* Isn't XGBoost, or a tree-based model in general, already modeling a set?
### Quality and Clarity
1. Missing references: TabICL, TabDPT should be compared to or at least cited.
2. [line 203]: Figure C1 does not appear to be labeled
3. How the Fourier transform is done exactly (with or without learned frequencies) and how the set transformer works should be included in the background; a brief sketch of the two variants I have in mind follows this list.
4. Missing confidence intervals in Table 1 and 2.
5. TabPFN appears in the 5-shot and 0-shot settings but not in the fine-tune section; similarly, XGB appears in the fine-tune section but not in 5-shot. (Can XGB even be trained on just 5 samples?)
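To clarify what I am asking in point 3 above, here is a minimal sketch of the two variants of a Fourier-feature value embedding, fixed versus learned frequencies, that the background section should distinguish (this is my own illustration; the class and parameter names are hypothetical, not the authors' code):

```python
import torch
import torch.nn as nn

class FourierValueEmbedding(nn.Module):
    """Embed a scalar feature value with sin/cos features.
    learned=False: frequencies are fixed (log-spaced); learned=True: frequencies are trainable."""
    def __init__(self, num_frequencies: int = 16, learned: bool = False):
        super().__init__()
        freqs = 2.0 ** torch.arange(num_frequencies, dtype=torch.float32)
        if learned:
            self.freqs = nn.Parameter(freqs)
        else:
            self.register_buffer("freqs", freqs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., 1) scalar values -> output: (..., 2 * num_frequencies)
        angles = x * self.freqs
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```

Stating which of these (or a different construction) is used, and how the frequencies are chosen, would make Section 4.3 much easier to reproduce.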
1. Section 4.3: Are the Fourier transform parameters learned or fixed? This detail needs to be specified.
2. How is TabPFN used for 0-shot? (TabPFN requires at least 1 context sample since the query token does not attend to itself.)
3. Why are F1 score and RMSE chosen for Table 1 and 2? Do we get similar result with AUC and R2? |
Fully human-written |
|
Towards Universal Neural Inference |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper focuses on the task of constructing a foundation model for tabular data. Specifically, the proposed method combines semantic grounding, atom processing, set-based instance representation, and universal inference to train a foundation model, which can then generalize to different unseen datasets without fine-tuning.
S1: The studied problem is important.
S2: The paper structure is clear and easy to follow.
W1: Towards the motivation. The three mentioned challenges seem to have been raised, and at least partially addressed, by previous studies on tabular foundation models. For example, GTL [1] pointed out the first challenge of schema heterogeneity, TP-BERTa [2] addressed the second challenge of feature permutation invariance, and GTL [1] also considered the third challenge of semantic grounding. These are therefore the general challenges in constructing tabular foundation models. What unique challenges does this paper solve that are not solved by previous papers?
W2: Towards the technical novelty. The proposed method appears to be an alternative solution to these existing challenges rather than a fundamentally new one. It has four main components, i.e., semantic grounding, atom processing, set-based instance representation, and universal inference. In detail, many papers, e.g., GTL [1] and TP-BERTa [2], have considered semantic grounding. TP-BERTa [2] proposed an idea similar to atom processing that converts each feature-value pair into one unit. TP-BERTa [2] and TabPFNv2 [3] also use permutation-invariant prediction, which is similar to the set-based instance representation. Finally, the core idea of universal inference, detailed in Eq. 1, is similar to the core idea of GTL [1]. The authors need to justify what additional improvements or unique advantages they achieve compared to these existing methods.
W3: Towards the technical details. Some details need further explanation. For example, (1) in what case would the output dimension be the same as the input dimension, according to Definition 2? (2) In Eq. 2, why is E_dtype learnable? It seems to represent only the numerical or categorical type; if it is learnable, it will have multiple (continuous) states rather than a binary status. (3) For a specific feature, how are E_dtype and E_choices initialized?
W4: Lack of comparison with SOTA foundation models. In the experiments, is the TabPFN used the first or the second version? The recent second version of TabPFN supports regression. Moreover, many foundation models have recently followed TabPFN, such as TabICL [4]. In addition, TP-BERTa [2] and XTab [5], which are mentioned in the related work, are not considered. GTL [1] could also be compared.
W5: Lack of comparison with SOTA few-shot learning methods. At least TabLLM, which is mentioned in the related work, should be compared against, though it only supports classification.
[1] From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models. KDD'24.
[2] Making pre-trained language models great on tabular prediction. ICLR 2024.
[3] Accurate predictions on small data with a tabular foundation model. Nature 2025.
[4] TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. ICML 2025.
[5] XTab: Cross-table Pretraining for Tabular Transformers. ICML 2023.
Please see the weaknesses above; the authors need to address all of these points. For W4 and W5 in particular, the authors should at least give compelling reasons why these baselines are not considered. |
Fully human-written |
|
H2IL-MBOM: A Hierarchical World Model Integrating Intent and Latent Strategy for Opponent Modeling in Multi-UAV Game |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces H2IL-MBOM, a framework for opponent modeling designed to address non-stationarity in multi-agent adversarial environments. The method's core is a hierarchical world model that mimics human cognitive processes by decomposing the complex task of reasoning about an opponent into two levels. At a high level, the model infers an opponent's macro "intention" by analyzing their historical trajectories. Subsequently, at a low level, it uses this inferred intention as a condition to deduce the specific "latent strategy" the opponent is employing, taking into account the reactions and movements of allied agents. This framework is implemented through a complex neural architecture based on Transformers and Hypernetworks (HyperHD2TSSM) and is used to guide a PPO-based reinforcement learning agent. The authors report that their method demonstrates superior performance compared to various baselines in several testbeds, including multi-UAV combat, the StarCraft Multi-Agent Challenge, and Google Research Football.
- The paper introduces a novel approach to opponent modeling inspired by human cognition, decomposing the complex problem into a two-level hierarchy of high-level "intentions" and low-level "latent strategies". This provides a structured and theoretically-grounded new perspective for the field.
1. This paper, in its current state, is difficult to accept. The core issue is not just a matter of style, but a fundamental lack of clarity in its presentation that prevents a proper scientific review. The manuscript is plagued by a host of minor yet cumulative errors that suggest a lack of care in its preparation. For instance, citations are not properly formatted (lacking \citep or \citet), leading to overlaps with the text. There are basic punctuation errors (e.g., missing periods on lines 199 and 218), inconsistent formatting (the acronym HJLGT is sometimes italicized, sometimes not), and poor typesetting (some words are hyphenated across lines between 249-269). Furthermore, the figures are poorly executed; the text in Figure 3 is very small, while the architectural diagrams in Figures 5 and 6 are so cluttered they seem designed more to showcase the model's complexity than to explain it. The overall writing quality, with its convoluted sentence structures and excessive jargon, resembles unedited text generated by a large language model, a practice that should be acknowledged if used.
2. This poor presentation directly obscures the methodology. The main body of the paper has been effectively "hollowed out," with critical information relegated to the appendix. For example, MSOAR-PPO is listed as a key contribution, but its mechanics are entirely absent from the main methodology section. Similarly, the dimensionality of the core latent variables for "intention" and "strategy" — a crucial implementation detail — is only found in a table in the appendix. The reader should not have to be a detective, piecing together the core method from scattered fragments. This forces a reviewer to question the paper's central claims. The entire framework rests on a rigid two-level hierarchy where "intention" dictates "strategy," a strong cognitive assumption presented without justification. The paper also fails to provide evidence that the learned latents, $z_I$ and $z_L$, are actually disentangled. The t-SNE plots are insufficient as they only visualize clustering, not semantic meaning. A more rigorous analysis, such as performing interventional experiments (e.g., fixing the "intention" latent while varying the "strategy" latent and observing the impact on generated trajectories), is needed to validate that these variables meaningfully represent their claimed concepts. A sketch of such a probe is given after this list.
3. The experimental evaluation is similarly unconvincing. The decision to place the results on standard benchmarks (SMAC and GRF) in the appendix is a major flaw that undermines the paper's claims of generalizability; these should be in the main paper. The primary UAV experiment relies on comparisons against opponent modeling baselines (e.g., ROMMEO, PR2) that perform catastrophically. Their complete failure strongly suggests a lack of proper hyperparameter tuning for this complex, continuous-control environment. For a fair comparison, the authors must either provide evidence of a thorough tuning process for these baselines or include stronger, more suitable ones. The ablation study, while showing that components are useful, does not justify the model's immense complexity. The fact that simpler variants (e.g., the only_intentions model from Fig. 4a and especially the transformer_shareadd model from Fig. 4f, which performs nearly identically to the full model) are still effective raises a critical question about the cost-benefit trade-off. The authors should provide a discussion justifying why the marginal performance gain of their full architecture warrants its significant complexity over these simpler, yet competent, alternatives.
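As a concrete version of the interventional analysis suggested in point 2 above, something along the following lines would be convincing. This is a rough sketch only: `decode_trajectory` and the latent shapes are placeholders for whatever interface the authors' model actually exposes, not functions from the paper.

```python
import torch

@torch.no_grad()
def disentanglement_probe(model, z_intention, z_strategy, n_samples=32):
    """Hold one latent fixed, resample the other, and measure how much the
    generated opponent trajectories change. If the latents are disentangled,
    varying z_strategy should alter tactical details while varying
    z_intention should alter the overall goal-directed behavior."""
    base = model.decode_trajectory(z_intention, z_strategy)  # placeholder API
    d_vary_strategy = torch.stack([
        (model.decode_trajectory(z_intention, torch.randn_like(z_strategy)) - base).abs().mean()
        for _ in range(n_samples)
    ]).mean()
    d_vary_intention = torch.stack([
        (model.decode_trajectory(torch.randn_like(z_intention), z_strategy) - base).abs().mean()
        for _ in range(n_samples)
    ]).mean()
    return d_vary_strategy.item(), d_vary_intention.item()
```

Reporting such intervention effects, ideally alongside trajectory-level metrics (e.g., which goal is pursued vs. which maneuver is used), would be far more convincing than t-SNE clusters.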
1. The prior distribution for an agent's latent state (e.g., $p(z_{I,i,t}|...)$ on page 5) explicitly conditions on the latent states and actions of its neighbors ($z_{I,n_i,t-1}, a_{n_i,t-1}$). How is this neighbor information accessed or communicated between agents during execution, especially within what is described as a decentralized execution paradigm?
2. Appendix A.9 states that the value function for MSOAR-PPO "does not incorporate respective opponents' observations," distinguishing it from MAPPO. However, the policy is conditioned on these observations ($O_{opp,i,t}$). In a centralized training paradigm, why would the critic be deprived of information that is available to the actor, as this typically undermines the core benefit of CTDE?
3. In the scalability tests (Appendix A.11), a 4 vs. 6 engagement resulting in a 3:4 survival rate is described as a success where the smaller team "destroys one more aircraft than the blue team". Could you clarify this interpretation, as a 3:4 result (Red:Blue) means the 4-agent team lost one agent while the 6-agent team lost two, which is not an equal or better kill-death ratio per capita (0.25 vs 0.33 losses per agent)?
4. The reward functions in Appendix A.3 are highly complex. Specifically, the formulas for "reward regarding position of planes" (Eq. 2) and "reward regarding velocity of planes" (Eq. 8) appear to use a very similar calculation for the variable `dd` based on antenna and aspect angles (ATA, AA). Could you explain the rationale for using this angular metric to modulate both a position-based and a velocity-based reward?
5. In the H2TE module (Appendix A.5.1, Eq. 13), the weight $w_{H,i,j,t}$ for agent `i`'s view of opponent `j` is generated from $H_{j,t}$, which is defined as the observation history of opponent `j` relative to *all* N cooperative agents. How does an individual agent `i` get access to the opponent's historical observations relative to its teammates during decentralized execution?
6. The problem is defined as a partially observable one where agents only have local observations. However, the key transition model `HJLGT` (Appendix A.6) and the prior distributions both explicitly use neighbor states and actions as inputs. Does this imply that agents are assumed to have perfect, noise-free observation of their immediate neighbors' states and actions, and if so, shouldn't this be stated as a key assumption in the problem formulation? |
Lightly AI-edited |
|
H2IL-MBOM: A Hierarchical World Model Integrating Intent and Latent Strategy for Opponent Modeling in Multi-UAV Game |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper studies cooperative-competitive MARL. It proposes to learn both intention and strategy representations of opponents and to utilize this information to update the beliefs and policies/strategies of the agents involved in the game. The paper conducts experiments on several MARL benchmarks, including Gym-JSBSim, SMAC, and GRF, comparing against multiple baselines that include both opponent-model-free and opponent-model-based MARL algorithms.
1. The paper introduces a new approach to modeling opponents' decision-making (with the goal of separating intentions from strategies) and integrates it into the strategy/policy learning of the agents in the game.
2. Experiments show promising results, as the proposed method is shown to perform better than other strong baselines on various benchmarks.
1. Writing and Clarity
The paper is not well written. In particular, Section 3—the main technical section—requires substantial revision. The section consists of long, dense paragraphs that lack a clear and structured explanation of the proposed model. The heavy use of acronyms and lengthy equations further obscures the main ideas rather than clarifying them. More importantly, the intuitions and justifications behind the model’s components, as well as how they connect, are not clearly explained. A concise, intuitive description of the model and its motivation is necessary to make the paper accessible and convincing.
2. Separation Between Intention and Strategy
The paper needs stronger justification for the proposed separation between intention and strategy. The key question is how the proposed components actually capture intention and distinguish it from strategy prediction. The manuscript does not clearly explain what specific mechanisms or characteristics of H2TE-MITD and LHTE-MLTD enable this distinction. The authors should provide clearer explanations to support the claim that these modules can meaningfully separate intentions from strategies.
3. Cooperative–Competitive Setting
Although the paper discusses a mixed cooperative–competitive setting, the proposed approach appears primarily designed to address competitive interactions. It remains unclear how the model effectively handles both cooperation and competition within the same framework.
4. Baseline Performance and Reliability of Results
The reported performance of baseline methods, such as MAPPO on SMAC environments, is significantly lower than in established works (e.g., the recent HAPPO paper). This discrepancy raises concerns about the experimental setup and the reliability of the reported results. The authors should verify their implementations, hyperparameters, and training conditions to ensure a fair and credible comparison.
5. Supplemental Material
The supplemental zip file could not be opened, preventing further examination of the additional materials. Please ensure that the supplementary files are correctly packaged and accessible.
Please address my concerns raised in Weaknesses. |
Lightly AI-edited |
|
H2IL-MBOM: A Hierarchical World Model Integrating Intent and Latent Strategy for Opponent Modeling in Multi-UAV Game |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes an opponent modeling method, H2IL-MBOM, that integrates multi-intention and latent strategy inference into a world model. H2IL-MBOM combines high-level intention inference with low-level strategy prediction to deal with non-stationary dynamics in multi-agent environments. H2IL-MBOM is combined with PPO, which results in MSOAR-PPO. The effectiveness of the method is evaluated on several multi-UAV games.
- The idea of applying world models to opponent modeling is interesting.
- The paper is poorly written. For instance, many concepts (e.g., intentions, mental states, strategies) lack clear definitions. Many figures in the experimental section are hard to interpret, and the captions are not informative (e.g., Figure 4). The many abbreviations also make the paper hard to follow.
- Lack of novelty in the proposed method. Similar ideas (e.g., reasoning about latent strategies based on teammates' historical responses) have been extensively explored in previous opponent modeling methods, such as [1-3], which are missing from the Related Work section.
- Most of the baselines (e.g., MADDPG, MAPPO) are not opponent modeling methods. It is necessary to compare with SOTA opponent modeling methods such as [1-3].
[1] Greedy when sure and conservative when uncertain about the opponents, ICML 2022
[2] Conservative offline policy adaptation in multi-agent games, NeurIPS 2023
[3] Opponent modeling with in-context search, NeurIPS 2024
- What exactly do you mean by intentions, strategies, and mental states? Could you provide clear definitions of these concepts? |
Fully human-written |
|
H2IL-MBOM: A Hierarchical World Model Integrating Intent and Latent Strategy for Opponent Modeling in Multi-UAV Game |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies decision-making problems in mixed-motive scenarios where cooperation and defection coexist. The paper proposes a method that takes into account the nested interaction between agents' intents and strategies (for both opponents and allies). Without relying on other agents' private information, the method hierarchically infers opponents' intents and intent-based latent strategies and predicts their influence on the behaviors of allies. The experiments on Gym-JSBSim, SMAC, and GRF validate the superior effectiveness of the proposed method over existing model-free and model-based approaches.
The paper focuses on both intent-based strategies and the interactions among agents’ intents and strategies. It proposes a transformer-based hierarchical opponent inference and decision-making method within the reinforcement learning framework, and extensive experiments across three environments verify its effectiveness. Overall, the paper presents a substantial amount of work with comprehensive experiments and detailed methodological development and contributes valuable insights to the study of intent modeling and opponent inference in mixed-motive multi-agent systems.
1. It is not easy to follow this paper because of the inconsistent use of symbols and technical terms. See details in **Questions**.
2. In mixed-motive games, agents should consider both allies' and opponents' intents and strategies, whereas the proposed method insufficiently addresses allies' intents and strategies. This may lead to a failure of coordination within the team.
3. The introduction does not include any citations. From the introduction, it is unclear how this project is related to mixed-motive games. Please further refine and reorganize the introduction section.
1. At the high level, the observation model $p_{\theta_I}$ predicts observations $O_{opp,i,t}$ based on intents $z_{I,i,t}$, while the observations are in turn used by $q_{\phi_I}$ to infer intents. With such coupling, model errors may gradually accumulate. How do the authors address this issue? The same concern applies to the low level.
2. In line 269, why does the policy take into account allies' intents and latent strategies? In mixed-motive games, individuals need not only to model their opponents but also to consider the behaviors of their allies in order to achieve better coordination.
3. Do $p_{\theta_I}$ in line 257 and $p_{\theta_L}$ in line 266 predict observations rather than trajectories?
4. In Section A.3, opponents' relative positions, relative velocities, angles, and distances relative to others are included in the observation, which is inconsistent with the statement in line 224 that opponents' actions are not observable.
5. There seems to be no fundamental difference between stage 2 and stage 3 in subsection 3.1.
6. The notation used in the paper is somewhat confusing. For example, in line 186, $a_{i,t}\sim \pi(\cdot \mid o_{i,t}, z_{I,i,t}, z_{L,i,t})$ indicates that agent $i$'s action depends only on the cooperative agent's own $o_{i,t}$, $z_{I,i,t}$, and $z_{L,i,t}$, where $N$ is the number of cooperative agents. This is inconsistent with the statement in Section 3 that "cooperative agents update their policies based on trajectory and observations and inferred opponent intentions and strategies". Please modify the problem formulation and unify the notation. |
Fully human-written |
|
Multimodal Datasets with Controllable Mutual Information |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a novel framework for generating high-dimensional multimodal datasets with controllable mutual information. By using flow-based generative models, the method ensures that mutual information between latent variables is preserved, providing a theoretical foundation. The paper also designs a structured causal framework to generate correlated latent variables, derives closed-form analytical formulas for mutual information, and provides examples of synthetic multimodal datasets illustrating different causal and correlation patterns.
1. The paper proposes a framework for generating high-dimensional multimodal data with controllable mutual information, which is rarely achieved in existing public datasets or prior methods.
2. By leveraging flow-based generative models, the approach ensures that the generated data preserves mutual information between latent variables, providing a theoretical foundation.
1. All experiments are conducted solely on CIFAR-10 image data, without demonstrating results on real multimodal datasets (e.g., CMU-MOSI, CMU-MOSEI, or video-text-audio combinations).
2. The paper does not evaluate the generated data on downstream tasks (e.g., regression or classification), making it difficult to quantitatively assess its contribution. It also lacks direct comparison with existing mutual information estimators or multimodal SSL approaches.
3. Some concepts (e.g., the template and flow matching) are not intuitive to non-expert readers, and overall readability could be improved. Moreover, the paper is limited to 8 pages, whereas the ICLR 2026 initial submission allows up to 9 pages.
1. How does the generated data impact performance on downstream tasks, such as regression or classification?
2. Could the authors provide a comparison of their approach with existing mutual information estimators or multimodal SSL methods to better contextualize the contributions? |
Moderately AI-edited |
|
Multimodal Datasets with Controllable Mutual Information |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a framework for generating multimodal datasets where the mutual information between modalities is measurable and controllable. This is useful for the many works that study the mutual information between modalities and labels in multimodal training dynamics.
- The framework for generating controllable mutual information seems correct and insightful.
- There are a lot of important use cases for this: so many multimodal works look at training based on mutual information. Having it controlled synthetically would be a powerful and useful testbed for that research, and could lead to an important breakthrough in that field.
Unfortunately, I don't think this paper did quite enough to justify that this framework could be used for the strengths I outlined above. A few key points:
- What is the practical utility of this work? You could for example show that your dataset provides training transfer to realistic environments with mutual information-dependent training methods. But without that, how do we know the value of the data you generate with your method?
- If there isn't transfer of performance or key insights from training, what insights can you get by studying models on this dataset, and will those insights transfer to models' behavior on real world datasets? If so, this could be a useful prototyping tool that allows people to run and understand experiments theoretically before doing computationally expensive and confusing training runs on messy real world data. For example, can you show that some findings from prior work are mirrored in your setting, and can be ascertained quickly and reliably, whereas training on a full real world dataset would be costly and noisy?
- How would you simulate or handle modality imbalances, where some data have missing modalities or there are large amounts of unimodal data?
- I didn't understand the black hole example. Could you clarify the motivation? |
Fully human-written |
|
Multimodal Datasets with Controllable Mutual Information |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a framework for generating synthetic data with controlled mutual information using flow-based generative models and a structured causal framework. The illustration of the data generation pipeline is followed by two brief discussions of example usage: generating synthetic data for different underlying causal structures and for modalities of different scales.
- The paper is well-motivated, as there has been emerging interest in multimodal learning from an information-theoretic perspective, and this paper provides a well-suited, controlled testbed for such research;
- The proposed data generation pipeline is novel, well-documented and clearly explained;
- One major limitation of this work is the lack of empirical evaluation, neither qualitative (e.g., Figure 2, where the paper itself acknowledges that "there is no clear visual connection between these pairs of images") nor quantitative. This makes it **very hard to verify the correctness** of the proposed framework. In particular, the reviewer does not agree with the claim that "our framework allows us to state unequivocally that these high-dimensional, complex datasets have a specific amount of mutual information" given this lack of empirical evidence. The paper also does not provide any empirical evaluation using the synthetic data generated by the proposed pipeline, so **the claims about practical utility are also not substantiated**.
- The reviewer strongly recommends adding more empirical evaluation of the proposed pipeline to show (1) correctness (e.g., qualitatively or quantitatively verifying that the generated data indeed carry the specified amount of mutual information) and (2) utility, via a minimal set of evaluations of existing information-theoretic multimodal learning approaches on the generated data, followed by an analysis of the results and of the potential insights that could be meaningful for multimodal learning research from an information-theoretic perspective. A minimal sketch of one possible quantitative check is given below.
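For (1), one possible quantitative check is sketched below, under the assumption that the two generated modalities can be flattened into vectors; the network sizes and names are mine, not the authors'. The idea is to train a MINE-style critic (Belghazi et al., 2018) on paired samples and compare the converged lower bound against the analytical mutual information the framework claims.

```python
import torch
import torch.nn as nn

class MINECritic(nn.Module):
    """Donsker-Varadhan lower bound on I(X; Y), estimated from paired samples."""
    def __init__(self, dim_x: int, dim_y: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def lower_bound(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        joint = self.net(torch.cat([x, y], dim=-1)).mean()      # E over samples from p(x, y)
        y_perm = y[torch.randperm(y.size(0))]                   # break pairing -> samples from p(x)p(y)
        marginal = self.net(torch.cat([x, y_perm], dim=-1))
        return joint - torch.log(torch.exp(marginal).mean() + 1e-12)

# Usage sketch: x, y are (N, dim) tensors of the two flattened modalities.
# critic = MINECritic(dim_x=x.size(1), dim_y=y.size(1))
# opt = torch.optim.Adam(critic.parameters(), lr=1e-4)
# for _ in range(num_steps):
#     opt.zero_grad(); loss = -critic.lower_bound(x, y); loss.backward(); opt.step()
# The converged lower bound (in nats) can then be compared to the analytical MI.
```

Agreement (or systematic deviation) between such an estimate and the nominal value would directly support or qualify the correctness claim. |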
Fully human-written |