ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 4.00 | 4.00 | 2414 |
| Heavily AI-edited | 1 (25%) | 4.00 | 2.00 | 1168 |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 2 (50%) | 4.00 | 4.50 | 2427 |
| Fully human-written | 0 (0%) | N/A | N/A | N/A |
| Total | 4 (100%) | 4.00 | 3.75 | 2109 |
Individual Reviews
Title: From Motion to Behavior: Hierarchical Modeling of Humanoid Generative Behavior Control
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces Generative Behavior Control (GBC), a new task aimed at synthesizing long-horizon, goal-directed, and physically plausible humanoid behaviors. The authors present two main contributions:
* The PHYLOMAN framework, which integrates LLM-based planning, a novel Multi-segment Parallel Motion Diffusion Model (MP-MDM), and physics-based control for behavior generation.
* The GBC-100K dataset, a large-scale dataset combining SMPL motion estimations and hierarchical textual annotations, designed to support and evaluate GBC.
Experiments on HumanML3D and GBC-100K are reported, with claims that the method generates more diverse, semantically coherent, and longer motion sequences than prior baselines.

Strengths:
1. Ambitious scope: The work reframes the field from motion generation to behavior generation, highlighting the importance of goal-directedness.
2. Scalability: Dataset construction is large-scale, leveraging ∼500k videos and semi-automated annotation pipelines, which could benefit the community if released.
3. Long-horizon motion generation: The MP-MDM parallel generation strategy is technically interesting and addresses efficiency for multi-second or minute-long behaviors.

Weaknesses:
1. Dataset reliability:
* The dataset relies on monocular SMPL estimation (TRAM) as its “gold standard,” which is problematic because SMPL estimates often drift in translation even when subjects are static (e.g., in the provided example clip `H--TB3aFpxY_000115_000125`, the person stands still while the SMPL translation varies; see the sketch after this review for one way to quantify such drift). This undermines claims of physical plausibility.
* Using noisy pseudo-ground-truth motion as the foundation of a benchmark (evaluation target) introduces significant bias; it is therefore questionable whether GBC-100K can be considered a reliable benchmark.
2. Evaluation design flaws:
* In Table 2, comparisons confound dataset quality with model design, since training and test data are altered simultaneously. To properly assess the dataset’s contribution, training sets should vary while the test set remains fixed.
* In Tables 3 and 4, if the evaluation test set is indeed GBC-100K, then performance comparisons may simply reflect train-test overlap. Distribution similarity between train and test splits risks inflating results and obscures real generalization ability.
* Physical grounding gap: Despite emphasizing “physics-informed” behavior generation, much of the evaluation remains in SMPL parameter space without showing how the motions transfer to physically simulated humanoids.

Questions:
GBC-100K is based on monocular SMPL estimations, which often have severe physics artifacts. Please see Weaknesses.

EditLens Prediction: Lightly AI-edited
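A minimal sketch of one way to quantify the root-translation drift described in the review above, assuming per-frame SMPL root translations are available as a (T, 3) NumPy array in metres; the export file name and array format are assumptions, not artifacts provided by the paper:

```python
# Hypothetical drift check for a clip in which the subject stands still.
# Assumes a (T, 3) array of SMPL root translations in metres; the .npy export
# and its name are assumptions for illustration only.
import numpy as np

def translation_drift(transl: np.ndarray, fps: float = 30.0) -> dict:
    """Summarise how much the SMPL root translation wanders over a clip."""
    disp = np.diff(transl, axis=0)                         # per-frame displacement, (T-1, 3)
    step_len = np.linalg.norm(disp, axis=1)                # metres per frame
    return {
        "mean_speed_mps": float(step_len.mean() * fps),    # average root speed
        "total_path_m": float(step_len.sum()),             # total path length of the root
        "net_offset_m": float(np.linalg.norm(transl[-1] - transl[0])),  # start-to-end offset
    }

if __name__ == "__main__":
    # Assumed export of the clip mentioned in the review.
    transl = np.load("H--TB3aFpxY_000115_000125_transl.npy")
    print(translation_drift(transl))
```

For a genuinely static clip, total_path_m should stay near zero; a large path length combined with a small net offset is the wandering-root pattern the reviewer points to.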
Title: From Motion to Behavior: Hierarchical Modeling of Humanoid Generative Behavior Control
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces Generative Behavior Control (GBC), a new task for generating long-term, goal-oriented, and physically plausible humanoid behaviors. The authors propose PHYLOMAN, a hierarchical framework combining LLM-based high-level planning and physics-informed motion control, supported by the new GBC-100K dataset. Experiments show improvements in behavior diversity, semantic alignment, and motion length compared to baseline methods.

Strengths:
- GBC formalizes long-term behavior generation, addressing key gaps in motion generation research.
- PHYLOMAN integrates hierarchical planning and physics-based control, bridging high-level semantics and low-level execution.
- GBC-100K provides a valuable, hierarchically annotated benchmark for behavior generation.

Weaknesses:
- Claims of goal-orientation and semantic coherence lack rigorous task-driven evaluation.
- Comparisons are primarily with motion generation methods, not task-and-motion planning approaches.
- Automated annotations may introduce noise; dataset limitations are not fully analyzed.
- Lack of detailed ablations to isolate the contributions of hierarchical planning and MP-MDM.

Questions:
Please see Weaknesses for details.

EditLens Prediction: Heavily AI-edited
Title: From Motion to Behavior: Hierarchical Modeling of Humanoid Generative Behavior Control
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces a hierarchical framework, PHYLOMAN, for Generative Behavior Control (GBC), combining language-driven planning, diffusion-based motion generation, and physics-based control, and constructs a large-scale hierarchical text-to-motion dataset.

Strengths:
1. The paper introduces a hierarchical framework, PHYLOMAN, for Generative Behavior Control (GBC), combining language-driven planning, diffusion-based motion generation, and physics-based control.
2. The paper constructs a large-scale hierarchical text-to-motion dataset with three levels of structured annotations: BehaviorScript, PoseScript, and MotionScript.

Weaknesses:
1. While the proposed PHYLOMAN framework is structurally coherent, its components (an LLM-based planner, a motion diffusion model, and a physics controller) are largely based on existing paradigms.
2. Although the paper cites MotionAgent (Wu et al., 2024) as a representative language-to-motion framework, there is no direct experimental comparison or analysis.
3. In the main experimental section, PHYLOMAN is not included in the key comparison Table 2, which presents quantitative results across baselines. The authors should include PHYLOMAN in Table 2 using the same configuration and evaluation metrics as the other methods.
4. The GBC-100K dataset, described as containing 123.7K motion sequences and 250 hours of video, introduces hierarchical annotations: BehaviorScript, PoseScript, and MotionScript. While this is valuable, several issues arise: the reported W-MPJPE ≈ 222 mm (Table 7) remains quite large for a high-quality motion dataset; the evaluation only includes PA-MPJPE, W-MPJPE, and RTE, and while the authors acknowledge the presence of typical error types in their data, additional aspects of physical consistency should be addressed, such as foot sliding, body penetration, failure ratio, and temporal jitter (see the sketch after this review for examples of such metrics); and there is no analysis of long-horizon temporal consistency, which is crucial for “ultra-long” behaviors.
5. The paper repeatedly refers to the motion diffusion backbone as “parallel-in-time”, implying computational efficiency. However, there is no quantitative evidence (e.g., speedup, training cost, or memory footprint) to support this claim.

Questions:
Please refer to the Weaknesses.

EditLens Prediction: Lightly AI-edited
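A minimal sketch of two of the physical-consistency metrics the review above asks for (foot sliding and temporal jitter), assuming global joint positions of shape (T, J, 3) in metres with a z-up convention; the joint indices, contact threshold, and frame rate are assumptions for illustration, not values taken from the paper:

```python
# Hypothetical foot-sliding and jitter metrics for pseudo-ground-truth motion.
# Input: joints, a (T, J, 3) array of global joint positions in metres (z-up assumed).
import numpy as np

FOOT_IDS = [10, 11]     # assumed left/right foot indices in an SMPL-style skeleton
CONTACT_HEIGHT = 0.05   # feet below 5 cm are treated as in ground contact (assumption)

def foot_sliding(joints: np.ndarray, fps: float = 30.0) -> float:
    """Mean horizontal foot speed (m/s) on frames where the foot is in contact."""
    feet = joints[:, FOOT_IDS, :]                                    # (T, 2, 3)
    contact = feet[:-1, :, 2] < CONTACT_HEIGHT                       # contact mask, (T-1, 2)
    horiz_disp = np.linalg.norm(np.diff(feet[:, :, :2], axis=0), axis=-1)  # (T-1, 2)
    slide = horiz_disp[contact] * fps
    return float(slide.mean()) if slide.size else 0.0

def temporal_jitter(joints: np.ndarray, fps: float = 30.0) -> float:
    """Mean joint acceleration magnitude (m/s^2) as a simple smoothness proxy."""
    accel = np.diff(joints, n=2, axis=0) * fps * fps                 # (T-2, J, 3)
    return float(np.linalg.norm(accel, axis=-1).mean())
```

Reporting such numbers alongside PA-MPJPE, W-MPJPE, and RTE would directly address the review's concern that pose-accuracy metrics alone do not capture physical plausibility.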
Title: From Motion to Behavior: Hierarchical Modeling of Humanoid Generative Behavior Control
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes a hierarchical planning and diffusion-based framework (PHYLOMAN) and builds a new dataset, GBC-100K, for long-term behavior generation. The motivation (bridging semantic intention and physical motion) is relevant and aligned with the community's long-term goals.

Strengths:
- The paper highlights a meaningful research gap, moving from short-term motion generation to long-horizon, goal-directed behavior control, which is conceptually valuable for embodied AI.
- The paper proposes a full pipeline combining language planning, motion generation, and physics-based execution.
- The proposed dataset GBC-100K is relatively large compared to many prior datasets and includes hierarchical semantic annotations, which could support richer planning and evaluation of long-duration behaviors.

Weaknesses:
- The proposed framework mainly combines existing components: LLM-based behavior planning, diffusion motion models, and physics-based controllers. The integration appears incremental and does not introduce new theoretical insights or algorithmic advances. Claiming a “first unified solution” is overstated, given recent works combining language, motion priors, and controllers.
- The dataset is largely auto-annotated using pose estimation with LLM captioning, raising concerns about noise and annotation quality.
- The structure of PoseScript with MotionScript is a data-level decomposition; the LLM planner is not grounded in physically valid constraints during generation, contradicting the claimed TAMP formulation.
- The extremely long-horizon results rely heavily on truncation and indirect evaluation metrics. Physical plausibility and task completion demonstrations are limited and lack real-world testing or robotics integration.

Questions:
- Since the majority of annotations come from automated VLM captioning and pose estimation, how is annotation quality evaluated? And how sensitive is model performance to annotation errors?
- The LLM planner is said to enforce physical feasibility (CT). How is this concretely implemented? Is there any explicit mechanism in the model to ensure transitions are executable before they are fed to the diffusion model?
- How is “success” defined for behaviors with abstract semantics, e.g., dancing happily?
- Are baselines given text annotations consistent with their original training domains?
- Does the system degrade gracefully with longer horizons, e.g., multi-minute outputs?

EditLens Prediction: Fully AI-generated