The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
This paper addresses the limited generalization of current 3D human motion generation models and proposes a unified framework that transfers knowledge from video generation models. The authors introduce ViMoGen-228K, a large-scale dataset combining high-fidelity MoCap data, in-the-wild video motions, and synthetic ViGen-generated samples to enhance semantic diversity. In addition, they present a flow-matching diffusion-based model with adaptive gating to fuse MoCap and ViGen priors. Finally, an evaluation benchmark, MBench, is proposed for comprehensive evaluation.
Strengths:
* The paper addresses the generalization limitation in 3D human motion generation by offering a comprehensive solution across data, models, and evaluation benchmarks.
* The paper is well written and easy to follow, with a well-organized structure.
* The authors introduce ViMoGen-228K, a large-scale and diverse motion dataset, and MBench, a fine-grained evaluation benchmark. These provide training and evaluation resources for the field of motion generation and should help advance development in this domain.
Weaknesses:
* Although the authors claim strong generalization, the paper does not thoroughly examine where the model fails (e.g., in highly dynamic or multi-person scenarios). An analysis of such failure cases is expected.
* While the model excels at generalization, there appears to be a trade-off: it does not outperform all baselines on certain motion quality metrics such as dynamic degree. Is there any solution to alleviate this problem?
* MBench relies heavily on VLM-based automatic scoring and curated prompts. I am curious about the metrics' robustness and sensitivity. Have the authors analyzed these factors?
Questions:
Please refer to the weaknesses section.
Lightly AI-edited
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
This paper proposes a comprehensive framework spanning data, models, and evaluation to improve the generalization capability of 3D human motion generation (MoGen), leveraging prior knowledge from video generation (ViGen) models.
For data, the authors built ViMoGen-228K, a large-scale dataset with 228,000 motion clips. It innovatively fuses high-fidelity optical motion capture (MoCap) data, diverse data from web videos, and long-tail data synthesized by Video Generation (ViGen) models. This wide variety of actions helps improve the generalization of MoGen.
For the model, the authors proposed ViMoGen, a flow-matching-based diffusion Transformer. It uses a novel gated dual-branch (T2M and M2M) architecture to adaptively unify the quality priors from MoCap data and the generalization priors from ViGen models. A distilled, efficient version, ViMoGen-light, is also provided.
For evaluation, the authors designed MBench, a new hierarchical benchmark for the comprehensive and fine-grained assessment of motion quality, text-motion consistency, and especially generalization capability.
Strengths:
Originality:
1. To expand semantic coverage, the dataset leverages a Video Generation (ViGen) model to synthesize long-tail motion data, which is then integrated with traditional MoCap data.
2. The paper introduces MBench, a hierarchical benchmark meticulously designed for the fine-grained evaluation of motion generalization capabilities, featuring a curated open-world vocabulary.
Quality:
1. The paper presents a complete and systematic solution—spanning data collection, model architecture, and a novel evaluation benchmark—demonstrating a thorough and solid investigation.
2. The method's effectiveness is rigorously validated through extensive experiments, including comprehensive comparisons against state-of-the-art (SOTA) methods and detailed ablation studies. Furthermore, the MBench metrics are corroborated by a large-scale human preference study, ensuring their alignment with human judgment.
Clarity:
1. The paper is well-structured and clearly delineates its three core contributions: the dataset (ViMoGen-228K), the model (ViMoGen), and the benchmark (MBench).
2. The manuscript includes high-quality figures that effectively aid comprehension. For instance, Figure 1 clearly illustrates the overall framework and comparative radar charts; Figure 2 intuitively presents the model's dual-branch gated architecture; and Figure 3 vividly details the evaluation dimensions of MBench.
Significance:
1. This work identifies and addresses a critical bottleneck in the field of 3D human motion generation: poor generalization capability. It proposes an effective and integrated solution to tackle this fundamental challenge.
2. The authors commit to the public release of their code, the ViMoGen-228K dataset, and the MBench benchmark. These artifacts will serve as a valuable public resource, poised to stimulate further research and development in this area.
Weaknesses:
1. ViMoGen-228K leverages a mixture of three data sources: high-fidelity MoCap, alongside in-the-wild and synthetic videos. The filtering process for the in-the-wild data yields an extremely low retention rate (a reduction from 60M to 40k clips). This raises the question of whether in-the-wild video data is inherently unsuitable for human motion extraction. Compounding this, based on Table 5, the "Visual MoCap Data" component does not appear to yield significant improvements.
2. Regarding the generalization capabilities discussed (relative to ViGen), the 228K-clip (370-hour) dataset used for MoGen is still considerably smaller than ViGen's pre-training corpus. How do the authors view MoGen's generalization capability compared with ViGen's?
3. The primary quantitative results (Table 2) are reported on MBench, a benchmark concurrently introduced by the authors. While MBench appears to be well-designed, this presents a potential risk: namely, that the new data and the proposed model may have "overfit" to the specific evaluation criteria of this new benchmark.
Questions:
1. Regarding the data filtering pipeline, neither the main paper nor Appendix D.2 provides a detailed procedure. Key details are missing, such as the specific quality-assessment filters employed and the retention ratios (or absolute clip counts) at each filtering stage. This lack of transparency makes the drastic data reduction (from 10M to 40k clips) hard to interpret and raises concerns.
2. In Appendix D.2.3 (Synthetic Data), the authors state they "compiled a list of action verbs and descriptive nouns."
It is recommended that the authors provide this complete list in the appendix and, crucially, conduct a comparative analysis against the vocabulary used in MBench, to ensure the generation vocabulary does not overly align with the benchmark (see the sketch after this list for the kind of overlap statistic that would suffice).
Additionally, clarification is needed on how potentially ambiguous verbs (e.g., "have") were programmatically handled or disambiguated during text prompt generation.
3. ViMoGen-light: Table 2 reveals a substantial performance gap between ViMoGen-light and the full ViMoGen. The primary source of this discrepancy is not clearly analyzed.
Furthermore, the distillation procedure (referred to as "distill") used to create ViMoGen-light is critically underspecified. What specific distillation methodology was employed (e.g., DMD2 or another approach)?
What efficiency benefits does ViMoGen-light provide (e.g., reduced computation cost or inference time)?
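To make the vocabulary comparison requested in question 2 concrete, even a simple lexical-overlap statistic between the two word lists would suffice; a minimal sketch is given below (the file names and one-term-per-line format are placeholders I am assuming, not artifacts from the paper):

```python
# Hypothetical sketch: quantify overlap between the synthesis vocabulary and the MBench vocabulary.
# "synthesis_verbs.txt" and "mbench_vocab.txt" are placeholder file names, one term per line.
def load_vocab(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

synthesis_vocab = load_vocab("synthesis_verbs.txt")
mbench_vocab = load_vocab("mbench_vocab.txt")

shared = synthesis_vocab & mbench_vocab
jaccard = len(shared) / len(synthesis_vocab | mbench_vocab)
print(f"shared terms: {len(shared)}, Jaccard similarity: {jaccard:.3f}")
```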
Lightly AI-edited
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
The authors aim to bridge video generation and 3D human motion generation by proposing a new large-scale dataset, ViMoGen-228K; a novel diffusion transformer, ViMoGen, that unifies video priors and MoCap priors; and a new benchmark, MBench, for evaluating motion generalizability, motion-condition consistency, and motion quality.
Strengths:
1. Contributes a large-scale motion dataset with 228,000 motion sequences, including high-fidelity MoCap data and automatically annotated motions from web videos and synthesized videos, which is valuable for pushing forward the field of motion generation.
1. A commendable effort to provide a robust and quantifiable evaluation benchmark for motion generalizability and motion-condition consistency. The proposed model based on gated multimodal conditioning is shown to be effective on the proposed benchmark.
Weaknesses:
1. Poor presentation of qualitative results. Take Fig. 4 for example: for several samples, the SMPL frame shots overlap heavily and the resolution is too low for readers to compare visual quality, e.g., doing jumping jacks, martial arts, climbing a ladder. Similar issues can be observed in the qualitative figures throughout the paper. Also, the axis or reference for the temporal direction should be indicated alongside.
1. No demo videos are provided. Given the limitations of static frame-shot figures, it is crucial to provide convincing video demos of the generated samples and dataset samples to justify the qualitative comparison. However, the authors have not provided any video demos in the supplementary materials.
1. In the proposed gated diffusion block, the T2M branch seems to be entirely parallel to the M2M branch, performing different tasks with only one shared self-attention layer. Moreover, the adaptive branch selection at inference relies heavily on alignment scores from an external VLM to determine which branch to use. It is therefore questionable whether the two branches really benefit from each other in learning the underlying distribution, or whether the performance gain comes mainly from the external VLM's assessment and selection of which branch to proceed with.
1. The training pipeline is not clear. The curriculum approach mentioned in lines 204-207 deserves elaboration:
- "we simulate video motion references by perturbing ground-truth motions with controlled noise." — What is the controlled noise mentioned here? It is also unclear to me how perturbing the ground-truth motions can simulate video motion references (see the sketch after this list for my reading of this step).
- For training with text-motion pairs from the high-quality MoCap data, would there still be a z_video for the M2M branch? If so, how is the alignment between z_video and the ground-truth high-quality motion ensured? A video generated from the same text may contain a motion entirely different from the ground-truth motion.
1. The details of the inference phase are not clear either:
- How is it determined whether the alignment score from the VLM is high or low? Through a scalar threshold?
- What are the differences and advantages of using the M2M branch, compared with extracting the motion directly from the generated videos and refining it with a SOTA motion refinement method? Are there any empirical results on this?
1. The benchmark metrics have not been fully evaluated for alignment with human perception. Although Appendix C.1.2 presents alignment results for temporal motion quality, no human preference study is provided for the more abstract VLM-based metrics (motion generalizability and motion-condition consistency).
1. The frame-wise quality metric seems to replace the old distribution-based metrics (e.g., FID) with another "new" distribution-based metric, except that the definition of the underlying distribution differs. Why is this setup better? Moreover, are there any empirical results showing that this set of motion quality metrics is more reliable and robust than the previous ones?
1. The baselines and ablations in Table 3 are not very convincing. Practically, the M2M branch plays the role of a motion refinement module in this context, so the compared baselines should additionally include text-to-video + a SOTA motion refinement method, to clearly evaluate the contribution of the M2M branch.
1. To fairly ablate the impact of the external VLM assessment during inference, there should also be an ablation that does not use the gated diffusion block and instead uses only the external VLM alignment scores to select between a SOTA T2M model and a SOTA text-to-video + motion refinement pipeline, plus two further ablations replacing one of those two models with the T2M or M2M branch, respectively.
1. Please pay attention to typos, e.g., line 110 and line 363.
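(Sketch referenced in point 4 above. This is only my guess at what "perturbing ground-truth motions with controlled noise" could mean; the Gaussian noise model, the `sigma` value, and the encoder call are my assumptions, not details taken from the paper.)

```python
import torch

def simulate_video_motion_reference(gt_motion: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Corrupt a ground-truth motion (frames x features) with Gaussian noise of controlled
    magnitude, so that it mimics the jitter/inaccuracy of motion recovered from generated videos."""
    return gt_motion + sigma * torch.randn_like(gt_motion)

# During training, the M2M branch would then presumably be conditioned on this perturbed
# motion instead of a motion actually extracted from a ViGen output, e.g.:
# z_video = motion_encoder(simulate_video_motion_reference(gt_motion))
```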
Questions:
Please refer to the weaknesses section for details. If the identified issues and questions are properly addressed, I would consider raising my score.
Fully human-written
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
This paper proposes a comprehensive framework to enhance generalization in 3D human motion generation by transferring knowledge from video generation.
The contributions are threefold.
1) The ViMoGen-228K dataset is introduced, integrating 228,000 high-quality motion samples from optical MoCap, web videos, and synthetic ViGen data to expand semantic diversity.
2) The ViMoGen model is presented, a diffusion transformer trained with flow matching that unifies MoCap and ViGen priors via gated multimodal conditioning, alongside a distilled variant, ViMoGen-light, for efficient inference.
3) MBench, a hierarchical benchmark, is developed for fine-grained evaluation across motion quality, prompt fidelity, and generalization. Experiments demonstrate superior performance over existing methods.
The strengths of this paper lie in the following aspects:
1) The construction of the ViMoGen-228K dataset is encouraging, combining high-fidelity optical MoCap data with semantically diverse motions from in-the-wild and synthetically generated videos.
The multi-stage filtering pipeline strikes a balance between motion quality and semantic coverage.
2) The design of the ViMoGen model's gated, dual-branch diffusion transformer is interesting as well. The adaptive selection mechanism between the text-to-motion and motion-to-motion branches dynamically balances high-quality MoCap priors with the broad semantic knowledge from ViGen models.
3) The paper is easy to understand and follow, with a straightforward presentation of both the dataset and the methodology.
4) As for the experiments, I note the empirical gains on MBench: higher motion-condition consistency (0.53) and generalizability (0.68) than MDM, T2M-GPT, MotionLCM, and MoMask; reduced jitter (0.0108) and foot sliding (0.0064), with analyzed trade-offs in dynamic degree.
My major concerns are as follows:
1) This paper lacks key comparisons with relevant methods:
(i) MotionCraft (Bian et al., 2025), which shows state-of-the-art text-to-motion performance on the HumanML3D subset of the Motion-X dataset (Lin et al., 2023);
(ii) FineMoGen (Zhang et al., 2023);
(iii) MotionDiffuse (Zhang et al., 2024); the latter two are also representative recent approaches and show promising results.
In addition, the comparison is conducted only on the proposed MBench benchmark. How does the proposed approach perform on widely adopted benchmarks, e.g., text-to-motion on the HumanML3D subset of the Motion-X dataset, speech-based motion generation on the BEAT2 dataset, and music-based motion generation on FineDance?
2) The topic and focus of this paper are too broad. Motion generation has many aspects (speech-based, text-to-motion, music-based); I recommend that the authors narrow the focus and scope and articulate the advantages of and differences from previously proposed approaches.
References:
[1] Yuxuan Bian, Ailing Zeng, Xuan Ju, Xian Liu, Zhaoyang Zhang, Wei Liu, and Qiang Xu. Motioncraft: Crafting whole-body motion with plug-and-play multimodal controls. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 1880–1888, 2025.
[2] Mingyuan Zhang, Huirong Li, Zhongang Cai, Jiawei Ren, Lei Yang, and Ziwei Liu. Finemogen: Fine-grained spatio-temporal motion generation and editing. Advances in Neural Information Processing Systems, 36: 13981–13992, 2023c.
[3] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6):4115–4128, 2024b.
Some other technical problems:
1) There may be potential data contamination and selection bias: (i) synthetic video motions are "refined" with a pre-trained ViMoGen M2M branch that is later used for training, creating circularity; (ii) the optical MoCap subset selection is optimized using MBench, risking overfitting the training data to the proposed evaluation.
2) How robust is the gating mechanism? Branch selection depends on VLM-based alignment to the video-derived motion, yet training substitutes real video references with noise-perturbed GT motions, introducing a distribution gap between training and inference.
3) How valid are the text annotations? Large portions of the text labels are produced by an MLLM from rendered depth/RGB frames with only heuristic filtering; no inter-annotator agreement or systematic quality audit is provided, risking noisy supervision for text-motion alignment.
4) Dataset provenance and ethics should be clarified: the internal video pool and the synthetic-data prompts are derived from large external caption corpora; consent, licensing of the source material, and redistribution rights for the derivative 3D motions are insufficiently detailed.
Questions:
Please see the weaknesses section.
Fully human-written |