Can Text-to-Video Models Generate Realistic Human Motion?
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper proposes Movo, a kinematics-centric benchmark asking whether text-to-video (T2V) systems generate biomechanically realistic human motion. Movo couples (i) a posture-focused dataset of 10 actions (six lower-body, four upper-body) with camera-aware prompts, (ii) three skeletal-space metrics—Joint Angle Change (JAC), Dynamic Time Warping (DTW), and a binary Motion Consistency Metric (MCM) judged by an MLLM, and (iii) human validation via pairwise preferences. Using these, the authors evaluate 14 open and proprietary models and report high metric–human correlations on several actions.
The paper is well-motivated: it highlights that many T2V clips “look right but move wrong,” and it argues convincingly that existing leaderboards over-reward pixel-space smoothness and text alignment while missing kinematics, rhythm, and camera-motion disentanglement—gaps that matter for realistic human movement. Methodologically, the benchmark is body-centric and interpretable: JAC targets joint-angle trajectories, DTW measures temporal phase/rhythm alignment in pose space, and MCM checks high-level motion consistency, making the evaluation actionable for diagnosing foot-slide, contact violations, or off-phase coordination. The authors run human validation and report strong correlations between Movo scores and pairwise human preferences across multiple actions (e.g., Walking ρ≈0.99), lending credence to the metrics. The experimental setup is transparent: the pipeline detects people with YOLO-X, extracts skeletons with RTMPose (including hands when needed), and fixes seeds/hyperparameters for open models.
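For concreteness, the kind of skeletal-space comparison these metrics perform can be sketched as below. This is a minimal sketch, not the paper's exact formulation: the joint triplets, the angle features, and the length normalization are illustrative assumptions, and `gen_kpts`/`ref_kpts` stand in for keypoints extracted from a generated and a reference video.

```python
import numpy as np

# Hypothetical joint triplets over COCO-style keypoints (angle measured at the
# middle joint): left/right knee and left/right elbow. These choices are
# illustrative, not Movo's exact definition.
TRIPLETS = [(11, 13, 15), (12, 14, 16), (5, 7, 9), (6, 8, 10)]

def joint_angles(kpts):
    """kpts: (T, J, 2) 2D keypoints -> (T, len(TRIPLETS)) joint angles in radians."""
    angles = []
    for a, b, c in TRIPLETS:
        v1 = kpts[:, a] - kpts[:, b]
        v2 = kpts[:, c] - kpts[:, b]
        cos = np.sum(v1 * v2, axis=-1) / (
            np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1) + 1e-8)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.stack(angles, axis=-1)

def dtw_distance(x, y):
    """Plain O(T1*T2) dynamic time warping between two angle trajectories."""
    t1, t2 = len(x), len(y)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[t1, t2] / (t1 + t2)  # length-normalized alignment cost

# Usage (keypoints would come from the detection/pose stage, e.g. RTMPose):
# score = dtw_distance(joint_angles(gen_kpts), joint_angles(ref_kpts))
```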
1. By design, Movo focuses on skeletal kinematics and rhythm, giving it a narrow scope relative to general-purpose suites (e.g., VBench). Given that focus, the evaluation metrics and test set should be as comprehensive as possible for human videos. However, all three proposed metrics operate on detected skeletons, so systematic pose-estimation errors (from occlusion, clothing, or unusual viewpoints) propagate directly into the scores. In addition, MCM is a binary MLLM judgment (“similar”/“not similar”), which the authors acknowledge can mask subtle fidelity gaps; such discretization reduces sensitivity and may be unstable across prompts and models. Moreover, dataset coverage is limited and may not represent “human motion” broadly: the evaluation set consists of ten exercise-style actions (deadlift, squat, walking, etc.), a consciously simplified taxonomy that the authors justify but that excludes many everyday or multi-agent motions (sitting/standing transitions, dancing with turns, interactions, sports with equipment), raising questions about representativeness. The camera-aware prompts further restrict the range of camera dynamics covered, even though many T2V systems must handle moving cameras.
2. Comparisons across models are uneven: Sora was evaluated on only 10 prompts per category (due to limited access), and Veo was accessed only through its hosted API with default settings, so some leaderboard conclusions are preliminary and the comparison is not strictly apples-to-apples.
3. Beyond running many open-source and commercial models, the paper does not provide much insight into how to train or improve T2V models on human videos, which makes its contribution less convincing.
Please see the weaknesses.
Moderately AI-edited
Can Text-to-Video Models Generate Realistic Human Motion?
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces MOVO, a kinematics-centric benchmark for evaluating human motion realism in text-to-video (T2V) models. MOVO includes a posture-focused dataset, three novel metrics (JAC, DTW, MCM), and human validation studies. The benchmark is applied to 14 T2V models, revealing gaps in biomechanical plausibility and temporal consistency. The work is timely and relevant, addressing critical shortcomings in existing T2V benchmarks.
- Addresses a critical gap in T2V evaluation—human motion realism.
- Introduces kinematics-aware (JAC), rhythm-sensitive (DTW), and structure-consistent (MCM) metrics.
- Limited motion diversity; e.g., the dataset lacks complex motions such as multi-person interactions.
- Camera-motion disentanglement is claimed but not clearly demonstrated.
- Lacks deeper insights into why models perform differently across actions.
Please see Weaknesses for details.
Moderately AI-edited
Can Text-to-Video Models Generate Realistic Human Motion?
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces Movo, a novel and much-needed benchmark for evaluating the realism of human motion in text-to-video (T2V) generation. The authors convincingly argue that current state-of-the-art T2V models, despite their impressive visual fidelity, often produce human movements that are biomechanically implausible, leading to artifacts like foot-sliding and unnatural joint articulation. They posit that existing benchmarks are ill-equipped to detect these flaws as they primarily focus on pixel-level consistency, prompt fidelity, and overall aesthetics, while ignoring the underlying kinematics.
Movo's main contributions are threefold:
A curated, posture-focused dataset with camera-aware prompts designed to isolate specific human motions and minimize confounding factors from camera movement.
A suite of three complementary, kinematics-centric metrics—Joint Angle Change (JAC), Dynamic Time Warping (DTW), and a Multi-modal LLM-based Motion Consistency Metric (MCM)—that evaluate motion from the perspectives of joint articulation, temporal rhythm, and semantic consistency, respectively.
An extensive human validation study that demonstrates a strong correlation between Movo's automated scores and human perception of motion realism, confirming the benchmark's efficacy.
By evaluating 14 leading T2V models, the paper provides a comprehensive snapshot of the current landscape, revealing systemic weaknesses in generating realistic human motion and highlighting the significance of the proposed evaluation paradigm.
(1) Originality and Significance: The paper's primary strength lies in its originality and high significance. It is, to my knowledge, the first work to propose a comprehensive, kinematics-centric benchmark for human motion realism in T2V. It fundamentally shifts the evaluation paradigm from "does it look good?" to "does it move correctly?". As T2V models are increasingly used to simulate reality, this work addresses a critical bottleneck for applications requiring physical and biological plausibility (e.g., synthetic data for robotics, sports analysis, AR/VR). Movo has the potential to become a standard benchmark in the field.
(2) Quality and Rigor: The quality of the research is outstanding. The benchmark is thoughtfully designed, from the careful taxonomy of the dataset to the multi-faceted metric suite. The execution of the experiments, involving 14 prominent models (including giants like Sora and Veo 3), is comprehensive and provides an invaluable service to the community. The strong human-in-the-loop validation solidifies the benchmark's credibility.
(3) Clarity: The paper is written with exceptional clarity. The authors articulate a complex problem and their sophisticated solution in a manner that is accessible yet detailed. The motivation is compelling, and the link between the identified problems and the proposed solutions is crystal clear.
(4) Actionable Insights: The results are not just a leaderboard; they provide actionable insights. For instance, the finding that models struggle with fine-grained lower-limb coordination or that DTW can expose rhythm drift even in visually smooth videos gives concrete directions for future model development.
(1) Dependency on Pose Estimator: The entire evaluation pipeline is contingent on the performance of the underlying pose estimator (RTMPose). T2V models can generate artifacts (e.g., blurred limbs, extra limbs) that might cause pose estimators to fail or produce noisy outputs. The paper does not discuss the potential impact of pose estimation errors on the final evaluation scores. A brief discussion on the robustness of RTMPose on generated content or an analysis of failure cases would strengthen the paper's claims of reliability.
(2) Lack of Analysis on Individual Metric Contribution: The paper shows a high correlation between the average of the three metrics and human scores. However, it does not provide an analysis of how each metric (JAC, DTW, MCM) individually correlates with human judgment. Such an analysis (a minimal sketch is given after this list) could reveal, for example, whether humans are more sensitive to incorrect joint angles (JAC) or poor rhythm (DTW), providing deeper insights into human perception of motion.
(3) The benchmark's core methodology is fundamentally limited by its reliance on a ground-truth reference video for its primary metrics (JAC and DTW). This introduces several critical flaws:
(a) Reduces Evaluation to Similarity Matching: Rather than truly assessing the plausibility of a generated motion, the metrics measure similarity to a reference. Consequently, Movo cannot evaluate the realism of novel prompts (e.g., "an astronaut doing a backflip on the moon") for which no reference video exists, restricting its scope to a predefined set of common actions.
(b) Creates a Single-Reference Bias: The approach penalizes plausible motion variations (e.g., differences in speed, style, or execution) simply because they deviate from the one chosen exemplar. This conflates stylistic difference with a lack of realism, potentially punishing valid and creative outputs.
(4) Details of the MCM "Judge": The Motion Consistency Metric (MCM) relies on a multi-modal LLM, and the reliability and potential biases of this "judge" matter: the MLLM may be swayed by factors such as photorealism or artistic style rather than the pure kinematics of the motion. This creates a risk that the metric rewards aesthetic alignment over biomechanical correctness.
(5) From "Standard Exercises" to "Everyday Motion": The benchmark is constructed around 10 specific fitness exercises. These are highly structured, often periodic activities with well-defined kinematic patterns. However, the paper’s title and conclusions aspire to a much grander goal. There is a substantial chasm between the biomechanics of a gym squat and the complex, unpredictable motions encountered in the real world. For example, motions such as a person slipping on a wet surface, a toddler learning to walk with unsteady steps, or two people navigating a crowded street are characterized by non-periodic, reactive, and interactive movements. These chaotic, emergent scenarios represent the true challenge for T2V models aiming to simulate reality, and the conclusions drawn from Movo's controlled environment may not generalize to these far more complex situations.
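As a minimal sketch of the per-metric analysis requested in weakness (2): assuming per-model scores for each metric and mean human preference rates have been aggregated into a small table (the file name and column layout below are hypothetical placeholders), the breakdown amounts to three Spearman correlations.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical CSV: one row per model; columns JAC, DTW, MCM, human preference rate.
jac, dtw, mcm, human = np.loadtxt("walking_scores.csv", delimiter=",", unpack=True)

for name, metric in [("JAC", jac), ("DTW", dtw), ("MCM", mcm)]:
    rho, p = spearmanr(metric, human)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p:.3f})")
```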
(1) On Pose Estimator Robustness: How did you handle cases where the RTMPose estimator might have failed or produced unreliable keypoints due to artifacts in the generated videos? Did you filter out such cases, and if so, how might this affect the overall model rankings? Could you comment on the sensitivity of your metrics to noise in the keypoint data?
(2) On Individual Metric Correlation: Could you provide a breakdown of the correlation with human scores for each of your three metrics (JAC, DTW, MCM) individually? This would be very insightful for understanding which aspects of motion realism are most salient to human observers and would further validate the contribution of each component of your metric suite.
(3) On Extending Movo: The current dataset focuses on well-defined, single-person fitness motions. Do you have plans or thoughts on how the Movo framework could be extended to evaluate more complex, less structured, or interactive motions, such as dancing or team sports, where realism is equally crucial but harder to define?
(4) On the MCM Metric: Could you provide a brief summary in the main text of the MLLM used for MCM and the core of its prompt? Given that different MLLMs can have different biases and capabilities, how did you ensure the consistency and reliability of this metric?
Fully AI-generated
Can Text-to-Video Models Generate Realistic Human Motion?
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces Movo, a new benchmark for evaluating the realism of human motion in videos generated by text-to-video (T2V) models. Movo consists of three main components: a "posture-focused" dataset with prompts designed to isolate specific human actions, a set of "skeletal-space" metrics (JAC, DTW, and MCM) to quantify motion realism, and human validation studies to correlate these metrics with human perception. The paper evaluates 14 T2V models using the Movo benchmark and finds that while some models excel at specific motions, there are still significant gaps in generating consistently realistic human movements.
1. This paper is well written and easy to follow.
2. The Movo benchmark is well-designed and comprehensive. The three proposed metrics—Joint Angle Change (JAC), Dynamic Time Warping (DTW), and Motion Consistency Metric (MCM)—provide a multi-faceted approach to evaluating motion realism, capturing different aspects from joint articulation to temporal consistency.
1. The Movo dataset, while a good starting point, is limited to a relatively small set of 10 different human motions. This may not be representative of the full range of human movements, and it would be beneficial to expand the dataset to include a more diverse set of actions in future work.
2. The proposed metrics rely on the output of a pose estimation model to extract skeletal keypoints from the generated videos. The accuracy of these metrics is therefore dependent on the accuracy of the pose estimation model. It would be valuable to analyze the sensitivity of the Movo benchmark to errors in pose estimation (a sketch of such a check follows this list) and to consider alternative approaches that are less reliant on this intermediate step.
3. The MCM is a binary metric that simply indicates whether a multi-modal large language model (MLLM) judges two videos as having "similar" or "not similar" motion. This is a rather coarse measure of motion consistency, and it would be beneficial to develop a more nuanced metric that can capture the degree of similarity or dissimilarity between two motions.
4. The paper does not provide many details about the MLLM used for the MCM, other than it being a "multi-modal large language model." The specific model used and the prompts provided to it could significantly influence the results. More transparency on this aspect would strengthen the reproducibility of the work.
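One way to make the sensitivity concern in point 2 concrete is a perturbation check: re-score the same videos after adding increasing keypoint noise and see how much the metric values and model rankings move. The sketch below assumes the skeletal metrics are callable on keypoint arrays; the function names and shapes are hypothetical.

```python
import numpy as np

def perturb(kpts, sigma, rng):
    """Add isotropic Gaussian noise (in pixels) to 2D keypoints of shape (T, J, 2)."""
    return kpts + rng.normal(scale=sigma, size=kpts.shape)

def robustness_curve(score_fn, gen_kpts, ref_kpts, sigmas=(0, 1, 2, 4, 8), seed=0):
    """score_fn stands in for any skeletal metric, e.g. a JAC or DTW computation."""
    rng = np.random.default_rng(seed)
    return {s: score_fn(perturb(gen_kpts, s, rng), ref_kpts) for s in sigmas}
```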
1. The paper mentions the use of Gemini-2.5 Pro and GPT-4o for generating and refining video descriptions. Could the authors elaborate on the specific roles of each model in this process and provide more details on the prompts used to guide these models?
2. The human validation study is a crucial part of the paper. Could the authors provide more information about the demographics of the human annotators and the instructions they were given? Were the annotators experts in biomechanics or motion analysis?
3. How robust are the proposed metrics to variations in video quality, such as compression artifacts or motion blur? Have the authors conducted any experiments to evaluate the performance of the Movo benchmark under such conditions?
4. The paper evaluates a number of proprietary, closed-source T2V models, including Sora. Given the limited access to these models, how did the authors ensure a fair and comprehensive evaluation? Could the authors provide more details on the methodology used to generate videos from these models?
Fully AI-generated