Planning at Inference: MCTS Test-Time Scaling for Long Video Generation
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
This paper frames long video generation as sequential decision making and proposes test-time search with Monte Carlo Tree Search (MCTS) to plan over chunked continuations, guided by a process reward model for local chunk quality and an outcome reward model that aggregates scores over the full sequence. The method is model-agnostic, sits on top of existing backbones without retraining, and introduces a multi-tree variant to widen exploration in continuous spaces. Across several generators, the approach improves temporal consistency and object permanence relative to autoregressive decoding, Best-of-N, greedy, and beam search, and produces longer videos of competitive quality when compared, qualitatively and with automated metrics, to recent long-video systems. The paper provides algorithmic details, ablations on compute budget, and comparisons of single-tree versus multi-tree search, while also acknowledging dependencies on the underlying generator and verifier quality.
Strengths:
1. Clear formulation of long video generation as planning with MCTS, including a walkthrough of selection, expansion, rollout, and backpropagation, plus an explicit UCB objective (a minimal sketch of the selection step follows this list).
2. Multi-tree search broadens exploration under a fixed branching factor and empirically outperforms single-tree search for the same budget.
3. Practical, plug-and-play recipe that does not require retraining, which increases its utility for current systems constrained by backbone quality.
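For readers unfamiliar with the UCB objective referenced in point 1, here is a minimal sketch of MCTS selection in this setting; the class layout, the constant `c_explore`, and the bookkeeping names are illustrative assumptions, not the paper's exact formulation:

```python
import math

class Node:
    """Illustrative MCTS node; each child is a candidate next video chunk."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0        # N(s): how often this node was traversed
        self.value_sum = 0.0   # accumulated reward from rollouts through s

    def ucb(self, c_explore=1.414):
        # Standard UCB1: exploit the mean reward, explore rarely visited nodes.
        if self.visits == 0:
            return float("inf")  # force at least one visit per child
        exploit = self.value_sum / self.visits
        explore = c_explore * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def select(root):
    """Descend from the root, always following the highest-UCB child."""
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.ucb())
    return node
```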
Weaknesses:
1. Heavy reliance on automated reward signals for both search guidance and evaluation, with the outcome reward defined as a simple sum over chunks, risks overfitting to verifier idiosyncrasies rather than capturing human preference on long-horizon coherence. A controlled human study is missing.
2. The exploration constant, branching factor, rollout policy, and beam initialization depth can strongly affect MCTS behavior. Sensitivity analysis is not comprehensive.
Questions:
1. How sensitive are the results to the weighting of VideoScore, CLIP alignment, and the LAION perceptual model in the process reward, and to defining the outcome reward as a sum rather than as a learned temporal model? (A sketch of the aggregation in question follows these questions.)
2. Under a fixed wall-clock budget and identical hardware, how does the method compare to beam and greedy search tuned for the same final runtime, including beam initialization time and rollout parallelism?
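To make question 1 concrete, the aggregation being asked about presumably looks something like the following; the scorer callables, their names, and the weighting scheme are hypothetical placeholders standing in for the paper's actual components:

```python
def process_reward(chunk, prompt, scorers, weights):
    """Hypothetical weighted combination of automated scorers (e.g.,
    VideoScore, CLIP alignment, a LAION perceptual model), supplied
    as callables; names and weights are placeholders."""
    return sum(w * s(chunk, prompt) for s, w in zip(scorers, weights))

def outcome_reward(chunks, prompt, scorers, weights):
    """Outcome reward as described in the review: a plain sum of
    per-chunk process rewards, with no learned temporal model that
    could penalize cross-chunk inconsistencies directly."""
    return sum(process_reward(c, prompt, scorers, weights) for c in chunks)
```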
Heavily AI-edited
---
Planning at Inference: MCTS Test-Time Scaling for Long Video Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
The paper proposes using MCTS for planning-based long video generation, which expands an important direction in test-time scaling (TTS). Through this approach, the paper even achieves long-video generation results that surpass closed-source SOTA models, demonstrating the potential of TTS for long video generation.
Strengths:
- The work has a degree of novelty and community value. The paper is the first to apply MCTS-based test-time scaling to long video generation, showcasing the value of classical methods in the video domain.
- The experimental results are impressive. The proposed method enables Cosmos-Predict2 to surpass or tie with closed-source SOTA models (Sora/Kling), which demonstrates the strong potential of TTS.
Weaknesses:
- Tab. 5 should include a comparison of computational cost.
- Regarding the long-video baselines, the paper would be more sound if a more comprehensive set were included [1,2]:
[1] FIFO-Diffusion: Generating Infinite Videos from Text without Training
[2] SkyReels-V2: Infinite-Length Film Generative Model
- The paper lacks discussion of and comparison with several recently accepted works on long-video generation:
[1] Zhao et al., RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers (ICML 2025).
[2] Tan et al., FreePCA: Integrating Consistency Information Across Long-Short Frames in Training-Free Long Video Generation via Principal Component Analysis (CVPR 2025).
[3] Lu et al., FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention (NeurIPS 2024).
[4] Cai et al., DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation (CVPR 2025).
Questions: See the Weaknesses section.
Lightly AI-edited
---
Planning at Inference: MCTS Test-Time Scaling for Long Video Generation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
The authors introduce a Multi-Tree MCTS variant that improves exploration in continuous generation spaces.
Strengths:
1. The paper is well written.
2. The Multi-Tree MCTS variant, which improves exploration in continuous generation spaces, is interesting (one plausible reading of the scheme is sketched after this list).
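The review does not spell out the multi-tree mechanism; under the assumption that it amounts to splitting a fixed search budget across independently seeded trees, a minimal sketch might look like this, with `run_single_tree` and `score_fn` as caller-supplied placeholders for a single-tree MCTS and the outcome verifier:

```python
def multi_tree_search(run_single_tree, score_fn, num_trees=4, budget=256):
    """Illustrative multi-tree scheme (an assumption, not the paper's
    exact algorithm): split a fixed iteration budget across several
    independently seeded trees to widen root-level exploration in a
    continuous space, then keep the best trajectory found overall."""
    per_tree = budget // num_trees
    candidates = [run_single_tree(seed=s, iters=per_tree) for s in range(num_trees)]
    return max(candidates, key=score_fn)
```

Under a fixed budget this trades search depth within one tree for diversity across trees, which is consistent with the claim elsewhere in the reviews that multi-tree search outperforms single-tree search at the same budget.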
Weaknesses:
1. I would like to know how long it takes to generate a 1-minute video with and without the proposed MCTS, together with a quantitative comparison of the results.
2. The biggest issue with video generation is excessive time consumption. This MCTS could make generating a long video take as long as 24 hours, potentially requiring 20 times more compute.
3. The method is difficult to implement well. Its biggest challenge is training accurate Process Reward and Outcome Reward Models. As is well known, video quality is hard to evaluate (evaluation error rates are high), and even slight errors from these two models could lead to large search errors.
4. MCTS is not robust to errors in the Process Reward Model and Outcome Reward Model.
5. I believe the authors should focus on improving the video model with reinforcement learning instead of using test-time scaling, as that would be a more efficient and practical solution.
Questions: See the Weaknesses section.
Lightly AI-edited |