|
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces **TTOM**, a training-free framework that enhances compositional text-to-video (T2V) generation through *test-time optimization (TTO)* and a *parametric memory* mechanism. TTOM aligns cross-attention maps of video foundation models (VFMs) with LLM-generated spatiotemporal layouts at inference time, improving text–video compositional alignment. Its parametric memory stores optimized parameters from previous prompts, enabling reuse, continual learning, and efficient streaming inference. Experiments on **T2V-CompBench** and **VBench** demonstrate significant improvements in compositional reasoning, semantic consistency, and visual quality over baselines such as CogVideoX‑5B and Wan2.1‑14B.
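For concreteness, the attention-alignment objective summarized above could take a form along the following lines (an illustrative guess on my part, not necessarily the paper's exact loss), where $A_{t,k}(p)$ denotes the cross-attention weight of object token $k$ at spatial position $p$ in frame $t$ and $B_{t,k}$ is its planned bounding box:

$$
\mathcal{L}_{\text{align}} = -\sum_{t}\sum_{k}\frac{\sum_{p\in B_{t,k}} A_{t,k}(p)}{\sum_{p} A_{t,k}(p)},
$$

i.e., the optimization pushes each object's attention mass into its layout box at every frame, updating only lightweight auxiliary parameters rather than the backbone.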
- Proposes a novel test-time optimization framework with memory-based reuse, extending inference-time adaptation for video generation in a principled way.
- Seamlessly integrates LLM-driven spatiotemporal layout planning for controllable, interpretable video synthesis.
- Achieves consistent quantitative and qualitative gains on major T2V benchmarks, including clear improvements in motion and numeracy.
- The design supports streaming inference and user-specific memory, aligning well with practical workflows.
- Comprehensive ablations, detailed comparisons, and visual evaluations substantiate the method’s effectiveness.
1. The method is largely empirical; the paper lacks theoretical explanation or analysis of optimization stability and of how the memory dynamics evolve over time.
2. Evaluation scenarios mainly focus on compositional prompts; testing under more complex, long-horizon, or open-domain conditions would strengthen generalization claims.
3. Dependence on large LLMs (e.g., GPT‑4o) for layout planning introduces computational and cost overheads that are not thoroughly discussed.
4. Runtime performance and latency analysis are insufficient; the paper should quantify the cost of test-time optimization steps and memory operations.
5. Some ablations, while numerically helpful, could benefit from visual examples or error diagnostics to clarify how memory and optimization complement each other.
See the weaknesses above. |
Fully AI-generated |
|
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a framework named TTOM (Test-Time Optimization and Memorization) for improving compositional text-to-video (T2V) generation. During inference, TTOM aligns video generation results with spatiotemporal layouts and introduces lightweight, optimizable parameters for adaptive per-sample updates. It further incorporates a parameterized memory mechanism that supports insertion, retrieval, update, and deletion operations, enabling the reuse of historical optimization context. Experimental results on T2V-CompBench and VBench demonstrate substantial gains, especially in motion, quantity, and spatial relation metrics.
However, the paper suffers from a serious conceptual confusion. Although TTOM is repeatedly described as a “training-free” framework, the core algorithm explicitly performs gradient-based parameter updates (e.g., optimizing LoRA weights) during inference. Such optimization is, by definition, a form of test-time training or test-time adaptation, not a training-free operation. Therefore, labeling the method as “training-free” is conceptually incorrect and misleading. The authors must clarify this terminology to ensure methodological and scientific accuracy.
1. The paper presents an innovative test-time compositional generation framework equipped with a streaming memory mechanism, distinguishing it from conventional per-sample optimization approaches. This enables continual adaptation and historical context reuse during inference.
2. By unifying per-sample test-time optimization with a parametric memory module, the method achieves continual adaptation and efficient context reuse across inference streams, representing a novel direction in test-time generative modeling.
3. Extensive experiments provide strong empirical evidence of its effectiveness, showing performance improvements of +34.45% and +15.83% over CogVideoX-5B and Wan2.1-14B on T2V-CompBench (Table 1), together with notable gains in semantic consistency on VBench.
4. (Caveat) While the idea of test-time optimization without fine-tuning is conceptually appealing, the misuse of the term “training-free” undermines the paper’s scientific rigor: any process involving gradient updates or parameter optimization cannot be described as training-free. This point is elaborated as Weakness 1 below; the term should be corrected in the abstract, introduction, and method sections to maintain terminological precision.
**1. Conceptual ambiguity in the “training-free” claim:**
The paper repeatedly claims that TTOM is “training-free,” yet the method explicitly introduces learnable parameters and updates them via gradient descent during inference. Regardless of whether this occurs at test time, such an optimization process constitutes training—albeit without external training data. Hence, “training-free” and “test-time optimization” are not interchangeable concepts.
This is a critical conceptual flaw. The authors should clearly distinguish among:
- training-free: no gradient updates at all;
- fine-tuning-free: no updates to the backbone parameters, but auxiliary modules may be trained;
- test-time optimization: gradient-based updates performed at inference for adaptation.
Since TTOM belongs to the third category, it should be explicitly defined as a test-time optimization–based framework, not a training-free one.
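To make the distinction concrete, a minimal sketch of such an inference-time update loop is given below (my own illustration, assuming LoRA adapters attached to the model's cross-attention layers; `model`, `scheduler`, `lora_params`, and `layout_alignment_loss` are hypothetical placeholders, not the authors' code):

```python
import torch

def generate_with_tto(model, scheduler, lora_params, prompt_emb, layouts,
                      num_steps=50, tto_steps=5, lr=1e-2):
    """Illustrative inference loop: only the LoRA parameters receive gradient updates."""
    optimizer = torch.optim.AdamW(lora_params, lr=lr)
    latents = torch.randn(model.latent_shape, device=model.device)

    for i, t in enumerate(scheduler.timesteps[:num_steps]):
        if i < tto_steps:  # optimize only the first few denoising steps
            noise_pred, attn_maps = model(latents, t, prompt_emb,
                                          return_attention=True)
            loss = layout_alignment_loss(attn_maps, layouts)  # hypothetical loss
            loss.backward()
            optimizer.step()       # gradient step performed at inference time
            optimizer.zero_grad()
        with torch.no_grad():
            noise_pred, _ = model(latents, t, prompt_emb, return_attention=True)
            latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

Because `optimizer.step()` applies gradient updates to parameters during inference, this procedure matches the definition of test-time optimization/adaptation rather than a training-free method.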
**2. Lack of theoretical analysis of the memory mechanism:**
The insertion and update strategies in Equation (3) lack mathematical formalization and convergence analysis. The paper provides only an implementation-level description, without proving or empirically testing stability, information retention, or forgetting dynamics under continual updates.
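As an example of the kind of formalization the paper could provide (purely illustrative notation of mine, not the authors'), the memory could be written as a keyed state $\mathcal{M}_t=\{(k_i,\theta_i)\}$ with an insert-or-merge transition:

$$
\mathcal{M}_t=
\begin{cases}
\mathcal{M}_{t-1}\cup\{(k_t,\theta_t^{*})\}, & \text{if } \max_i \operatorname{sim}(k_t,k_i)<\tau \quad\text{(insert)}\\[4pt]
\mathcal{M}_{t-1}\ \text{with}\ \theta_{i^{*}}\leftarrow(1-\eta)\,\theta_{i^{*}}+\eta\,\theta_t^{*}, & \text{otherwise},\ i^{*}=\arg\max_i \operatorname{sim}(k_t,k_i) \quad\text{(update)}
\end{cases}
$$

where $k_t$ is the prompt embedding, $\theta_t^{*}$ the per-prompt optimized LoRA parameters, and $\tau,\eta$ a similarity threshold and merge rate; stability and forgetting could then be discussed in terms of the boundedness of this EMA-style update. Whether Equation (3) admits such a form is for the authors to clarify.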
**3. Dependence on LLM-generated layouts:**
Section 3.1 heavily relies on GPT-4o for layout generation but lacks robustness evaluation. The paper does not quantify how common layout errors (e.g., bounding-box misalignment, temporal inconsistencies) influence test-time optimization or how the “verification step” mitigates error propagation.
**4. Insufficient evaluation of computational and storage overhead:**
The latency and storage implications of the memory system (especially when full) are not systematically reported. The paper lacks detailed measurements of optimization time, memory growth (LoRA parameter accumulation), and replacement cost under streaming conditions.
**5. Lack of justification for hyperparameter choices:**
The paper sets the LoRA rank to 32 and optimizes only the first five denoising steps without explaining why these configurations are near-optimal. No sensitivity analysis is provided for the LoRA rank, learning rate, or number of optimization steps, which limits insight into how well these choices generalize.
**1. Clarification of “training-free” Definition**
The authors must explicitly state what “training-free” means in this context. Since the method involves gradient updates, how does it fundamentally differ from test-time training or test-time adaptation?
**2. Mathematical Modeling and Stability of Memory**
Could the memory insertion, update, and deletion processes be formalized (e.g., via state transition equations) and analyzed for stability or boundedness?
**3. Memory Capacity and Replacement Policies**
When memory reaches capacity, how do different replacement policies (LRU, LFU, FIFO) affect performance, latency, and storage footprint? (A toy LRU sketch is given after this question list for concreteness.)
**4. Impact of Layout Errors on TTO**
How sensitive is TTOM to LLM-induced layout errors? How is the “verification step” implemented and evaluated quantitatively?
**5. Fault Tolerance and Error Accumulation**
How does TTOM prevent error propagation when mismatched memory entries are loaded or when the retrieved parameters are irrelevant?
**6. Fine-Grained Resource Analysis**
Could the authors provide detailed per-sample latency, GPU-hour costs, and LoRA parameter size during streaming?
**7. Hyperparameter Sensitivity**
Why were LoRA rank=32 and five denoising steps chosen? Are these settings robust across different backbones?
**8. Reproducibility and Randomness Control**
Please specify random seeds, GPT-4o prompt templates, and evaluation scripts for reproducibility.
**9. Failure Cases and Limitations**
What are the main failure modes (e.g., occlusion, excessive object count, ambiguous prompts)? |
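Regarding Question 3, a toy sketch of what an LRU-bounded parametric memory might look like is given below (illustrative only; the interface and thresholds are hypothetical, not the paper's implementation):

```python
from collections import OrderedDict

import torch.nn.functional as F


class ParametricMemory:
    """Illustrative LRU-bounded memory of per-prompt LoRA parameters.

    Keys are prompt embeddings, values are optimized parameter dicts.
    A sketch of one possible replacement policy, not the paper's design.
    """

    def __init__(self, capacity=100, sim_threshold=0.8):
        self.capacity = capacity
        self.sim_threshold = sim_threshold
        self.entries = OrderedDict()  # key_id -> (embedding, lora_state_dict)

    def retrieve(self, query_emb):
        """Return the closest stored LoRA state, or None on a miss."""
        best_id, best_sim = None, -1.0
        for key_id, (emb, _) in self.entries.items():
            sim = F.cosine_similarity(query_emb, emb, dim=0).item()
            if sim > best_sim:
                best_id, best_sim = key_id, sim
        if best_id is not None and best_sim >= self.sim_threshold:
            self.entries.move_to_end(best_id)  # mark as most recently used
            return self.entries[best_id][1]
        return None  # miss: the caller optimizes from scratch

    def insert(self, key_id, emb, lora_state):
        """Insert or refresh an entry, evicting the least recently used one if full."""
        if key_id in self.entries:
            self.entries.move_to_end(key_id)
        self.entries[key_id] = (emb, lora_state)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

Analogous LFU or FIFO variants would differ only in the eviction rule, so a head-to-head comparison of these policies under streaming load would be informative.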
Fully AI-generated |
|
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a method to enhance the alignment capability of compositional text-to-video generation during the inference phase through test-time optimization and a memory mechanism.
1. This paper introduces the TTOM framework, which combines test-time optimization with a parametric memory mechanism, enabling dynamic adjustment of model parameters at inference time and retaining historical optimization results for reuse with similar future prompts.
2. This approach not only improves text–video alignment in compositional scenarios but also supports continual learning and personalized generation, demonstrating strong practicality and scalability. On mainstream benchmarks such as T2V-CompBench and VBench, TTOM significantly outperforms existing approaches across multiple key metrics (e.g., motion, numeracy, spatial relations), achieving overall improvements of up to 34% and 15% over CogVideoX-5B and Wan2.1-14B, respectively, validating its effectiveness and generalization capability.
1. The effectiveness of the method heavily relies on the quality of the spatiotemporal layouts generated by the LLM. If the bounding box sequences produced by the LLM are inaccurate, subsequent attention alignment optimization may fail or even introduce misleading guidance, compromising the quality of the generated content.
2. Test-time optimization requires extra gradient computations and parameter updates, which significantly increase inference latency, especially during continuous optimization and memory retrieval. Although the paper mentions that optimization is applied to the first five denoising steps of the diffusion process to partially mitigate this issue, its acceptability in real-world streaming generation scenarios still requires further validation.
Please refer to the weaknesses. |
Lightly AI-edited |
|
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
TTOM is a training-free framework for compositional text-to-video generation. It aligns model attention with LLM-generated spatiotemporal layouts via test-time optimization and stores optimized parameters in a parametric memory for reuse. This design enables continual adaptation across prompts, yielding significant improvements in motion, spatial alignment, and semantic consistency.
1. Clear motivation: the paper addresses a key limitation of video foundation models by improving their compositional understanding of objects, relations, attributes, and temporal dynamics within a realistic streaming generation setting.
2. Parametric memory flexibly manages prompt-level optimization results, enabling efficient transfer and adaptive refinement across prompts.
1. Line 54 repeats the phrase “in this paradigm” (typo).
2. In the Figure 1 comparisons, the baseline also produces outputs that are partially semantically correct and sometimes of higher visual quality, for instance in the case of “An elderly man walking to the right in a sunny park.”
3. TTOM relies heavily on the layout quality produced by the LLM, assuming that the generated spatiotemporal layouts are accurate and consistent. However, layout hallucinations or spatial inaccuracies in the LLM outputs can directly cause alignment errors, misleading the optimization process and leading to degraded results.
4. There exists a structural trade-off between efficiency and quality. Continuous TTO introduces significant latency, while memory mechanisms can reduce computational overhead but require a warm-up phase to accumulate sufficient samples. Consequently, the deployment barrier remains high for online or low-latency applications such as real-time interactive editing.
1. How strongly does TTOM depend on the LLM’s prompt design, model choice, or layout verification? When the LLM produces noisy or uncertain spatiotemporal layouts, can TTOM still perform reliably?
2. Has the system been tested for cases where outdated or irrelevant memories build up and start to hurt performance? Are there situations where a faulty memory causes negative transfer? I hope the authors can provide a comparison of successful and failure cases.
3. See the weaknesses above. |
Lightly AI-edited |