ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction    Count      Avg Rating  Avg Confidence  Avg Length (chars)
Fully AI-generated     2 (50%)    5.00        4.50            2798
Heavily AI-edited      1 (25%)    6.00        3.00            1965
Moderately AI-edited   0 (0%)     N/A         N/A             N/A
Lightly AI-edited      1 (25%)    8.00        4.00            1953
Fully human-written    0 (0%)     N/A         N/A             N/A
Total                  4 (100%)   6.00        4.00            2378
Review 1
Title: Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces Pusa V1.0, a novel approach for enhancing temporal control in pretrained video diffusion models through Vectorized Timestep Adaptation (VTA). The authors claim that VTA enables fine-grained, frame-level temporal manipulation by replacing the conventional scalar timestep with a vectorized version, allowing asynchronous frame evolution. This non-destructive adaptation preserves the base model's text-to-video (T2V) capabilities while unlocking zero-shot performance on tasks like image-to-video (I2V), start-end frames, and video extension. The work emphasizes unprecedented efficiency, achieving state-of-the-art (SOTA) I2V results with minimal data (4K samples) and compute cost, compared to resource-intensive baselines like Wan-I2V. By leveraging flow matching and a lightweight fine-tuning strategy, Pusa V1.0 aims to democratize high-fidelity video generation and establish a scalable paradigm for multi-task video generation.

Strengths:
I would like to highlight the following strong points of the proposed manuscript:
1. Novelty: The vectorized timestep concept, building on FVDM, is a creative extension that addresses key limitations in temporal modeling. The non-destructive VTA strategy is particularly innovative, as it avoids catastrophic forgetting and retains base model priors.
2. Efficiency: The paper demonstrates remarkable efficiency gains, with SOTA-level performance achieved using only 4K samples and low compute costs, making it accessible for broader research and industry applications.
3. Experimental results: Comprehensive evaluations on VBench-I2V, detailed ablation studies (e.g., LoRA configurations, inference steps), and mechanistic analyses (e.g., attention maps) provide strong empirical support. The inclusion of zero-shot multi-task results (e.g., start-end frames, video extension) further validates the method's versatility.
4. Reproducibility: The paper offers clear methodological details, including algorithms, hyperparameters, and training procedures, though reliance on proprietary base models (e.g., Wan-T2V) may pose minor barriers.

Weaknesses:
Among the weak points, I would focus on the following:
1. Some technical sections, such as the vectorized timestep embedding and per-frame modulation, could be explained more intuitively for readers unfamiliar with DiT architectures. The inference algorithm (Appendix B) lacks a thorough discussion of its theoretical underpinnings.
2. While benchmarks show SOTA performance, comparisons are primarily limited to open-source models; broader evaluation against recent proprietary models (e.g., Sora, Veo) would strengthen the claims.
3. The paper focuses on empirical results but provides limited theoretical analysis of why VTA avoids combinatorial explosion or how it generalizes across tasks.
4. Claims of "unprecedented efficiency" are compelling but could benefit from more context on real-world deployment challenges, such as latency or memory usage during inference.

Questions:
1. Could you elaborate on the theoretical justification for why vectorized timesteps avoid combinatorial explosion during training, especially given the high-dimensional temporal composition space?
2. How does VTA handle varying video lengths or frame rates in zero-shot settings, and are there limitations in temporal consistency for long sequences?
3. The paper mentions using LoRA for parameter-efficient training; what are the trade-offs between LoRA ranks and adaptation quality, and how was the optimal rank (512) determined?
4. In the inference algorithm, why is clamping the first frame's timestep to zero the default choice, and how does adding noise (e.g., τ¹ = 0.2·s) impact coherence quantitatively?
5. Could you provide more details on the dataset used for fine-tuning (e.g., diversity, resolution) and how it compares to datasets used by baselines to ensure fair evaluation?

EditLens Prediction: Fully AI-generated
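Editor's note: the inference question above (clamping the conditioning frame's timestep) can be made concrete with a small sketch. The following is a minimal, hypothetical illustration of vectorized-timestep I2V sampling with a flow-matching Euler sampler; the `model(x, tau)` interface, the linear step schedule, and the `first_frame_noise` handling are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of I2V sampling with a vectorized
# timestep: every frame carries its own noise level, and the conditioning
# frame's timestep is clamped to zero (or given light noise, e.g. 0.2*s).
# `model(x, tau)` is a hypothetical frame-aware velocity predictor.
import torch

def sample_i2v(model, cond_frame, num_frames=16, steps=10, first_frame_noise=0.0):
    C, H, W = cond_frame.shape                       # conditioning frame as a clean latent
    x = torch.randn(num_frames, C, H, W)             # all frames start from pure noise
    x[0] = cond_frame                                 # inject the conditioning frame
    schedule = torch.linspace(1.0, 0.0, steps + 1)    # global noise level s: 1 -> 0
    for i in range(steps):
        s, s_next = schedule[i].item(), schedule[i + 1].item()
        tau = torch.full((num_frames,), s)            # vectorized timestep, one entry per frame
        tau[0] = first_frame_noise * s                # clamp (or lightly noise) the first frame
        v = model(x, tau)                             # per-frame velocity prediction
        x = x + (s_next - s) * v                      # Euler step of the flow ODE
        if first_frame_noise == 0.0:
            x[0] = cond_frame                         # keep the clean frame fixed when clamped
    return x
```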
Review 2
Title: Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes Pusa V1.0, a lightweight adaptation for large pretrained text-to-video (T2V) diffusion models that replaces the scalar timestep with a vectorized timestep (one per frame) and learns with a frame-aware flow-matching objective. The key contribution is Vectorized Timestep Adaptation, a non-destructive modification that enhances the base model's generation ability. With minimal fine-tuning (LoRA or full FT) on ~3.9k Wan-T2V-generated clips, Pusa achieves SOTA-level image-to-video (I2V) results on VBench-I2V and further shows zero-shot behavior on other downstream tasks.

Strengths:
1. Simple, general idea with clear motivation. Moving from a scalar to a vectorized timestep directly attacks the synchronized-frame limitation of standard VDMs; the frame-aware flow-matching formulation is clean and well explained.
2. High efficiency. Comparable I2V performance to Wan-I2V with a tiny fraction of the compute/data, plus favorable results with only 10 sampling steps. The ablations (LoRA vs. full FT, timestep sampling) are helpful.
3. Unified capability. The same model handles I2V, start–end, and extension without task-specific heads or destructive fine-tuning.

Weaknesses:
1. Lack of quantitative evaluation for non-I2V tasks. While the authors mention zero-shot start–end generation and other applications of the method, quantitative evaluation for these applications is missing; most of the quantitative focus is on I2V with VBench-I2V, and concrete metrics are absent from the main text. It would be better for the authors to add quantitative results for these applications.
2. Data source and generalization. Fine-tuning uses videos generated by the same base video model. This could limit diversity and introduce bias toward the base model's distribution, and it is unclear how well Pusa generalizes to challenging real photos as I2V inputs beyond curated benchmarks. Could you provide some details on the ~3,860 Wan-T2V samples, and is any data filtering applied?

Questions:
See the weaknesses.

EditLens Prediction: Lightly AI-edited
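Editor's note: as a companion to the training objective mentioned in the summary above, here is a minimal sketch of what a frame-aware flow-matching loss with a vectorized timestep could look like. It applies the standard rectified-flow interpolation and velocity target per frame; the model signature and the uniform per-frame timestep sampling are assumptions for illustration, not the paper's exact objective.

```python
# Minimal sketch (assumed, not the paper's code) of a frame-aware
# flow-matching loss: each frame draws its own timestep, so frames sit at
# different points along their interpolation paths within one training step.
import torch

def frame_aware_fm_loss(model, video):                # video: (F, C, H, W) clean latents
    num_frames = video.shape[0]
    tau = torch.rand(num_frames)                       # independent timestep per frame
    noise = torch.randn_like(video)
    t = tau.view(num_frames, 1, 1, 1)
    x_t = (1.0 - t) * video + t * noise                # per-frame linear interpolation path
    target_v = noise - video                           # rectified-flow velocity target
    pred_v = model(x_t, tau)                           # conditioned on the vectorized timestep
    return torch.mean((pred_v - target_v) ** 2)
```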
Review 3
Title: Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper introduces Pusa V1.0, a method to adapt pretrained large-scale text-to-video diffusion models (e.g., Wan-T2V) to handle vectorized timesteps instead of a single scalar timestep. This Vectorized Timestep Adaptation (VTA) enables each frame to evolve independently, allowing fine-grained temporal control without destructively retraining or modifying the model. Pusa achieves state-of-the-art image-to-video (I2V) performance on the VBench-I2V benchmark using only 4K samples and 0.5K compute, while preserving the base model's text-to-video ability. It also exhibits zero-shot generalization to other temporal tasks (start–end frames, video extension). Mechanistic analyses and attention visualizations show that Pusa injects temporal dynamics non-destructively, maintaining pretrained priors while improving efficiency.

Strengths:
1. Turning scalar timesteps into vectorized ones for frame-wise control is conceptually elegant and integrates naturally into the diffusion pipeline.
2. Comparable or superior I2V results to Wan-I2V at orders-of-magnitude lower data and compute; clear quantitative and qualitative evidence.
3. Demonstrated versatility across I2V, start–end frames, video completion, and extension, suggesting a broadly useful paradigm.

Weaknesses:
1. The work builds heavily on FVDM; the contribution lies mainly in adaptation rather than in introducing new theory.
2. While the paper emphasizes efficiency (4K samples, 0.5K compute), it does not explore how performance scales with larger datasets or longer fine-tuning. It remains uncertain whether Pusa's gains plateau quickly due to its limited adaptation space, or whether additional data and compute could further improve quality and generalization.

Questions:
1. Can Pusa be extended to handle long-video generation (>128 frames) without retraining, or is the approach limited by the base model's context length?
2. How does the method perform on cross-domain generalization, e.g., artistic or high-motion datasets, compared with Wan-I2V?

EditLens Prediction: Heavily AI-edited
Review 4
Title: Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper presents Pusa V1.0, a method that introduces Vectorized Timestep Adaptation to pretrained text-to-video diffusion models. By replacing the traditional scalar timestep with a vectorized one, the model enables each frame to evolve independently, achieving fine-grained temporal control. The approach is non-destructive, preserving the base model's text-to-video capability while extending it to image-to-video, start-end frame generation, and video extension in a zero-shot manner.

Strengths:
1. The idea of vectorizing the timestep variable is conceptually clear and provides a unified mechanism for multiple temporal tasks without architectural changes.
2. The proposed non-destructive adaptation preserves the pretrained model's generative priors, avoiding catastrophic forgetting.
3. The analysis sections, including attention visualization and parameter drift studies, help explain why the approach works.

Weaknesses:
1. The evaluation mainly relies on VBench-I2V and qualitative visualization. Additional comparisons on broader benchmarks or user studies would strengthen the claims of generality and zero-shot capability.
2. Some methodological details, such as how vectorized timestep embeddings are fused with text conditioning or cross-attention, remain underexplained.
3. While the efficiency results are appealing, the paper would benefit from clearer discussions of potential trade-offs, for example, limits of per-frame desynchronization or artifacts under longer sequences.

Questions:
1. Could the authors clarify whether the same fine-tuning hyperparameters work across different base models, or if task-specific tuning is needed?
2. How does the approach handle temporal consistency when τ vectors are sampled completely at random during training?

EditLens Prediction: Fully AI-generated