PISCES: Annotation-free Text-to-Video Post-Training via Bi-objective OT-aligned Rewards
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces PISCES, a novel annotation-free post-training framework for text-to-video (T2V) diffusion models, designed to improve text-video alignment. It proposes two distinct rewards: (1) an OT-aligned Quality Reward, which aligns text and real-video embeddings to improve visual quality, and (2) an OT-aligned Semantic Reward, which enforces spatio-temporal correspondence between text tokens and video patches using an OT plan. These rewards are optimized alongside the Consistency Distillation (CD) loss. The authors demonstrate that PISCES outperforms existing annotation-free and even annotation-based methods on the VBench benchmark.
Strengths:
1. It successfully avoids the need for expensive, large-scale human preference datasets, which is a significant bottleneck for scaling T2V models.
2. The quality reward addresses the distributional misalignment between text and video embedding spaces; using a neural OT (NOT) map for this purpose is a novel way to improve video quality.
3. The proposed rewards demonstrate strong generalizability, remaining effective across two different base models and two different optimization paradigms.
Weaknesses:
1. The paper's core claim appears inconsistent between the main results (Table 1) and the ablation studies (Tables 3 and 5). Table 5 indicates that the PISCES results reported in Table 1 are obtained with GRPO. However, the baseline T2V-Turbo-v2, which also post-trains a video model using the CD loss with external rewards via direct-backpropagation optimization, reaches a Quality score of 83.93, outperforming the proposed method's result in Table 3 (83.73). This suggests that the performance gains seen in Table 1 may stem more from the GRPO optimization strategy than from the proposed OT rewards.
2. Algorithm 1 states 'perform ODE solver from $t_{n+k}$ to 0', which implies a multi-step denoising process to generate the clean video; backpropagating through such a process would be computationally prohibitive. This notation also contradicts Appendix A, which claims single-step denoising is used for efficient backpropagation. The ambiguity obscures the exact gradient-flow mechanism.
3. The proposed method requires (1) pre-training a neural OT map and (2) solving a discrete, spatio-temporal OT problem inside the training loop. This likely introduces additional computational overhead, yet the paper provides no analysis of this extra cost, making it difficult to assess the practical trade-offs. In addition, the paper gives insufficient detail on the NOT training setup.
Questions:
1. In Table 3, the 'PISCES w/o OT' setting is marked as not using OT (the OT columns are marked 'x') while simultaneously using the rewards (the reward columns are marked 'v'). Is this a typo?
Lightly AI-edited
PISCES: Annotation-free Text-to-Video Post-Training via Bi-objective OT-aligned Rewards |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper presents PISCES, an annotation-free post-training framework for text-to-video (T2V) that replaces vanilla VLM-based reward signals with a Bi-objective OT-aligned Rewards module. The key idea is to align the reward embedding space to better reflect human judgments, addressing the observation that pre-trained VLM embeddings are misaligned across the text and video modalities. Concretely: (i) a distributional OT-aligned Quality Reward learns a neural OT map (NOT) T⋆ that projects text embeddings onto the real-video embedding manifold; quality is computed as the cosine similarity between the [CLS] representations T⋆(y[CLS]) and x̂[CLS]. (ii) a discrete token-level OT-aligned Semantic Reward computes a partial OT plan P⋆ over a spatio-temporal cost to align text tokens with relevant video patches and feeds the fused attention map to a VTM head to produce a semantic reward. The module supervises fine-tuning either via direct backpropagation through the reward models or via RL (GRPO). On VBench, PISCES improves both Quality and Semantic scores over annotation-free baselines (T2V-Turbo, -v2) and annotation-based methods (VideoReward-DPO, UnifiedReward), for both short-video (VideoCrafter2) and long-video (HunyuanVideo) generation. Human preferences also favor PISCES in visual quality, motion quality, and alignment.
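For reference, our reading of the quality reward is sketched below; the placeholder MLP standing in for T⋆, the embedding dimension, and all names are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

d = 1024  # assumed embedding dimension of the frozen video-language encoder

# Stand-in for the pre-trained neural OT map T*: any module that projects text
# [CLS] embeddings onto the real-video embedding manifold. The paper trains this
# with a neural OT objective; a plain MLP is used here purely as a placeholder.
T_star = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d)
)

def quality_reward(y_cls: torch.Tensor, x_hat_cls: torch.Tensor) -> torch.Tensor:
    """OT-aligned quality reward as we read it: cosine similarity between the
    OT-mapped text [CLS] embedding and the generated-video [CLS] embedding."""
    mapped = T_star(y_cls)  # T*(y[CLS]): text embedding mapped onto the video manifold
    return F.cosine_similarity(mapped, x_hat_cls, dim=-1)
```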
Strengths:
- Clear problem diagnosis: annotation-free reward methods depend on misaligned text/video embeddings; the paper substantiates this with Mutual KNN and Spearman rank results and t-SNE visualizations.
- Principled remedy using OT: (a) a learned distributional OT map T⋆ to align text embeddings to the real-video manifold; (b) a partial OT plan with spatio-temporal costs to enforce fine-grained token grounding. Both are well-motivated and integrated cleanly into post-training.
- Strong empirical evidence: consistent gains on VBench over both annotation-free (T2V-Turbo, v2) and annotation-based (VideoReward-DPO, UnifiedReward) post-training; results hold across two generators (VideoCrafter2/HunyuanVideo) and with both direct backprop and GRPO.
- Ablations are informative: OT vs. no-OT; quality vs. semantic reward; partial OT mass m; spatial/temporal penalties; analysis of stability under consistent seeds.
Weaknesses:
- Metric alignment risk: Rewards use InternVideo2 features; VBench and human studies show gains, but further measurement with independent evaluators (e.g., alternative video-text encoders or third-party human studies) would mitigate concerns of “reward overfitting” to the same representation family.
- Compute/efficiency: Learning NOT (OT map) and solving discrete POT per layer/head adds cost. Report training-time overhead and inference-time impact (if any) and compare to annotation-based pipelines.
- Scope of generalization: Results are on WebVid10M/VidGen-1M distribution. Evaluate on distinct domains (e.g., robotics prompts with compositional actions) and under harder OOD shifts to stress-test the alignment benefits.
- Design sensitivity: The spatio-temporal cost is hand-weighted (γ, η). More extensive sensitivity analysis or a learned weighting strategy could make it more robust.
Questions:
- Evaluator diversity: Can you add evaluations using a different video–text model (e.g., CLIP-based video encoders) to demonstrate that improvements are not specific to InternVideo2 features?
- Cost accounting: What is the wall-clock/GPU-day overhead of NOT and POT modules during post-training? Any measurable inference-time overhead when using GRPO-tuned models?
- Failure modes: Show qualitative failure cases where OT alignment harms outputs (e.g., biased mappings or spurious correlations). Any cases where partial OT mass m or constraints degrade alignment?
- Human preference details: Provide inter-rater agreement and demographic diversity; were raters blind to method identity? Any observed category-level biases (e.g., motion-heavy vs. static prompts)?
- Reward fusion: For direct backprop, how do you weight the rewards vs. the consistency loss? Any benefit to adaptive weighting guided by variance or confidence in reward estimates? (One possible instantiation of what we mean is sketched below.)
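The following is purely illustrative of variance-guided weighting; the class, its name, and the lambda weights are hypothetical and not something the paper specifies.

```python
import torch

class VarianceAdaptiveWeight:
    """Illustrative variance-guided reward scaling: divide each reward by a
    running estimate of its standard deviation so that neither reward term
    dominates the consistency loss."""
    def __init__(self, momentum: float = 0.99, eps: float = 1e-6):
        self.momentum, self.eps = momentum, eps
        self.stats = {}  # reward name -> (running mean, running second moment)

    def __call__(self, name: str, reward: torch.Tensor) -> torch.Tensor:
        m0 = reward.mean().detach()
        v0 = (reward ** 2).mean().detach()
        m, v = self.stats.get(name, (m0, v0))
        m = self.momentum * m + (1 - self.momentum) * m0
        v = self.momentum * v + (1 - self.momentum) * v0
        self.stats[name] = (m, v)
        std = (v - m ** 2).clamp_min(self.eps).sqrt()
        return reward / std  # scale-normalised reward term

# Hypothetical combined objective (lambda_q, lambda_s are placeholder weights):
# loss = cd_loss - lambda_q * w("quality", r_q).mean() - lambda_s * w("semantic", r_s).mean()
```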
Fully AI-generated
PISCES: Annotation-free Text-to-Video Post-Training via Bi-objective OT-aligned Rewards |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
PISCES is an annotation-free post-training method that aligns reward signals with human judgment via a Bi-objective OT-aligned Rewards module. It uses optimal transport to bridge text–video embeddings at two levels:
(1) a Distributional OT-aligned Quality Reward capturing overall visual quality and temporal coherence, and (2) a Discrete token-level OT-aligned Semantic Reward enforcing semantic and spatio-temporal correspondence between text tokens and video patches. This dual-objective design provides rich, label-free supervision that improves text–video alignment and fidelity during post-training.
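As a reference point for the comments below, our understanding of the token-level semantic alignment can be sketched roughly as follows; the cost construction, the penalty weights gamma/eta, the transported mass, and the use of POT's partial Wasserstein solver are our own assumptions rather than the authors' code.

```python
import numpy as np
import ot  # POT (Python Optimal Transport); assumes ot.partial is available

def semantic_plan(txt_feats, vid_feats, vid_pos, gamma=0.1, eta=0.1, mass=0.5):
    """Partial OT plan between text tokens and video patches over a cost that
    mixes feature dissimilarity with spatial/temporal penalties (one plausible
    reading of the paper's spatio-temporal cost, not its exact form).

    txt_feats: (N, d) L2-normalised text-token features
    vid_feats: (M, d) L2-normalised video-patch features
    vid_pos:   (M, 3) normalised (t, h, w) coordinates of each patch
    """
    cost = 1.0 - txt_feats @ vid_feats.T                        # semantic cost, (N, M)

    # Penalise matches far (in time/space) from each token's best-matching patch.
    anchor = vid_pos[np.argmin(cost, axis=1)]                   # (N, 3)
    d_time = np.abs(anchor[:, None, 0] - vid_pos[None, :, 0])   # (N, M)
    d_space = np.linalg.norm(anchor[:, None, 1:] - vid_pos[None, :, 1:], axis=-1)
    cost = cost + eta * d_time + gamma * d_space

    a = np.full(txt_feats.shape[0], 1.0 / txt_feats.shape[0])   # uniform text marginal
    b = np.full(vid_feats.shape[0], 1.0 / vid_feats.shape[0])   # uniform patch marginal
    # Transport only a fraction `mass` of the probability mass (partial OT).
    return ot.partial.partial_wasserstein(a, b, cost, m=mass)
```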
Strengths:
1. Annotation-free and scalable: Post-training without human labels reduces cost and bias, enabling easy scaling to large datasets.
2. Bi-objective rewards: Combines a distributional OT-aligned quality reward with a token-level OT-aligned semantic reward.
3. Robustness to noise/misalignment: Spatio-temporal constraints and partial OT selectively match salient tokens/patches, suppressing spurious alignments.
Weaknesses:
1. Insufficient differentiation from prior OT/POT work.
The paper does not clearly articulate what is fundamentally new relative to existing optimal-transport or partial-OT approaches for cross-modal alignment and reward shaping.
2. Limited analysis of bi-objective interactions.
The interaction between the quality-level and token-level rewards is not characterized. There are no training/reward curves or gradient cosine-similarity analyses to diagnose potential conflicts, leaving stability and robustness unclear.
3. Risk of reward hacking without mitigation evidence.
The method may optimize one objective at the expense of the other, but the paper provides few empirical checks for this. Mitigation strategies and failure-case studies are needed to assess robustness.
Questions:
1. What is fundamentally new relative to existing OT/partial-OT methods for cross-modal alignment and reward shaping? Please specify the conceptual and technical differences.
2. Do the two objectives interact adversely during training? Please report training/reward trajectories and the cosine similarity between their gradients (a generic sketch of such a diagnostic follows these questions), and discuss any observed conflicts.
3. Have you observed reward hacking (e.g., improving one objective degrades the other or overall performance)? If so, what mitigation strategies did you employ, and how effective were they?
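To make question 2 concrete, a diagnostic along the following lines would suffice; this is a generic PyTorch sketch, not tied to the authors' code.

```python
import torch

def grad_cosine(loss_a: torch.Tensor, loss_b: torch.Tensor, params) -> torch.Tensor:
    """Cosine similarity between the gradients of two scalar losses w.r.t. the
    same parameters; values near -1 indicate conflicting objectives."""
    params = [p for p in params if p.requires_grad]
    ga = torch.autograd.grad(loss_a, params, retain_graph=True, allow_unused=True)
    gb = torch.autograd.grad(loss_b, params, retain_graph=True, allow_unused=True)

    def flatten(grads):
        # Treat parameters untouched by a loss as having zero gradient.
        return torch.cat([(g if g is not None else torch.zeros_like(p)).flatten()
                          for g, p in zip(grads, params)])

    va, vb = flatten(ga), flatten(gb)
    return torch.dot(va, vb) / (va.norm() * vb.norm() + 1e-12)
```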
Moderately AI-edited |