|
VC-Bench: Pioneering the Video Connecting Benchmark with a Dataset and Evaluation Metrics |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a benchmark for the video connecting task, which requires video generation models to synthesize the missing intermediate frames between given start and end video clips. Specifically, the paper applies scene detection, periodic motion detection, and video clip extraction to construct the dataset, which provides diversity across open domains. The paper further proposes 9 metrics covering video quality, start-end consistency, and transition smoothness. Comprehensive experiments are conducted on various video generation models.
1. The proposed task is highly valuable and meaningful for future research.
2. This paper is well-organized and easy to read.
1. The paper needs a more in-depth comparative analysis beyond the task definition to clarify what specific innovations have been made in constructing this benchmark, especially compared with the existing First-Last Frame to Video task.
2. Compared with the existing First-Last Frame to Video task, what different requirements does the video connecting task impose on the generation model? Are there corresponding experimental results in the evaluation to support this?
Please see the weaknesses above. |
Fully human-written |
|
VC-Bench: Pioneering the Video Connecting Benchmark with a Dataset and Evaluation Metrics |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces the novel task of Video Connecting, which aims to generate transitional video content that seamlessly links a given start clip and end clip. The authors propose VC-Bench, a new benchmark to evaluate this task, consisting of a dataset of 1,579 videos and a three-part evaluation framework (Video Quality Score, Start-End Consistency Score, and Transition Smoothness Score). The paper provides a comprehensive evaluation of several video generation models on this benchmark, identifying current limitations in start-end consistency and transition smoothness.
- The paper formalizes the Video Connecting task, which, while related to existing video generation problems, presents a non-trivial challenge. The paper provides a valuable comparison by adapting and evaluating several recent state-of-the-art video generation models for this new task.
- The paper offers a detailed pipeline for the VC-Bench dataset construction and the calculation of the proposed evaluation metrics.
- The long-term impact of the benchmark may be limited, as Video Connecting could be viewed as a niche or minor task rather than a foundational problem. There is significant overlap with existing video generation, extension, or interpolation tasks, and few works are specifically dedicated to this problem, which may limit the benchmark's adoption.
- The dataset construction pipeline (e.g., scene detection, clip filtering, captioning) and core evaluation metrics (particularly the Video Quality Score components) are largely adopted from existing benchmarks for text-to-video and image-to-video generation. The newly proposed task-specific metrics, such as the Start-End Consistency Score and Transition Smoothness Score, appear to be straightforward implementations and lack significant novelty.
- The proposed SECS and TSS metrics rely on a direct comparison against the ground-truth video. For a generative task, there are potentially many plausible ways to connect two clips. Relying on ground-truth similarity may unfairly penalize novel or creative outputs, making the metrics less reliable for evaluating the true generative capabilities of a model.
- Related to the point above, while the paper distinguishes the VC task from First-Last Frame to Video generation, the evaluation metrics do not seem to fully capture the complexity of ensuring content consistency with the entirety of the start and end clips, instead focusing on pixel-level and optical-flow comparisons, which are still largely frame-based.
- Regarding the human alignment evaluation (Section 5.3): Did the authors just check the correlation with human scores, or did they actively try to make the metrics match human preferences? For instance, how were the weights for the sub-metrics (in VQS, SECS, TSS) decided? Were they tuned to match human scores, or just set by a simple rule, like averaging? |
Lightly AI-edited |
|
VC-Bench: Pioneering the Video Connecting Benchmark with a Dataset and Evaluation Metrics |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces VC-Bench, a novel benchmark for evaluating models on the emerging Video Connecting (VC) task — generating smooth, temporally coherent transitions between given start and end video clips. The benchmark includes a curated dataset of 1,579 high-quality videos spanning 15 major categories and 72 subcategories, along with 9 quantitative metrics that evaluate three key dimensions: Video Quality, Start-End Consistency, and Transition Smoothness. Results highlight the limitations of open-source models in maintaining continuity and temporal smoothness, while human evaluation shows strong alignment with the proposed metrics.
- Novel Task Definition: Clear formulation of the Video Connecting task as a distinct challenge, bridging isolated generation and temporal continuity.
- Comprehensive Benchmark Design: A well-curated dataset with rigorous filtering, aesthetic scoring, and scene detection ensures quality and diversity.
- Limited Model Diversity: The evaluation excludes closed-source systems (e.g., Sora, Runway Gen-3), which might exhibit different performance trends.
- Metric Interpretability: Some metrics (e.g., Video Connecting Distance) could benefit from additional qualitative examples to illustrate their perceptual meaning.
- Minor Writing Artifacts: Occasional typographical spacing and minor stylistic inconsistencies could be refined.
- How sensitive are the evaluation metrics (especially TSS and SECS) to different video lengths and resolutions?
- Could VC-Bench be extended to evaluate multi-clip or looped video transitions beyond simple start-end pairs? |
Fully AI-generated |