ZoomV: Temporal Zoom-in for Efficient Long Video Understanding
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces ZoomV, an agentic-style approach to long video understanding that differs from most existing methods, which typically rely on multiple specialized models for different sub-tasks. Instead, ZoomV offers an end-to-end framework using a single model. The approach is inspired by neural cognition, mimicking how humans selectively “zoom in” on relevant visual content to enhance understanding and reasoning. Experimental results demonstrate significant gains on video QA benchmarks, including state-of-the-art performance on the challenging LVBench, as well as strong improvements in temporal grounding tasks.
* The motivation is clear and well defined: the paper addresses long video understanding, which is challenging because of the large number of frames and the loss of temporal context. It explains why existing solutions (uniform sampling, token sparsification, multi-model agents) are insufficient.
* It presents a novel human-inspired framework, modeled on how humans watch videos and on human self-reflection capabilities.
* The end-to-end single-model design is a very important point for video agents.
* ZoomV achieves state-of-the-art results on the challenging LVBench and competitive results on other video QA benchmarks when compared with video agents built on models with far more parameters (such as GPT).
* Results show that performance on short-video datasets is not sacrificed by this method, even though it is designed specifically for long video understanding.
* The paper presents ablation studies showing how its modules benefit long video understanding.
* The model is not training-free, unlike other video agents such as VideoTree.
* I found Section 3.3 somewhat difficult to follow. It would be helpful to provide additional details or clearer explanations of the key steps and reasoning in this section to improve readability and allow the reader to better understand the contribution.
* Experiments on InternVL3 would be a valuable addition, although my decision does not depend on this.
* The baselines in Table 2 seem a bit weak. Would it be possible to include stronger baselines, even if they are not fully designed for temporal grounding?
* The paper lacks an analysis of runtime overhead on very long videos compared with models other than VideoTree. While the paper highlights the advantage of not downsampling, the scaling behavior and trade-offs could be benchmarked more explicitly.
* I noticed a possible inconsistency in the evaluation protocol: for multiple-choice questions you ask “which choice is correct,” whereas for open-ended questions you ask whether the “temporal window is correct.” Could you clarify why different criteria are used for these two settings? Why can't you use $I_{tf}$ for both?
* What is the number of parameters of the LLaVA-Video, Qwen2.5-VL, and InternVL2.5 models? I assume you are using the 72B versions, right?
Fully human-written

ZoomV: Temporal Zoom-in for Efficient Long Video Understanding
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The authors address the challenge of long-video understanding by proposing a mechanism to select the most relevant frames in a video for answering a given prompt. They introduce ZoomV, a method built on TemporalLink, an additional module for MLLMs that embeds timestamp information into timestamp tokens, which are then linked to visual tokens. Their second contribution is TemporalLight, an approach that uses the model's self-reflection to assess the relevance of a selected time window for a given query, producing a confidence score. By considering multiple windows with varying reflection confidence, the window with the highest confidence can be selected for the video understanding task. This effectively concentrates computational effort on the most pertinent input frames. The authors apply their method on top of LLaVA-Video, InternVL, and Qwen2.5-VL and show improvements on MVBench, MLVU, LongVideoBench, VideoMME, and LVBench. They also evaluate the method on temporal grounding benchmarks such as Charades-STA, ActivityNet-Captions, and ReXTime.
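To make the selection mechanism concrete, the reflection step as I understand it roughly amounts to the following sketch (the function names, scoring interface, and prompt wording are my own illustration, not taken from the paper):

```python
# Minimal sketch of TemporalLight-style reflection (my reading of the paper, not the authors' code).
# `mllm.score(frames, prompt)` is an assumed placeholder returning something like P("yes").

def select_window(mllm, frames, timestamps, query, candidate_windows):
    """Pick the candidate temporal window the model itself judges most relevant to the query."""
    best_window, best_conf = None, -1.0
    for (t_start, t_end) in candidate_windows:
        # Keep only frames whose timestamp falls inside the candidate window.
        window_frames = [f for f, t in zip(frames, timestamps) if t_start <= t <= t_end]
        # Self-reflection: ask the model whether this window suffices to answer the query.
        prompt = (
            f"Question: {query}\n"
            f"Do the frames from {t_start:.1f}s to {t_end:.1f}s contain enough evidence "
            f"to answer the question? Answer yes or no."
        )
        conf = mllm.score(window_frames, prompt)
        if conf > best_conf:
            best_window, best_conf = (t_start, t_end), conf
    return best_window, best_conf
```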
- The paper is well written, and the proposed methods, TemporalLink and TemporalLight, are sound and relevant.
- Introducing self-correction to select the frames to use is an interesting idea; MLLMs are indeed sometimes better when used as judges.
- There are a number of missing baselines: other training-based methods such as LongVU, LongVA, and Frame-Voyager, and training-free methods such as Adaptive Keyframe Sampling (AKS). Overall, I am concerned by the lack of comparison with other methods and by the very short related work section, which does not cite the most common papers on video frame selection for MLLMs.
- I would have appreciated a deeper study of efficiency, with a better discussion of training cost/time and inference time versus other methods such as LongVU or other frame selection methods such as AKS.
- I would also have liked a more in-depth study of the self-reasoning used for window selection. Are some models better at it, or did you observe similar results for LLaVA-Video, InternVL, and Qwen?
Is there any reason why you did not compare your method with similar training-based methods for frame selection?
Fully human-written

ZoomV: Temporal Zoom-in for Efficient Long Video Understanding
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper proposes ZoomV, a query-aware temporal zoom-in framework designed for efficient and accurate long video understanding. It retrieves relevant events and their associated temporal windows as candidates, then selects the higher-confidence temporal windows as the LVLM's final input for producing the answer. Experiments on temporal grounding benchmarks as well as long video understanding benchmarks demonstrate the effectiveness of the proposed method.
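As I read it, the overall pipeline is roughly the following coarse-to-fine loop (a sketch under my own assumptions about the interfaces; names such as `propose_windows` are placeholders, not the authors' API):

```python
# Sketch of the query-aware temporal zoom-in pipeline as I understand it (not the authors' code).
# `video.sample_uniform`, `mllm.propose_windows`, and `mllm.answer` are assumed placeholder interfaces.

def zoomv_answer(mllm, video, query, num_frames=64, max_depth=2, stop_conf=0.9):
    """Coarse-to-fine: sample sparsely, ground a relevant window, then re-sample densely inside it."""
    t_start, t_end = 0.0, video.duration
    for _ in range(max_depth):
        frames, stamps = video.sample_uniform(t_start, t_end, num_frames)
        # The model proposes candidate temporal windows with confidence scores for the query.
        candidates = mllm.propose_windows(frames, stamps, query)  # e.g. [((s, e), conf), ...]
        (t_start, t_end), conf = max(candidates, key=lambda c: c[1])
        if conf >= stop_conf:  # confident enough: stop zooming in
            break
    # Answer from frames re-sampled densely inside the selected window.
    frames, stamps = video.sample_uniform(t_start, t_end, num_frames)
    return mllm.answer(frames, stamps, query)
```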
1. The paper is clearly written and easy to follow.
2. The proposed approach is reasonable and methodologically sound.
3. Experiments are conducted on both temporal grounding and long video understanding benchmarks to demonstrate the effectiveness of the method.
1. One of the main contributions claimed by the paper is the confidence-based temporal grounding approach. However, this concept has already been introduced in TimeSearch [1]. Therefore, it cannot be regarded as a novel contribution of this work. Moreover, the authors have not properly cited TimeSearch to acknowledge prior work.
2. The technical novelty appears limited, as the main modification involves adding textual timestamps to each frame embedding, which was already employed in models such as Eagle2.5 [2] ([Eagle2.5 implementation](https://github.com/NVlabs/Eagle/blob/047e51070e8976978376cb828f7af92323c0f8ef/Eagle2_5/deployment/inference.py#L85)).
3. The method does not appear to be consistently effective across all video benchmarks, and the improvement is marginal on several of them, such as MVBench and VideoMME.
4. Since the paper positions its approach as an agent-style method, it should also include comparisons with recent video agent frameworks such as Video-RAG [3].
5. Given that the model is fine-tuned on a recent backbone (Qwen2.5-VL), which already exhibits strong temporal grounding capabilities, it would be more convincing to compare against recent models fine-tuned on the same base, such as VideoChat-R1 [4] and Time-R1 [5].
6. Hierarchical search is not a novel idea; it has already been explored in TimeSearch [1], UniTime [6], and VideoChat-R1.5 [7]. These works should be discussed in the related work and experiments.
7. Since the proposed methods are fundamentally based on temporal grounding, the paper should include a discussion of the temporal grounding task in the related work section.
8. The paper emphasizes efficiency in its title; however, it does not provide a comprehensive analysis of efficiency compared to the base models.
[1] TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding, arXiv:2504.01407.
[2] Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models, arXiv:2504.15271.
[3] Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension, NeurIPS 2025.
[4] VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning, arXiv:2504.06958.
[5] Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding, NeurIPS 2025.
[6] Universal Video Temporal Grounding with Generative Multi-modal Large Language Models, arXiv:2506.18883.
[7] VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception, arXiv:2509.21100.
1. The improvement on VideoMME is very limited, for example only 0.1 with Qwen2.5-VL, and there is no improvement on MVBench, whereas the method gains 11.3 on LVBench. It seems that the method does not generalize to all video benchmarks. Could you explain why it improves by a large margin on LVBench but is not effective on VideoMME?
2. Both VideoMME and LVBench are long-video understanding benchmarks that contain thousands of frames. However, the proposed method only samples 64 frames, which results in a substantial loss of visual details throughout the video and may prevent accurate grounding on evidence frames. Have the authors experimented with increasing the number of sampled frames? This could better demonstrate the effectiveness of the proposed approach.
3. The evaluation involves recursively exploring video frames, meaning that the total number of processed frames exceeds 64. How many frames are explored on average?
4. Considering the increased number of processed frames and the computational overhead, is it entirely fair to compare the results with the base model under a 64-frame input setting? A fairer comparison would be against the model's officially reported best performance; for example, Qwen2.5-VL achieves 70.2 on MLVU and 65.1 on VideoMME.
5. How does the inference efficiency on long-video benchmarks like VideoMME compare with the base models?
Fully human-written |