ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction   | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated    | 1 (25%)  | 6.00       | 3.00           | 4038
Heavily AI-edited     | 0 (0%)   | N/A        | N/A            | N/A
Moderately AI-edited  | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited     | 1 (25%)  | 6.00       | 4.00           | 1428
Fully human-written   | 2 (50%)  | 5.00       | 3.50           | 2496
Total                 | 4 (100%) | 5.50       | 3.50           | 2615

Individual Reviews (Title, Ratings, Review Text, EditLens Prediction)

KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper argues that most existing ASVA (Audio-Synchronized Visual Animation) models uniformly sample video frames, which leads to two core problems in high-dynamic motion scenarios: (1) failure to capture key audio-visual moments, resulting in unsmooth motion transitions; (2) audio-visual temporal misalignment, especially for low-frame-rate models, which struggle to match the fine-grained temporal information of audio. The paper therefore proposes a keyframe-aware audio-to-visual animation framework that first localizes keyframe positions from the input audio and then generates the corresponding video keyframes with a diffusion model; its keyframe generator network selectively produces sparse keyframes from the input image and audio, effectively capturing crucial motion dynamics.

Strengths:
1. The contrast between uniform-frame and keyframe generation, and the keyframe-oriented pipeline in Figure 1, are interesting and beneficial to the research community.
2. The multi-condition cross-attention fusion is carefully designed.
3. The quantitative comparison results and demos show the effectiveness of the proposed method and are convincing to me.

Weaknesses:
1. The ablation studies are not very convincing, since the results in Table 2 are close to each other. In particular, for the "w/o Frame Index" setting, the FVD improvement is 1.7% and the degradations of the synchronization metrics are 2.1% to 2.4%, so the necessity of the frame index is unclear to me.
2. There is no computational efficiency analysis, which is essential for real-world applications. I wonder whether the multi-condition cross-attention in the U-Net blocks is computationally heavy.
3. The paper does not analyze the performance differences of the proposed method across different scenarios. It claims that the method is particularly advantageous in "intensive motion" scenarios (Line 485), but this lacks quantitative analysis and verification.

Questions:
1. Discuss and explain the effectiveness of the proposed techniques, especially the frame index.
2. Compare the time efficiency of the proposed method with that of the baselines; for example, real-time factor (RTF) and GFLOPs should be taken into consideration.
3. Add more comparisons with baselines on intensive-motion and non-intensive-motion scenarios, and discuss the differences.
4. Will the code and pretrained model be released?

EditLens Prediction: Fully human-written
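
To make the efficiency concern in this review concrete: one plausible reading of the multi-condition cross-attention discussed above is that video latents attend to a concatenation of text, audio, and first-frame tokens inside each U-Net block. The sketch below illustrates that reading only; the class name, shapes, and concatenation strategy are assumptions, not details confirmed by the paper.

```python
# Hypothetical sketch of a multi-condition cross-attention block (assumed design:
# video latents attend to concatenated text/audio/image condition tokens).
import torch
import torch.nn as nn

class MultiConditionCrossAttention(nn.Module):
    def __init__(self, latent_dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)

    def forward(self, video_tokens, text_tokens, audio_tokens, image_tokens):
        # video_tokens: (B, N_video, latent_dim); condition tokens: (B, N_*, cond_dim)
        cond = torch.cat([text_tokens, audio_tokens, image_tokens], dim=1)
        out, _ = self.attn(query=self.norm(video_tokens), key=cond, value=cond)
        return video_tokens + out  # residual connection
```

Under this reading, the attention cost scales with the number of video tokens times the total number of condition tokens, which is what the requested GFLOPs and real-time factor (wall-clock generation time divided by generated clip duration) comparison would quantify.
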
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper presents KeyVID, a keyframe-aware diffusion framework for generating videos that are temporally synchronized with input audio. The core idea is to exploit the correlation between peaks in the motion signal (optical-flow intensity) and peaks in the audio signal to determine key moments of action. The system decomposes the task into three modules: a Keyframe Localizer that predicts motion peaks from audio, a Keyframe Generator that synthesizes visual frames conditioned on audio, text, and the first image, and a Motion Interpolator that fills in intermediate frames for smooth transitions. While the underlying assumption that strong sounds correspond to large motions is conceptually simple, the paper demonstrates that modular design and diffusion-based conditioning yield high-quality, audio-synchronized animations, outperforming prior methods (e.g., AVSyncD) in both quantitative metrics and human preference.

Strengths:
The paper's strength lies in its clear conceptual simplicity combined with strong engineering design. Instead of introducing a novel generative paradigm, it isolates key factors affecting audio-visual synchronization and builds an effective three-stage system around them. The modular structure (localization, generation, interpolation) makes the overall process interpretable and flexible. The idea of learning motion saliency from audio peaks via optical-flow supervision is intuitive yet elegantly implemented, enabling temporal precision without requiring explicit motion labels. Moreover, the integration of first-frame conditioning and frame-index embeddings ensures temporal consistency and visual coherence across non-uniformly sampled keyframes, an aspect that many prior diffusion-based approaches fail to achieve. Experimental results are convincing, showing state-of-the-art performance on both synchronization and visual-quality metrics. The paper is also well written, with clear motivation and comprehensive ablations that help readers understand the contribution of each module. The proposed framework feels robust, scalable, and generalizable beyond its training distribution.

Weaknesses:
Despite its strong empirical results, the conceptual novelty is somewhat limited. The paper's main assumption, that audio peaks align with motion peaks, is simple and well known in the audio-visual literature; the novelty mainly comes from a careful engineering decomposition rather than a new theoretical insight. The keyframe selection mechanism remains heuristic (based on fixed thresholds and local extrema), which, while effective, feels ad hoc and could limit robustness for more complex or subtle motion types. For instance, the model performs less consistently on "subtle-motion" videos (e.g., violin, trumpet) or single-event sequences (e.g., frog croaking), where perceptual synchronization is harder to judge and the heuristic peak detection may fail. Furthermore, the 2-second clip length used in both training and user studies constrains the evaluation of long-term consistency and overall narrative quality. The model's dependence on the first frame also raises concerns about appearance drift or overfitting to static conditions when generating longer sequences.

Questions:
In addition to the weaknesses, it would be great if the authors could respond to the following minor comments.
- The paper would benefit from more discussion of failure cases, especially where KeyVID underperforms in the user study (e.g., low-motion or single-event clips).
- Figure 5 and Appendix F could be expanded to show visual differences in subtle-motion scenarios, not just high-intensity ones.
- The authors might consider exploring learnable or probabilistic keyframe selection instead of the fixed heuristic used in Section 3.1.
- The limitation of using short 2-second videos for subjective evaluation should be explicitly acknowledged; looping or extended clips could help reduce perceptual bias.
- It would be interesting to see comparisons against pose-based or structure-aware baselines such as TANGO, to assess generalization to human-centric motion.

EditLens Prediction: Fully AI-generated
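
The "fixed thresholds and local extrema" heuristic criticized above might look roughly like the sketch below, assuming the Keyframe Localizer outputs a per-frame motion score predicted from audio. The function name, the keyframe budget, and the minimum peak spacing are illustrative assumptions, not values from the paper.

```python
# Sketch of a local-extrema keyframe-selection heuristic over a predicted motion score.
import numpy as np
from scipy.signal import find_peaks

def select_keyframes(motion_score: np.ndarray, max_keyframes: int = 12,
                     min_gap: int = 2) -> np.ndarray:
    """Pick local maxima/minima of the motion-score curve as keyframe indices."""
    peaks, _ = find_peaks(motion_score, distance=min_gap)      # local maxima
    valleys, _ = find_peaks(-motion_score, distance=min_gap)   # local minima
    idx = np.union1d(np.union1d(peaks, valleys),
                     [0, len(motion_score) - 1])               # always keep endpoints
    if len(idx) > max_keyframes:
        # keep the extrema that deviate most from the mean score
        dev = np.abs(motion_score[idx] - motion_score.mean())
        idx = np.sort(idx[np.argsort(dev)[-max_keyframes:]])
    return idx
```

A learnable or probabilistic alternative, as suggested in the comments above, would replace this peak picking with a selection module trained jointly with the generator, for example a differentiable top-k over per-frame saliency logits.
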
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes an approach for audio-driven image animation, where static images are animated into videos synchronized with the input audio both semantically and temporally. The method decomposes the animation process into two stages: the first generates keyframes corresponding to key actions derived from the audio, and the second interpolates between these keyframes to produce continuous motion. Both stages use a video inbetweening model to generate frames.

Strengths:
I appreciate the idea of generating keyframes or key actions first, which need not be uniformly distributed. This design effectively mitigates the potential mismatch between audio and generated video arising from differences in their sampling frequencies.

Weaknesses:
1. I am skeptical about the definition of keyframes as frames with peak motion scores. The authors should discuss the applicability and limitations of this definition. For instance, in dance videos, key movements often occur on musical beats, where the motion velocity is near zero; these moments would not correspond to frames with the highest motion scores.
2. I would like the authors to provide further justification for this keyframe definition.
3. Based on the provided video results, the method appears to be applicable primarily to sound events. Moreover, the paper presents too few video examples to convincingly demonstrate the effectiveness of the proposed approach.

Questions:
See the weaknesses above.

EditLens Prediction: Lightly AI-edited
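
For reference, the "motion score" whose peaks define keyframes is plausibly something like the mean optical-flow magnitude between consecutive frames. The sketch below computes such a score with OpenCV's Farneback flow; the paper's exact formulation may differ, so treat this purely as an assumed stand-in.

```python
# Sketch: per-transition motion score as mean optical-flow magnitude (assumed definition).
import cv2
import numpy as np

def motion_scores(frames: list[np.ndarray]) -> np.ndarray:
    """frames: list of HxWx3 uint8 RGB images; returns one score per frame transition."""
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]
    scores = []
    for prev, curr in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        scores.append(np.linalg.norm(flow, axis=-1).mean())
    return np.asarray(scores)
```

The dance-beat case raised in Weakness 1 corresponds to beats landing near local minima of such a velocity-based score, which is exactly why peak picking on it can miss perceptually key poses.
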
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper adds an audio condition to an existing text-image-to-video (TI2V) model.
1. The backbone is DynamiCrafter, and the dataset is the open-source audio-visual generation dataset AVSyncD. The generated videos are around 2 s (48 frames).
2. The goal is to solve audio-visual misalignment. The idea is to first select audio keyframes, then generate keyframes using the selected audio, and finally perform video interpolation.
- The authors train an audio-to-optical-flow network to predict optical flow and select audio keyframes at local minima/maxima.
- The keyframes are generated conditioned on the selected keyframes' audio features, the image and text, and the indices of the target frames.
- Video interpolation is done by fine-tuning DynamiCrafter with Wan 2.2-style image-mask conditioning.
3. The objective scores beat the SoTA, and 7 video results are attached.

Strengths:
1. The paper is well written and easy to follow.
2. The evaluation contains both objective and subjective metrics/samples and shows results better than previous methods.
3. The authors include the details of each module in the appendix.

Weaknesses:
1. The high-level idea sounds rule-based, and there is not enough evidence for why it is better than generating all frames at once.
- Limitation of the rule-based design: using optical flow and picking local minima/maxima may not be suitable for some smooth audio, e.g., a river or a plane taking off. The idea may not be general enough to push the boundary of current ATI2V models; it may require a more general mapping model, for example one based on contrastive learning, as used for text and image.
- How is the threshold on the number of keyframes set? For the hammer case, if the hitting is very fast, e.g., 10 hits in 2 seconds, should there be at least 20 keyframes?
2. The implementation, using a video model to generate discontinuous frames through a learned frame embedding while keeping the original RoPE, does not sound straightforward.
- First, only the selected audio keyframe features are used; will this be enough? Considering the hammer case, only the sound of the hits is captured.
- For adding the frame-index condition to the network, is it possible to directly modify the existing position embedding instead?

Questions:
Overall, this is a clearly written paper with competitive experiments. My concern is that the idea itself sounds rule-based and not general. For 2 s audio-video generation, a length we have enough GPU memory to train on directly, I wonder whether end-to-end modeling could achieve good results after filtering out the misaligned audio-visual data from the dataset. The details of my questions are in the Weaknesses part.

EditLens Prediction: Fully human-written
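
Regarding Weakness 2, one plausible reading of the frame-index conditioning questioned above is a learned embedding of each keyframe's absolute index in the 48-frame clip, added to that frame's tokens while the backbone's own positional encoding still sees contiguous positions 0..K-1. The sketch below illustrates that reading; all names, shapes, and the 48-frame maximum are assumptions rather than confirmed details.

```python
# Hypothetical sketch of frame-index conditioning for non-uniformly sampled keyframes.
import torch
import torch.nn as nn

class FrameIndexEmbedding(nn.Module):
    def __init__(self, max_frames: int = 48, dim: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(max_frames, dim)

    def forward(self, latents: torch.Tensor, frame_idx: torch.Tensor) -> torch.Tensor:
        # latents: (B, K, N, dim) tokens of K keyframes; frame_idx: (B, K) absolute indices
        return latents + self.embed(frame_idx).unsqueeze(2)
```

The reviewer's alternative, modifying the existing position embedding directly, would instead feed the absolute non-contiguous indices into the positional encoding (e.g., RoPE over the true frame positions) rather than stacking a separate learned embedding on top of it.
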