|
GenCompositor: Generative Video Compositing with Diffusion Transformer |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a new video editing task called generative video compositing and proposes the first feasible solution, GenCompositor. Based on the Diffusion Transformer architecture, this method can automatically composite dynamic elements from a foreground video into a background video according to user-specified trajectories and dimensions.
1. This work is the first to introduce the "generative video compositing" task, which aims to automatically composite foreground video elements into a background video using generative models while supporting user control over attributes such as trajectory and scale.
2. A complete architecture is proposed, including a background preservation branch, DiT fusion blocks, and Extended Rotary Position Embedding (ERoPE), aiming to address the three main challenges: background consistency, foreground injection, and user control.
1. Appendix F shows that extending RoPE along the height, width, or temporal dimension yields nearly identical training loss curves, leading the authors to conclude that the three directions are equivalent. However, this dimension-agnostic behavior appears to contradict the inherent spatio-temporal structure of video. If ERoPE truly models spatial layout relationships, the spatial extensions (height/width) should outperform the temporal one, since they directly encode spatial proximity; conversely, if it captures motion dynamics, the temporal extension should be more effective given the causal, directional nature of video frames. The complete equivalence across all three dimensions suggests that ERoPE may simply work by expanding the position encoding to avoid feature conflicts, rather than genuinely modeling spatio-temporal interactions between foreground and background. Could the authors clarify this dimension independence and explain whether simpler alternatives, such as learnable position offsets for the different video sources, might achieve comparable results (see the sketch after this list)?
2. Practical use requires SAM2 for foreground segmentation and an optical-flow algorithm for trajectory tracking. These preprocessing steps can introduce errors and add system complexity, yet the paper does not discuss how such errors propagate and accumulate into the final results.
3. The paper compares its method with only two video harmonization approaches and two trajectory-controlled generation methods, and omits comparisons with recent state-of-the-art methods on related tasks such as video object insertion, which share clear similarities with generative video compositing.
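To make the question in point 1 concrete, below is a minimal 1-D sketch (my own illustration, not the paper's code) contrasting the "extended positions" reading of ERoPE with the learnable per-source offset alternative; the axis choice, sequence lengths, and the `source_offset` parameter are all assumptions.

```python
# Shifting RoPE positions for the second video source vs. a learnable per-source offset.
import torch

def rope_angles(positions: torch.Tensor, head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary angles per position, shape (num_positions, head_dim // 2)."""
    freqs = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions.float(), freqs)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x (num_tokens, head_dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

T_bg, T_fg, head_dim = 16, 16, 64
bg_pos = torch.arange(T_bg)
fg_pos = torch.arange(T_fg) + T_bg        # "extended" positions: no overlap with the background range
q_bg = apply_rope(torch.randn(T_bg, head_dim), rope_angles(bg_pos, head_dim))
q_fg = apply_rope(torch.randn(T_fg, head_dim), rope_angles(fg_pos, head_dim))

# Alternative raised above: keep overlapping positions and add a learned
# per-source embedding to the foreground tokens instead (hypothetical).
source_offset = torch.nn.Parameter(torch.zeros(1, head_dim))
q_fg_alt = apply_rope(torch.randn(T_fg, head_dim) + source_offset, rope_angles(torch.arange(T_fg), head_dim))
```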
Please refer to the weaknesses. |
Lightly AI-edited |
|
GenCompositor: Generative Video Compositing with Diffusion Transformer |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces GenCompositor, a diffusion-transformer-based framework for generative video compositing, enabling insertion, removal, and harmonization of dynamic foreground objects within videos. The method takes as input a masked background video, a mask video, and a foreground video, along with user-defined trajectory and scale controls. The model architecture follows an MMDiT design with full self-attention, similar to those used in FLUX, SD3, or HunyuanVideo, and includes:
- A background preservation branch to maintain scene consistency,
- A DiT fusion block, which concatenates background and foreground tokens and fuses them with full self-attention rather than cross-attention (see the sketch after this list), and
- An Extended Rotary Position Embedding (ERoPE) to mitigate layout misalignment artifacts.
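For concreteness, here is a minimal sketch of the two fusion styles (my own illustration; tensor shapes and dimensions are assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

d, n_heads = 512, 8
bg = torch.randn(2, 1024, d)   # background/noise tokens, (batch, N_bg, d)
fg = torch.randn(2, 256, d)    # foreground tokens,       (batch, N_fg, d)

# (a) Token concatenation + full self-attention, as in the described DiT fusion block:
self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
tokens = torch.cat([bg, fg], dim=1)             # one joint sequence
fused, _ = self_attn(tokens, tokens, tokens)    # every token attends to every other token
bg_out = fused[:, :bg.shape[1]]                 # keep only the background/noise part

# (b) The cross-attention alternative it replaces:
cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
bg_out_ca, _ = cross_attn(bg, fg, fg)           # background queries attend to foreground keys/values
```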
Furthermore, a new dataset, VideoComp, containing 61K video triplets, is curated for training. The qualitative results are visually strong and demonstrate convincing object insertion and harmonization, though the quantitative metrics show only modest improvements over existing methods.
- **Strong engineering effort:** The system is well-implemented and technically solid. The architecture is clean and the pipeline is carefully designed.
- **Good visual quality:** Qualitative and video results are impressive, showing smooth motion and coherent integration of foreground and background.
- **Clear writing and presentation:** The paper is well-organized, easy to follow, and supported by clear figures and a detailed video presentation.
- **Practical relevance:** The task aligns well with real-world video editing workflows and could be useful for production or creative pipelines.
1. **Limited conceptual novelty**:
- The core components are incremental modifications of existing DiT architectures.
- The background preservation branch resembles ControlNet-style conditioning, which is also stated in the paper.
- ERoPE is a simple positional embedding shift; while useful, it is not conceptually innovative.
Overall, the work feels like a strong system-level integration rather than a conceptual contribution.
2. **Inadequate quantitative evaluation**:
- The paper reports only frame-level metrics (PSNR, SSIM, LPIPS, CLIP). These are not sufficient for video generation; video-level metrics such as FVD or KVD are missing.
- Quantitative improvements are relatively small, making it difficult to assess significance.
3. **Weak baseline comparisons**:
- The compared baselines (Harmonizer, VideoTripletTransformer) are both video harmonization methods, not true generative compositing or conditional video generation systems.
- DynVFX could be used for comparison. Although it cannot directly insert a given reference object, it can add objects through text prompts, and such a comparison would strengthen the paper's results.
4. **Dataset concerns**:
- The construction of the new VideoComp dataset is clearly described, combining 409K source videos into 61K compositing triplets via a semi-automatic pipeline (SAM2 + CogVLM + Qwen). While this is a solid contribution, the paper provides limited analysis of dataset characteristics such as category diversity, motion distribution, or domain coverage, so it is hard to judge how well the dataset spans diverse motion and scene types. A brief quantitative summary would strengthen this part.
5. **Conceptual framing and scope**:
- The paper frames generative video compositing as a new task, but it heavily overlaps with existing controllable video generation and video inpainting setups.
- The results without the fusion block are much worse. Could this be due to the limited representational power of the VAE encoder? I would like to see additional results where a stronger feature extractor, such as DINO or CLIP, is used for the cross-attention variant, as this could provide more meaningful semantic conditioning.
- The model depends strongly on mask and trajectory inputs. How flexible is it when these are imprecise or unavailable? Can it generalize to more unconstrained compositional setups?
- How well does it handle multiple objects or occlusion?
- The paper mentions luminance augmentation for harmonization. Is it sufficient for handling complex illumination changes, or does it fail in extreme lighting conditions? |
Fully AI-generated |
|
GenCompositor: Generative Video Compositing with Diffusion Transformer |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
GenCompositor introduces generative video compositing, an automated task to inject dynamic foreground elements into background videos via user-specified trajectories/scales. Its core is a Diffusion Transformer pipeline with three key components: a lightweight background preservation branch ensuring input-output consistency, DiT fusion blocks with full self-attention for foreground-background token fusion, and Extended Rotary Position Embedding (ERoPE) addressing layout misalignment. Trained on the 61K-video VideoComp dataset, it outperforms video harmonization (e.g., Harmonizer) and trajectory-controlled generation (e.g., Tora) methods in PSNR, SSIM, and motion smoothness. Ablation studies validate the necessity of each component. GenCompositor enables seamless, interactive video editing with realistic foreground-background interactions.
1. The work addresses an underexplored yet challenging setting: interactive, adaptive injection of foreground identity and motion into target videos, enabling fine-grained control over size, trajectories, and other dynamic attributes.
2. The showcased results demonstrate strong perceptual quality and temporal coherence, suggesting the method’s practical utility.
3. The release of implementation details and code substantially improves reproducibility and facilitates future research.
4. The construction of the 61K-pair VideoComp dataset represents a significant data-engineering contribution and a useful resource for benchmarking video composition.
1. In my opinion, the proposed modules are engineering improvements rather than a fundamental breakthrough. Specifically:
- For Sec.4.1, the proposed Input Conversion primarily involves operations such as scaling and expanding the masked foreground object, which are quite common in existing methods.
- For Sec.4.2, the use of a black area to indicate the modified background and a binarized mask video to represent the masked region is also common practice in prior work (see the sketch after this list).
2. Although ERoPE appears to be a promising idea for alignment, the authors do not elaborate sufficiently on this design.
3. Additionally, I suggest revising Fig.3(A) to match the paper's description by using a mask video instead of a foreground video, which would better demonstrate the unique application of ERoPE.
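To illustrate the point about the Sec.4.2 input format, a minimal sketch of the masked-video / mask-video construction is given below; the zeroing-out of the edit region and the channel-wise concatenation are my assumptions, not the authors' exact code.

```python
import torch

def build_condition(video: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """video: (T, 3, H, W) in [0, 1]; mask: (T, 1, H, W) binary, 1 = region to edit.
    Returns a (T, 4, H, W) condition: the background with the edit region blacked out,
    plus the binary mask itself."""
    masked_video = video * (1.0 - mask)          # black area marks where to composite
    return torch.cat([masked_video, mask], dim=1)

cond = build_condition(torch.rand(16, 3, 256, 256), (torch.rand(16, 1, 256, 256) > 0.8).float())
```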
1. Sec.4.3 proposes a new token-wise concatenation method. However, what differentiates it from in-context learning approaches like VACE [Jiang et al., 2025]? Perhaps I missed some details, but I did not find a detailed explanation in the ablation study or supplementary materials.
2. After reviewing the supplementary material on data construction, I have a question: how do you ensure that the composited foreground strictly adheres to the user-provided trajectory, especially under variations in angle and direction? This raises concerns about how well the trajectory control generalizes. |
Lightly AI-edited |
|
GenCompositor: Generative Video Compositing with Diffusion Transformer |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes GenCompositor, a generative video compositing method that can adaptively inject a foreground video, under user control, into a background video. The framework is built on a diffusion transformer pipeline that lets users control the size and trajectory of the foreground objects and blend them seamlessly into the background video. To validate the effectiveness of the framework, the paper also curates a dataset called VideoComp. The experiments compare against both trajectory-controlled and non-trajectory-controlled video generation methods.
Overall, this paper proposes a new task, generative video compositing, which has not been well addressed before. It differs from existing work on video compositing in that it uses generative models to complete the task; in that sense, the proposed task resembles the existing one but tackles the problem from a different technical angle.
In addition to the task, a further strength is the curated dataset required to evaluate this task; if released, it should help advance research in this field. The paper motivates well why the problem is important and presents the pipeline components relatively clearly. Below, I list the weak points that need particular attention.
The experimental visualizations are not entirely clear. For instance, looking at Figure 1, what is the foreground object in the top-left example, and what trajectory is being depicted there? Higher-resolution, better-quality results would help. Similarly, in Figure 7 the trajectory is not clearly depicted.
The paper mentions interaction between the added foreground object and the background, citing the explosion effect in Figure 1(a) as an example. This looks very interesting. However, I do not believe the paper presents sufficient technical detail to explain why this happens. Normally one would expect the added element to appear as a static overlay, yet this claim implies the newly added element can adapt to the background content. The technical details do not clearly support this behavior; it seems related to mask inflation, but not entirely.
Regarding the input conversion, the user trajectory is used to generate the mask video and the masked video, but the trajectory itself is not directly encoded. It would be helpful to state explicitly what the conditions "c" in Eqn. (2) are.
Based on the workflow of GenCompositor in Figure 2, the masked video and its corresponding mask video are concatenated as the input, which indicates that GenCompositor handles layout-aligned conditions. In Section 4.4, where ERoPE is discussed, the problem shifts to layout-unaligned conditions. This creates a gap in understanding, since the workflow in Figure 2 does not show layout-unaligned conditions. Would layout-unaligned conditions change the workflow of the proposed GenCompositor?
The role of the inflated mask could be better clarified. Initially, I assumed the mask regions were expanded to enlarge the foreground object or to tolerate an imperfect foreground mask, and a Gaussian filter is used to inflate the mask video. Looking at Figure 8, however, I also noticed that the inflated mask appears to be generated naturally as the foreground object changes.
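For reference, my reading of the Gaussian-filter inflation is sketched below; the kernel size, sigma, and threshold are guesses, not values from the paper.

```python
import torch
import torchvision.transforms.functional as TF

def inflate_mask(mask: torch.Tensor, kernel_size: int = 21, sigma: float = 7.0,
                 threshold: float = 0.05) -> torch.Tensor:
    """mask: (T, 1, H, W) binary. Blurring spreads mass outward; a low threshold
    then re-binarizes it into a slightly larger mask around the foreground."""
    blurred = TF.gaussian_blur(mask.float(), kernel_size=kernel_size, sigma=sigma)
    return (blurred > threshold).float()
```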
In summary, the paper tackles an interesting task, and the presented pipeline is a feasible way to address it. There are technical contributions in the proposed pipeline, including the ERoPE component, and the proposed dataset is also a good contribution, so the paper leans positive. However, the concerns listed above need to be addressed, and the responses to them will be critical for the final decision.
1. The interaction between the added foreground object and the background is unclear; how it relates to the technical components of the proposed pipeline needs to be justified.
2. The input conversion procedure is unclear and needs to be further explained.
3. The relationship between layout-unaligned conditions and the proposed pipeline is not clearly justified.
4. The role of the inflated mask and its generation process needs to be better explained. |
Fully human-written |