|
FG-ATTN: LEVERAGING FINE-GRAINED SPARSITY IN DIFFUSION TRANSFORMERS |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces FG-Attn, a fine-grained sparse attention mechanism designed to accelerate video diffusion transformers. Existing methods rely on block-sparse attention, skipping coarse-grained tiles, which leaves much of the redundancy unexploited. FG-Attn exploits finer sparsity at the slice level (M×1) by identifying and skipping negligible query-key computations. The approach introduces a hardware-efficient asynchronous gather-load primitive that allows irregular memory access without reducing tensor core utilization, and two training-free mask prediction strategies to identify redundant slices efficiently. Experiments on large video diffusion models such as Wan 2.1 and HunyuanVideo demonstrate speedups in generation time.
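For concreteness, the mechanism can be summarized with the minimal PyTorch sketch below: for each block of M queries, individual key/value vectors (M×1 slices) are kept or skipped, rather than whole M×M score tiles. The function name, shapes, and the assumption that a keep-mask is already available are illustrative only; the paper implements this inside a custom GPU kernel, not with PyTorch indexing.

```python
# Minimal sketch (not the paper's kernel) of slice-level (M x 1) sparse attention:
# for each block of M queries, individual key/value vectors are kept or skipped,
# rather than whole M x M score tiles as in block-sparse attention.
# keep_mask is assumed given; FG-Attn predicts it with training-free strategies.
import torch

def fg_sparse_attention(q, k, v, keep_mask, block_m=64):
    """q, k, v: [seq, dim]; keep_mask: [num_blocks, seq] bool, where
    keep_mask[b, j] == True means query block b attends to key j."""
    seq, dim = q.shape
    scale = dim ** -0.5
    out = torch.zeros_like(q)
    for b, start in enumerate(range(0, seq, block_m)):
        qb = q[start:start + block_m]                 # [M, dim] query block
        idx = keep_mask[b].nonzero(as_tuple=True)[0]  # kept key indices for this block
        k_sel, v_sel = k[idx], v[idx]                 # only the needed M x 1 slices are touched
        attn = torch.softmax((qb @ k_sel.T) * scale, dim=-1)
        out[start:start + block_m] = attn @ v_sel
    return out
```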
1. The authors propose a novel fine-grained sparse attention approach that meaningfully extends beyond block-sparse paradigms.
2. The paper introduces a practical GPU-level optimization (the asynchronous gather-load primitive) that effectively hides irregular memory-access overheads, and implements the algorithm in ThunderKittens, which is valuable.
My major concern is the evaluation. In Figure 11, the reported speedups for SVG and RadialAttention are much lower than the speedups reported in their original papers. Meanwhile, Wan 2.1 720p and HunyuanVideo 720p show similar speedups, which is inconsistent with much of the prior literature.
Second, the authors should report PSNR or SSIM, as these are more reliable quality measures than VBench (Table 2).
Third, the paper should be explicit about whether the baseline is FlashAttention-3 or FlashAttention-2. It is known that ThunderKittens can produce high-performing kernels on par with FlashAttention-3, so the authors should report the actual TFLOPs numbers in Figure 12 for a more reliable comparison.
Please refer to the weaknesses section. |
Fully human-written |
|
FG-ATTN: LEVERAGING FINE-GRAINED SPARSITY IN DIFFUSION TRANSFORMERS |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes FG-Attn, a fine-grained sparse attention mechanism designed to accelerate diffusion transformers (DiTs) by exploiting slice-level sparsity ($M \times 1$). The authors introduce a custom GPU "gather-load" primitive to efficiently handle the resulting irregular memory access.
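To make the gather-load idea concrete, the sketch below is a rough host-side analogy (with assumed names and buffer sizes, not the paper's CUDA code): the key/value rows selected by the sparse mask are packed into a contiguous staging buffer so the score computation remains a dense GEMM. In the actual primitive this staging happens asynchronously into shared memory, overlapped with tensor-core compute.

```python
# Host-side analogy (assumed names, not the paper's CUDA code) of the gather-load idea:
# scattered key/value rows selected by the mask are staged into a contiguous buffer,
# so the subsequent score computation stays a dense GEMM. On the GPU this staging is
# done asynchronously into shared memory and overlapped with compute.
import torch

def staged_attention_block(qb, k, v, kept_idx, buf_size=1024):
    """qb: [M, d] query block; k, v: [seq, d]; kept_idx: [n] long tensor with n <= buf_size."""
    d = qb.shape[-1]
    n = kept_idx.numel()
    k_buf = qb.new_zeros(buf_size, d)        # stand-in for an on-chip staging buffer
    v_buf = qb.new_zeros(buf_size, d)
    k_buf[:n].copy_(k[kept_idx])             # the "gather-load": scattered rows -> contiguous buffer
    v_buf[:n].copy_(v[kept_idx])
    scores = (qb @ k_buf[:n].T) * d ** -0.5  # dense GEMM over the packed buffer
    return torch.softmax(scores, dim=-1) @ v_buf[:n]
```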
The primary strength of this work is its ability to apply fine-grained sparsity to accelerate inference without a noticeable degradation in output quality. As demonstrated in Table 2 and the visual examples in the appendix (Figures 16-18), the videos generated using FG-Attn are qualitatively comparable to the baseline.
1. **Significant Presentation Issues and Typos:** The paper is in poor shape and is difficult to read.
* **Lack of Focus:** A large portion of the paper is dedicated to basic, well-known background on diffusion models and standard GPU architecture (e.g., Section 2, Appendix B). This space would be far better utilized by expanding on the core novel contribution—the gather-load primitive and its implementation challenges.
* **Numerous Typos:** The paper is riddled with simple typographical errors that betray a lack of proofreading. For example:
* Line 052: "Thus,the" is missing a space.
* Line 077: "slices of can reduce" is grammatically incorrect.
* **Incorrect LaTeX Formatting:** Mathematical notation is not formatted to publication standards.
* Line 137 (and many other places): The `log` function should be typeset as `\log` ($log$ vs. $\log$).
* Mathematical terms like "data" or "tile" should be properly typeset using `\mathrm` (e.g., $p_{\mathrm{data}}$) for clarity.
* The LaTeX formatting in Lines 270-274 appears to be broken and is unreadable.
2. **Limited Novelty:** The core contribution is a hardware-aware algorithm, which the authors acknowledge. However, this contribution appears to be an incremental engineering optimization rather than a fundamental new idea. The work feels like a specialized application of the same principles (IO-awareness, overlapping compute and memory) that made **FlashAttention** successful, but applied to a different sparsity pattern (slices instead of blocks). It doesn't introduce a new conceptual framework for attention or sparsity.
3. **Limited Performance Gain:** The reported end-to-end speedups (up to 1.65x, with a 1.48x geometric mean) are moderate compared to other methods like SVG. Given the significant, non-trivial engineering effort required to design, implement, and debug a custom CUDA kernel, this performance gain feels limited and may not justify the added complexity over simpler, existing block-sparse methods, though I sincerely appreciate your effort on this.
4. **Misalignment with ICLR:** This paper's contribution is almost entirely at the systems level. The novelty lies in the CUDA kernel implementation, not in the model or the learning paradigm. A stronger paper for ICLR would **investigate the learning dynamics of DiTs** to understand *why* this fine-grained sparsity emerges and then propose methods to **leverage or induce this sparsity at the model level**. As-is, this work would be a much better fit for a systems-focused conference (e.g., **MLSys**), where a deep dive into the hardware implementation would be the main focus.
See the weaknesses section. |
Heavily AI-edited |
|
FG-ATTN: LEVERAGING FINE-GRAINED SPARSITY IN DIFFUSION TRANSFORMERS |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents FG-Attn, a fine-grained sparse attention mechanism designed to accelerate diffusion transformers (DiTs) used for realistic video and image generation. Unlike prior block-sparse approaches that operate on coarse M×M tiles (e.g., 64×64), FG-Attn exploits fine-grained sparsity at the level of M×1 slices, effectively reducing redundant attention computations. The authors address the key challenges of irregular memory access and low tensor core utilization by introducing a novel asynchronous gather-load primitive that efficiently loads only the required key/value vectors into on-chip shared memory. Experiments on state-of-the-art video models show up to 1.65× speedup (1.48× on average) on H100 GPUs, demonstrating that FG-Attn can surpass existing block-sparse attention methods in diffusion transformers.
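A back-of-the-envelope illustration of why the M×1 granularity matters is given below. It uses a synthetic sparsity pattern (assumed, not measured from the paper): if each block of 64 queries needs only ~10% of the keys but those keys are scattered, almost every 64×64 tile still contains at least one needed key and cannot be skipped by a block-sparse method, whereas slice-level skipping recovers nearly all of the sparsity.

```python
# Illustrative comparison (synthetic pattern, assumed 10% density) of how much of the
# score matrix must still be computed at M x M tile granularity versus M x 1 slice
# granularity. A tile must be computed if it contains any needed key, so coarse tiles
# recover far less of the underlying fine-grained sparsity.
import torch

torch.manual_seed(0)
seq, block = 4096, 64
num_q_blocks, num_k_blocks = seq // block, seq // block

# Per query block, each key is "needed" independently with probability 0.10 (synthetic).
needed = torch.rand(num_q_blocks, seq) < 0.10

slice_frac = needed.float().mean().item()                                   # M x 1 slices
block_frac = needed.view(num_q_blocks, num_k_blocks, block).any(dim=2).float().mean().item()  # M x M tiles
print(f"fraction of attention computed with M x 1 slices: {slice_frac:.1%}")  # ~10%
print(f"fraction of attention computed with M x M tiles:  {block_frac:.1%}")  # ~100%
```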
- The paper is well structured and clearly presented.
- It addresses an important and timely topic: improving the efficiency of diffusion-based video generation models, where attention cost dominates latency.
- The contribution is technically deep, including a fine-grained (128×1) sparse attention kernel implementation that pushes the boundary of GPU efficiency for sparse attention.
- The proposed system–algorithm co-design is well thought out, particularly in reducing sparse-index mask generation overhead by leveraging the similarity of attention maps between diffusion steps.
- Workload balancing across query blocks is unclear: fine-grained sparsity may lead to uneven computation across blocks, potentially limiting scalability.
- The attention mask caching strategy might not generalize well to few-step diffusion settings, where attention patterns can vary more significantly between steps (a generic sketch of such a mask-reuse policy is given after the questions below).
- How does FG-Attn handle load balancing when different query blocks exhibit varying levels of sparsity?
- The paper mentions attention map similarity between diffusion steps to reduce mask generation cost. How robust is this caching strategy in few-step diffusion settings, where the similarity may drop? |
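For reference, the kind of mask-reuse policy the questions above are probing is sketched below. This is a generic refresh-every-k-steps policy with assumed function names (predict_mask, attention_step), not necessarily the paper's exact scheme; in few-step settings the refresh interval shrinks toward 1 and the caching benefit largely disappears.

```python
# Generic sketch (assumed policy and names) of reusing a predicted sparse-attention mask
# across diffusion steps: the mask is recomputed every `refresh_every` steps and reused
# in between, so the mask-prediction cost is amortized over several denoising steps.
def run_denoising(latents, num_steps, predict_mask, attention_step, refresh_every=5):
    """predict_mask(latents) -> sparse mask; attention_step(latents, mask) -> latents."""
    cached_mask = None
    for step in range(num_steps):
        if cached_mask is None or step % refresh_every == 0:
            cached_mask = predict_mask(latents)   # pay mask-prediction cost only on refresh steps
        latents = attention_step(latents, cached_mask)
    return latents
```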
Fully human-written |
|
FG-ATTN: LEVERAGING FINE-GRAINED SPARSITY IN DIFFUSION TRANSFORMERS |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes FG-Attn, a fine-grained attention mechanism for efficient video diffusion transformers. Instead of using coarse block-sparse attention, the method divides the attention computation into smaller units and reorganizes tokens to improve compute efficiency. It also introduces a mean-query aggregation step to reduce redundant attention computation among neighboring tokens. Experiments on large video models show some speedup at the attention-operator level, and the paper claims around 1.5× acceleration without retraining.
- The topic is relevant and important for scaling video transformers efficiently.
- The paper provides a clean implementation idea for fine-grained attention.
- The writing is overall understandable, and the motivation for improving attention sparsity is reasonable.
**1. The paper mainly introduces two technical ideas, both of which are reasonable but lack some novelty.**
- The fine-grained attention implementation can be simplified: applying a token permutation to the key matrix is enough to achieve the same result, so the proposed gather-load process is unnecessarily complicated. The fine-grained attention idea is also not new. SVG2 [1] already proposed sparse attention that is more fine-grained than FG-Attn, using token reordering to better utilize Tensor Cores, and SpargeAttn [2] also applied token reordering to improve sparsity in block-sparse masks.
- The observation that adjacent tokens usually have similar responses and can therefore share a mean query (q_mean) is reasonable but not new. SpargeAttn [2] already reported this observation, and several other methods, such as MInference [3] and SeerAttention [4], have used mean pooling combined with thresholding to predict sparse attention masks (a minimal sketch of this pooling-plus-threshold pattern is given after the reference list below). Additionally, pooling all blocks of Q and then multiplying them by all tokens of K incurs significant overhead.
**2. The experiments are incomplete and not convincing.**
- On effectiveness, end-to-end video quality should be compared against baselines, but *not a single baseline is compared*.
- On efficiency, the paper compares speedup results without showing the corresponding video quality. Without evaluating end-to-end acceleration while maintaining similar quality (or at least reporting the quality of each method), the claimed improvement is not meaningful. Reporting speedup alone is misleading, since higher speed can easily come at the cost of lower generation quality.
3. The work does not compare against strong and relevant baselines such as SVG2 [1] or SpargeAttn [2], both of which already achieve high attention speedups while maintaining quality.
4. The abstract should be a single paragraph.
[1] Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
[2] SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
[3] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
[4] SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
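For clarity, the pooling-plus-threshold pattern referred to in weakness 1 is sketched below, with illustrative names and an assumed threshold. It reflects the generic SpargeAttn/MInference-style predictor described above, not necessarily FG-Attn's exact strategy; the dense q_mean × K score map in the sketch is also the overhead flagged in that weakness.

```python
# Minimal sketch (illustrative names, assumed threshold) of pooling-plus-threshold mask
# prediction: each block of queries is mean-pooled into a single q_mean vector, scored
# against all keys, and keys scoring below the threshold are marked skippable.
# The dense [num_blocks, seq] score map below is the prediction overhead noted above.
import torch

def predict_slice_mask(q, k, block_m=64, threshold=1e-3):
    """q, k: [seq, dim]. Returns bool mask [num_blocks, seq]; True = compute this M x 1 slice."""
    seq, dim = q.shape
    num_blocks = (seq + block_m - 1) // block_m
    q_mean = torch.stack([q[i * block_m:(i + 1) * block_m].mean(dim=0) for i in range(num_blocks)])
    approx = torch.softmax((q_mean @ k.T) * dim ** -0.5, dim=-1)   # approximate attention per block
    return approx >= threshold
```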
Please refer to the weaknesses section for detailed questions and suggestions. |
Lightly AI-edited |