|
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the quadratic computational and memory cost of the attention mechanism in long-context Large Language Models (LLMs), a major bottleneck especially in resource-constrained environments.
Key Contributions:
* Dynamic Segmentation: DHSA first segments the input sequence into variable-length "chunks" based on the content itself. This is more adaptive than using fixed-size blocks.
* Hierarchical Computation: It then builds representations for these chunks using length-normalized aggregation to avoid bias from differing chunk sizes, and computes similarity scores at this coarse, chunk-to-chunk level.
* Token-Level Upsampling: Finally, it upsamples these chunk-level scores to the token level to create an importance map. This map determines which fine-grained token-to-token attention scores are actually computed, preserving only the most impactful ones (a sketch of this pipeline appears below the list).
* Efficient long-context handling: Matches dense attention accuracy while cutting prefill latency by 20–60% and peak memory usage by 35% at 8K context, and scales to 100K context on a single 24 GB GPU (where dense kernels fail).
* Input-adaptive sparsity: Avoids rigid static patterns or heuristics; dynamically predicts attention sparsity via data-driven chunking and similarity, adapting to diverse tasks/inputs.
* Easy integration: Functions as a drop-in module for standard decoder-only Transformers, requiring no retraining or architecture changes to the base LLM.
* Robust chunk representation: Uses length-normalized aggregation to eliminate bias from variable chunk lengths, ensuring reliable similarity estimation.
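To make the pipeline concrete, here is a minimal sketch of my reading of it; the tensor names, the mean-based chunk pooling, the omission of the causal mask and head dimension, and the per-row top-k selection are my own simplifications, not the authors' exact formulation:

```python
import torch

def dhsa_mask_sketch(q, k, boundaries, keep_per_query):
    """Chunk-level scoring upsampled to a token-level sparsity mask.
    `boundaries` holds chunk end indices (exclusive), standing in for
    the output of the learned boundary predictor; causal masking and
    the head dimension are omitted for brevity."""
    T, d = q.shape
    starts = [0] + boundaries[:-1]
    # Chunk representations: a simple length-normalized mean here,
    # as a stand-in for the paper's length-normalized aggregation.
    q_chunks = torch.stack([q[s:e].mean(dim=0) for s, e in zip(starts, boundaries)])
    k_chunks = torch.stack([k[s:e].mean(dim=0) for s, e in zip(starts, boundaries)])

    # Coarse chunk-to-chunk similarity (C x C).
    chunk_scores = q_chunks @ k_chunks.T / d ** 0.5

    # Upsample back to the token level (T x T) by broadcasting each
    # chunk pair's score over its token span.
    token_scores = torch.empty(T, T)
    for qi, (qs, qe) in enumerate(zip(starts, boundaries)):
        for ki, (ks, ke) in enumerate(zip(starts, boundaries)):
            token_scores[qs:qe, ks:ke] = chunk_scores[qi, ki]

    # Keep only the highest-scoring key positions per query row.
    idx = token_scores.topk(keep_per_query, dim=-1).indices
    mask = torch.zeros(T, T, dtype=torch.bool).scatter_(1, idx, True)
    return mask  # True where fine-grained attention is actually computed
```

For example, `dhsa_mask_sketch(torch.randn(16, 64), torch.randn(16, 64), [5, 11, 16], keep_per_query=4)` returns a 16x16 boolean mask with four keys kept per query.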
* Hyperparameter dependence: Its performance relies on hyperparameters such as the number of chunks and the number of preserved keys, whose optimal settings vary across models, tasks, and hardware; the paper does not propose an adaptive strategy for allocating these budgets.
* Boundary predictor constraints: The boundary detector requires training on specific datasets (e.g., Long Data Collections) and may need adjustments for diverse text types, introducing potential generalization gaps.
* Hardware adaptability limitations: While tested on NVIDIA GPUs, its performance on other hardware (e.g., CPUs, edge devices) is not evaluated, raising questions about cross-hardware applicability.
* In the ablation study, DHSA without dynamic chunking degrades to standard block-sparse attention, showing the critical role of dynamic chunking. However, the paper does not compare DHSA with recent advanced dynamic chunking methods (e.g., context-aware adaptive chunking). How does DHSA’s chunking strategy perform relative to these methods in terms of segmentation accuracy and computational efficiency?
* The boundary predictor is trained with soft labels derived from attention scores and a focal BCE loss. If the base LLM itself has biased attention distributions (e.g., over-attending to trivial tokens), will this bias be transferred to the boundary predictor and degrade chunk quality? How could such bias propagation be mitigated? |
Fully AI-generated |
|
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
Due to the quadratic nature of attention, prefill attention becomes the key bottleneck for long-context inference. Existing systems prune tokens based on heuristics or pre-chosen patterns, which limits accuracy. This paper proposes **DHSA**, which trains an MLP to dynamically predict the boundaries of token chunks and uses dot-product similarity to model interactions between chunks. The actual attention is then computed only on highly relevant chunk pairs. DHSA achieves good accuracy and speedups on long-context inference.
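For reference, my rough mental model of the learned chunker is below; the hidden size, input features, and plain BCE objective are my guesses (the paper trains against soft labels with a focal BCE loss), so treat this as a sketch rather than the authors' recipe:

```python
import torch
import torch.nn as nn

class BoundaryPredictor(nn.Module):
    """Tiny MLP scoring, per token, whether a chunk boundary should be
    placed there. Hidden size and input features are assumptions."""
    def __init__(self, d_model, d_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, h):               # h: (T, d_model) hidden states
        return self.net(h).squeeze(-1)  # (T,) boundary logits

# Offline training against soft labels derived from attention statistics;
# a plain BCE-with-logits loss stands in for the paper's focal variant.
pred = BoundaryPredictor(d_model=4096)
opt = torch.optim.Adam(pred.parameters(), lr=1e-4)
h, soft_labels = torch.randn(1024, 4096), torch.rand(1024)  # placeholder data
loss = nn.functional.binary_cross_entropy_with_logits(pred(h), soft_labels)
loss.backward(); opt.step()
```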
- The dynamic partitioning of tokens is a novel and effective idea.
- The accuracy evaluation results look promising.
- DHSA requires training an MLP layer to predict chunk boundaries, making it harder to deploy than existing methods.
- The efficiency evaluation is not very comprehensive.
Thanks for submitting to ICLR 2026. The paper introduces an interesting idea of dynamically partitioning sequences to group similar tokens into the same chunk. However, I still have some concerns about the paper.
- Firstly, since DHSA involves training, it would be more fair to compare it with other methods that also train a small predictor, such as DSA or SeerAttention. These should provide much stronger performance than the current baselines. Additionally, DuoAttention may also be a good baseline, especially at 12.5% or 25% sparsity.
- Moreover, the upsampling process seems to conflict with the premise of the partitioning: theoretically, similar tokens should already be grouped together, so we should expect clear boundaries between chunks.
- Additionally, during MLP training, it is unclear what \( f_{MHA} \) represents. Which **Q** vector is being used in this computation?
- Regarding efficiency, the evaluation is based on PyTorch implementations. FlashAttention might be a better baseline for fair comparison. It is also unclear how to efficiently implement block-sparse attention given the dynamic chunk sizes.
- For inference, since DHSA treats all newly generated tokens as a single chunk, what happens in long-generation tasks? Will this chunk grow too large and degrade performance? |
Lightly AI-edited |
|
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Dynamic Hierarchical Sparse Attention (DHSA), an inference-time, drop-in sparse attention module for decoder-only LLMs. DHSA first dynamically chunks the sequence via a learned boundary predictor, then builds length-normalized chunk representations, computes chunk-chunk similarities, upsamples these scores back to the token level, and finally selects the top-Nb token interactions for each query. The method targets on-device settings and reports LongBench accuracy competitive with dense attention at lower latency.
1. The paper tackles the important problem of improving the efficiency of long-context LLM inference by leveraging sparsity in attention.
2. A clear hierarchical routing formulation with a concrete sparse attention pipeline. The design and implementation details are explained well.
3. The paper reports accuracy improvements over existing static sparse attention baselines and lower latency than full dense attention.
1. Missing comparisons to more recent dynamic sparse baselines; the current baselines mostly rely on static patterns or fixed templates.
2. Missing an upper-bound analysis against an oracle top-k baseline showing how close the selected tokens are to the optimal choice, and missing latency comparisons with baselines other than dense attention.
3. Still not clear why dynamic chunking is needed if there is an accurate way to estimate the contribution of each chunk to the overall attention.
4. Not clear how the system performs under batching settings.
The paper provides a comprehensive analysis and evaluation against static sparse attention baselines, including StreamingLLM, MInference and Block Sparse, but misses important dynamic sparse attention baselines. For example, [MagicPig](https://arxiv.org/abs/2410.16179) uses LSH sampling to select tokens for attention computation dynamically. [Quest](https://arxiv.org/abs/2406.10774) exploits query-aware sparsity that keeps track of minimal and maximal key values in the KV cache and estimates importance based on queries. Without a comparison with these state-of-the-art sparse attention baselines, it is hard to fully evaluate the benefits of the proposed approach.
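To make the Quest comparison point concrete, the kind of query-aware estimate it relies on is cheap to state; the following is my sketch of that baseline's scoring (page layout and names are assumptions), not of DHSA:

```python
import torch

def quest_page_upper_bound(q, k_pages):
    """Quest-style bound on q.k per page: keep element-wise min/max of
    the keys in each page, then bound the dot product per channel.
    q: (d,); k_pages: (num_pages, page_size, d)."""
    k_min = k_pages.min(dim=1).values   # (num_pages, d)
    k_max = k_pages.max(dim=1).values   # (num_pages, d)
    # Per channel, the largest possible contribution is
    # max(q_i * k_min_i, q_i * k_max_i); summing channels bounds q.k.
    return torch.maximum(q * k_min, q * k_max).sum(dim=-1)  # (num_pages,)

# Pages with the highest bounds would then receive exact attention.
```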
It is not clear why dynamic chunking is needed, even though an ablation is provided. The ablation shows cases of DHSA without robust chunk representation and without dynamic and robust chunk representation. However, to demonstrate that dynamic chunking is indeed needed, it should further evaluate the case of DHSA with robust chunk representation and without dynamic chunking. Robust chunk representation is a normalized prefix sum for queries and keys in the chunk and should work independently of the chunk size selected. Also, I can imagine there are other ways to estimate the chunk similarity, for example, based on different clustering methods. However, the paper does not provide an evaluation of them other than the normalized prefix sum one.
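To spell out why the robust chunk representation should be independent of the chunking, here is a sketch of the prefix-sum construction as I understand it (a mean over each chunk; indices and names are my own):

```python
import torch

def chunk_means_from_prefix_sums(x, boundaries):
    """Length-normalized chunk representations via prefix sums: once
    the cumulative sum over tokens is built, any candidate chunking
    can be evaluated in O(1) per chunk.
    x: (T, d); boundaries: chunk end indices (exclusive)."""
    csum = torch.cumsum(x, dim=0)                                   # (T, d)
    csum = torch.cat([torch.zeros(1, x.shape[1], dtype=x.dtype), csum])
    ends = torch.tensor(boundaries)
    starts = torch.cat([torch.tensor([0]), ends[:-1]])
    lengths = (ends - starts).unsqueeze(-1).to(x.dtype)
    return (csum[ends] - csum[starts]) / lengths                    # (C, d)
```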
There are also some evaluations missing in the paper. For example, it should provide an upper-bound analysis compared with the oracle top-k baseline on the number of tokens selected. In addition, performance numbers for the batching scenario are not evaluated.
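The upper-bound analysis I have in mind is cheap to run at small scale; something like the following recall-against-oracle measurement (my own formulation of the metric, not from the paper):

```python
import torch

def recall_vs_oracle_topk(attn_scores, selected_mask, k):
    """Per-query recall of the method's selected keys against the
    oracle top-k keys under the full dense attention scores.
    attn_scores: (T, T) dense scores; selected_mask: (T, T) bool."""
    oracle_idx = attn_scores.topk(k, dim=-1).indices              # (T, k)
    hits = selected_mask.gather(1, oracle_idx).float().sum(-1)    # (T,)
    return (hits / k).mean().item()
```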
1. How does DHSA perform in terms of accuracy compared to dynamic sparse attention baselines under the same sparsity setup?
2. Can you give some intuition for why the boundary predictor is designed this way? For example, why do the left and right windows not overlap?
3. Can you show a comparison of DHSA with robust chunk representation and without dynamic chunking as an ablation? Have you evaluated other methods that could estimate the chunk similarity other than the current approach?
4. Can you provide the evaluation performance of DHSA under the batching scenario?
5. How are the ratios and hyperparameters of the baselines selected? Can you provide a latency comparison with the baselines? Can you provide the number of tokens selected in the oracle (optimal) case?
6. In Table 2, under sparsity = 25%, how can DHSA outperform dense attention on LongBench Synth by such a large margin?
7. In Table 3, why is more memory needed for DHSA? Is it for storing the model weights of the boundary predictor? |
Fully human-written |
|
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Dynamic Hierarchical Sparse Attention (DHSA), a mechanism that makes long-context inference efficient for large language models, especially when deployed on devices with limited memory and compute. DHSA dynamically detects hierarchical attention boundaries and prunes redundant computation, adapting sparsity at inference time based on token relevance. Unlike fixed-sparsity or static compression approaches, it introduces a multi-level adaptive mechanism that balances local and global context retention. The method maintains accuracy close to dense attention while significantly reducing latency and memory use on benchmarks such as LongBench and Needle-in-a-Haystack. The contribution is practical and well aligned with the need for efficient, scalable, on-device LLM deployment, showing that dynamic hierarchical sparsity can enable longer-context reasoning without sacrificing model performance.
This paper introduces a technically elegant and well-motivated solution to one of the most critical bottlenecks in modern LLMs: efficient long-context inference. The proposed DHSA framework combines dynamic boundary detection with hierarchical sparsity prediction, achieving strong accuracy-efficiency trade-offs on tasks such as LongBench and Needle-in-a-Haystack. Its design as a training-free, drop-in module makes it readily applicable to on-device deployment. The empirical results show consistent latency and memory gains while maintaining dense-attention-level accuracy. The presentation is thorough, with sound motivation, clear algorithmic exposition, and reproducible implementation details.
The contribution is incremental relative to the recent dynamic sparsity and KV compression literature (e.g., MInference, H2O, PyramidKV), and there is limited theoretical grounding for why hierarchical chunking should yield near-optimal sparsity prediction.
The dependency on hyperparameter tuning for chunk size and sparsity budgets limits generalizability across architectures and devices.
The method’s scalability beyond 100K context is mentioned but needs to be empirically validated. The experimental evaluation could be broadened with larger models or real-world application benchmarks (e.g., RAG or document retrieval tasks).
1. How does the method ensure that important global information isn’t lost when dynamically pruning attention? Can the authors show examples or quantitative evidence that key tokens are always retained?
2. The paper claims DHSA works well for on-device inference. Can the authors provide more details on the actual hardware setup or latency improvements in real deployment, not just simulated benchmarks?
3. DHSA is presented as requiring “no retraining,” but the boundary predictor is trained offline. Although the predictor is lightweight and does not touch the base model weights, the approach is not literally zero-learning; can the authors clarify this claim? |
Fully AI-generated |