SparCas: A Dimension-First Cascade for Efficient Long-Context LLM Inference
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Long-context decoding is challenging due to increased KV-cache pressure. Existing sparse attention methods either estimate token importance at a coarse group level, risking missed salient tokens, or incur high computational overhead. This paper proposes **SparCas**, which first extracts important ranks (dimensions) from the K vectors and performs partial attention with the current Q vector. Using these partial attention scores, SparCas filters important tokens and then computes full attention over this reduced set. SparCas reports strong results on long-context benchmarks and claims implementation efficiency.
- SparCas estimates the importance of each K vector and achieves strong accuracy on long-context tasks.
- The evaluation shows speedups over full attention in several settings.
- It is unclear whether SparCas achieves a balanced accuracy–speed trade-off across regimes.
- The core idea—using partial ranks from K to estimate importance—appears similar to prior work.
Thanks for submitting to ICLR 2026. The paper introduces an interesting approach that leverages partial ranks from K vectors to prioritize tokens. However, I have several concerns:
1. **Efficiency claims vs. measurements.**
The paper claims SparCas is both accurate and efficient, yet the evaluation mainly shows accuracy improvements over Quest and parity (or slightly worse) vs. MagicDec, without a head-to-head **performance** comparison against those baselines. In Figure 4, the partial-score stage appears to add notable overhead and seems slower than Quest, which weakens the efficiency claim. Moreover, the performance study uses a non-GQA model; it remains unclear how SparCas performs with GQA and whether the method handles grouped queries efficiently.
2. **Limited gains at short contexts.**
Speedups at 8K context are limited. Would larger batches of 8K-context requests improve the speedup, and if so, by how much?
3. **Novelty relative to InfiniGen.**
The key idea, namely using partial ranks from the K vectors to estimate token importance, appears to have already been proposed in InfiniGen (page 7). Could the authors clarify the differences between this work and InfiniGen?
4. **Outlier-sensitive ranking.**
Since SparCas uses QK values to select “important ranks,” could outliers in the K values of the other (non-selected) ranks also make the full QK product large? A small hypothetical example of this concern is sketched below.
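To make the concern concrete, here is a toy numeric sketch (the numbers are purely illustrative and not from the paper): a key with an outlier value in a dimension excluded from the partial score can have a large full QK product that the partial score never sees.

```python
import numpy as np

# Hypothetical 4-dimensional head; the two "important ranks" are chosen by |q|.
q = np.array([2.0, 1.5, 0.1, 0.05])
selected = np.argsort(-np.abs(q))[:2]           # dims 0 and 1

k_a = np.array([0.8, 0.6, 0.1, 0.0])            # ordinary key
k_b = np.array([0.1, 0.1, 30.0, 0.0])           # outlier in a pruned dim

partial = lambda k: q[selected] @ k[selected]   # what the cascade ranks by
full = lambda k: q @ k                          # what attention actually computes

print(partial(k_a), full(k_a))  # 2.50 vs 2.51 -> ranking preserved
print(partial(k_b), full(k_b))  # 0.35 vs 3.35 -> the pruned-dim outlier flips the ranking
```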
Lightly AI-edited
SparCas: A Dimension-First Cascade for Efficient Long-Context LLM Inference
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper identifies the KV cache, and specifically the memory bandwidth required to read it, as the primary bottleneck for long-context LLM inference. It introduces Sparsity Cascade (SparCas), a "dimension-first" selection method to mitigate this. The method is based on the observation that token importance ranking is stable even when pruning key dimensions. SparCas uses a "prune-in-prune" design: (1) it first prunes non-critical *dimensions* using a lightweight, query-only heuristic ($|q_t|$) to create a compact key slice, and (2) it then uses this compact slice to efficiently compute partial scores and select the top-$T_b$ *tokens* for the final, full-dimension attention computation. The authors evaluate this on PG-19, LongBench, and RULER, claiming to match dense attention accuracy with significant speedups.
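For concreteness, here is a minimal PyTorch-style sketch of the prune-in-prune step as this reviewer understands it from the summary above; the budget names $D_b$ and $T_b$ follow the paper, but the function name, variable names, shapes, and defaults are assumptions, not the authors' implementation.

```python
import torch

def sparcas_step_sketch(q_t, K, V, D_b=32, T_b=256):
    """One decode step of the dimension-first cascade (single head).
    A reviewer's sketch of the described pipeline, not the authors' kernel."""
    # Stage 1: keep the D_b key dimensions with the largest |q_t| (query-only heuristic).
    dims = torch.topk(q_t.abs(), k=D_b).indices                 # [D_b]

    # Stage 2: partial scores on the compact key slice, then select the top-T_b tokens.
    partial_scores = K[:, dims] @ q_t[dims]                     # [num_tokens]
    tokens = torch.topk(partial_scores, k=min(T_b, K.shape[0])).indices

    # Final attention uses full-dimension K/V, but only over the selected tokens.
    scores = (K[tokens] @ q_t) / (q_t.shape[0] ** 0.5)          # [T_b]
    weights = torch.softmax(scores, dim=0)
    return weights @ V[tokens]                                  # [head_dim]

# Toy usage: 4096 cached tokens, head_dim 128.
K, V, q_t = torch.randn(4096, 128), torch.randn(4096, 128), torch.randn(128)
out = sparcas_step_sketch(q_t, K, V)
```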
* **Clarity and Simplicity:** The paper is clearly written, and the proposed method is simple and intuitive.
* **Performance on Long-Prefill Tasks:** The efficiency gains on the specific benchmarks tested (long-prefill tasks like LongBench and RULER) are well-documented and show a clear speedup over the dense baseline in that context.
* **Good Ablation Studies:** The ablation studies on the dimension budget ($D_b$) and update interval ($U$) are useful for understanding the method's parameters and confirming the core observation about dimensional sparsity.
1. **Incremental Contribution:** The primary weakness is the paper's lack of novelty. The core methodological contribution, a "dimension-first" cascade that prunes dimensions using query sparsity (`|q_t|`) and then selects tokens, appears to be a reimplementation of the central idea already presented in SparQ Attention. The paper fails to sufficiently differentiate itself from this prior work, making its own contributions feel highly incremental.
2. **Critically Missing Workload Evaluation (Long-Decoding):** The paper's evaluation is entirely focused on tasks with long *prompts* (e.g., LongBench, RULER). It completely omits what is arguably a more pressing bottleneck: **long-generation scenarios** (e.g., long chain-of-thought reasoning, long-form content creation) where the prompt is short but the generated output is very long.
3. **Ignores Key Bottleneck:** In these long-decoding workloads, the KV cache grows with *generation*, and the performance of the selection mechanism at each decoding step is paramount. This is a primary bottleneck for current LLMs, and the paper provides no data on how SparCas performs here. This is a major omission that undermines the paper's claims of solving "the" inference bottleneck.
4. **Narrow Model and Task Selection:** To validate the method for complex, long-running tasks, the authors should have included experiments on models and tasks specifically designed for reasoning (e.g., DeepSeek-R1, Qwen3) on benchmarks that require long-chain reasoning. This would be necessary to validate the method's effectiveness for the critical, yet missing, long-decoding workload.
1. Can the authors explicitly detail the novel contributions of SparCas that are not already present in SparQ Attention? The core mechanism seems identical.
2. Why did the authors choose to exclusively evaluate on long-prefill benchmarks (LongBench, RULER) and omit long-generation (long-decode) workloads, such as chain-of-thought reasoning tasks?
3. Can the authors provide *any* data on how SparCas performs in a long-decoding scenario (e.g., perplexity or accuracy on a task requiring 8K+ generated tokens)? This is a critical missing piece of the evaluation.
4. Given that the method relies on a simple heuristic ($|q_t|$), how can we be sure this heuristic holds during complex, multi-step reasoning where token importance might be more nuanced than in the retrieval tasks tested?
Fully AI-generated
SparCas: A Dimension-First Cascade for Efficient Long-Context LLM Inference
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper proposes Sparsity Cascade (SparCas), a dimension-first cascade method for KV cache selection. SparCas first reduces the dimensions of the key matrix and converts it into a compact key matrix, with the help of a heuristic based on the query vector. Then, SparCas uses the compact key matrix to compute partial attention scores over all past tokens. Based on these scores, the top tokens are selected, and the attention output is computed using the full-dimensional key and value vectors corresponding to this small subset of tokens. The main insight of SparCas is that reducing the dimensions of the key matrix maintains the relative importance of the tokens while reducing the memory bandwidth required for computing attention. The dimension selection itself is cheap because the heuristic only uses the current query vector and does not require access to the full KV cache. With this, SparCas outperforms Quest and achieves accuracy close to that of the full cache while using less than 1% of tokens at a 32K-token context. Furthermore, SparCas delivers up to 3x faster self-attention and 1.64x end-to-end speedups compared to full attention.
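The stability insight is straightforward to sanity-check. Below is a small hypothetical probe (the function name, budgets, and random-tensor usage are this reviewer's assumptions; real K and Q tensors from a long-context model would be needed for a meaningful number) that measures how much the top-k token set changes when scores are computed on a reduced set of key dimensions.

```python
import torch

def topk_overlap(q, K, d_keep=32, k=64):
    """Fraction of the full-score top-k tokens that the reduced-dimension
    (partial) score also ranks in its top-k: a rough proxy for rank stability."""
    dims = torch.topk(q.abs(), k=d_keep).indices
    full_top = set(torch.topk(K @ q, k=k).indices.tolist())
    partial_top = set(torch.topk(K[:, dims] @ q[dims], k=k).indices.tolist())
    return len(full_top & partial_top) / k

# Toy run on random tensors; real K/Q from a model is what actually matters.
K, q = torch.randn(8192, 128), torch.randn(128)
print(topk_overlap(q, K))
```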
+ The paper tackles an important problem: GPU memory constraints for long-context LLM inference.
+ The resulting performance of SparCas is impressive, with a notable reduction in the number of tokens used at high accuracy and a corresponding speedup.
+ The efficiency evaluation is nice, with the breakdown of latency into individual kernel operations – this helps in understanding the efficiency of the various steps involved in SparCas.
- The paper does not do a great job of explaining the intuition behind the dimension reduction and the heuristic. I would have liked to see some explanation of why the current query vector is sufficient for identifying the important key dimensions.
- I would have liked to see a more thorough sensitivity study of the update gap (U) across different models and architectures, and a discussion of how to set this parameter – should we always set it globally to 64? Is that good enough across models and architectures?
- There is also another configuration parameter, the number of dimensions to keep (Du) – I would again like some explanation of how to configure it and a more thorough sensitivity analysis for it.
- Unless I missed it, I did not see a comparison with hashing-based selection methods; I only saw a comparison with Quest. Is Quest the only alternative for KV cache selection? It would be good to compare with more baselines beyond Quest and full attention.
- Please try to answer as many questions as possible from the weakness section.
- Can you compare quantitatively or qualitatively with other directly related baselines such as AdaKV (Feng et al., 2024) and PyramidKV (Cai et al., 2024)?
- Can you provide a more thorough sensitivity study for the configuration parameters to set and why it is easy to set them for users? |
Fully human-written
SparCas: A Dimension-First Cascade for Efficient Long-Context LLM Inference
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces SPARCAS, a training-free, cascaded KV-cache selection method for long-context LLM inference. It exploits intra-token sparsity by pruning low-magnitude key dimensions via a lightweight query-only heuristic, then applies cross-token sparsity by computing partial scores in the reduced subspace to select the top-K tokens. Integrated with FlashInfer, SPARCAS achieves near-dense accuracy on PG-19, LongBench, and RULER while delivering up to ~1.6x end-to-end speedups on long sequences.
1. Simple but effective heuristic: the design focuses on reducing HBM bandwidth, the true bottleneck in long-context inference. The latency breakdown clearly isolates the wins from the partial-score GEMM and the sparse attention.
2. Good low-budget performance on RULER compared to Quest: SPARCAS maintains high accuracy even at extreme compression, outperforming Quest dramatically.
3. Periodic dimension updates reduce overhead without noticeable quality degradation.
1. Layer/task dependence. The paper applies SPARCAS only from layer 3 onward but does not quantify how attention concentration, or the optimal dimension set, varies across layers or tasks.
2. Several important method families, such as ShadowKV, SnapKV, and H2O, are absent from the experiments; while they are discussed conceptually, the empirical comparison is narrow, mostly against Quest and MagicPIG.
3. Tensor-parallel integration. The paper does not explain how Dt is synchronized across TP shards; this is an important omission for multi-GPU inference.
1. Did the authors observe variation in the kurtosis or contribution mass per layer? Is a uniform dimension-set size suboptimal?
2. Does pruning affect rotated dimensions differently at long context lengths, where positional phases diverge? How does the dimension selection interact with RoPE?
3. Should dimension importance be head-specific rather than shared? Did the authors observe clusters of heads with different dominant dimensions?
4. How robust is dimension importance across prompts (or task types)?
5. Is the cascaded pruning compatible with blockwise or paging compression (PyramidKV / Quest)? |
Moderately AI-edited |