|
KVCache-Centric Memory for LLM Agents |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes MemArt, a new KV-cache-centric memory paradigm for LLM agents that replaces plaintext memory with direct reuse of latent KV-cache blocks. Instead of re-feeding retrieved text into prompts, MemArt stores and retrieves prior computation directly in latent space, which dramatically improves both accuracy and efficiency. Specifically, keys are compressed into a per-block bounding box; relevance is then computed as attention over KV blocks, with per-token normalization and aggregation across the query tokens. Finally, the retrieved KV blocks are appended after re-injecting positional indices, and decoding starts from there.
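For concreteness, here is a minimal sketch of how I read the retrieval step (the shapes, the softmax-style normalization, and the use of a single representative vector per block are my assumptions; the paper's exact formulation may differ):

```python
import numpy as np

def retrieve_blocks(query_keys, block_summaries, top_k=4):
    """Score stored KV blocks against all query tokens and pick top-k.

    query_keys:      (n_query_tokens, d) query vectors of the current prompt
    block_summaries: (n_blocks, d) one representative vector per stored KV block
                     (e.g., derived from the AABB min/max bounds)
    """
    # Per-token relevance of every block: (n_query_tokens, n_blocks)
    scores = query_keys @ block_summaries.T

    # Normalize each token's scores across blocks (softmax-style), so that
    # no single query token dominates the aggregation.
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    scores = scores / scores.sum(axis=1, keepdims=True)

    # Aggregate across query tokens (sum variant; a max variant is also possible).
    block_scores = scores.sum(axis=0)

    # Select top-k blocks, then keep them in chronological order for reuse.
    selected = np.argsort(-block_scores)[:top_k]
    return np.sort(selected)
```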
* The proposed KV-cache-centric memory paradigm directly reuses the KV computed during prefill, which reduces computational overhead.
* The proposed multi-token aggregation alleviates retrieval overhead by reducing the number of index entries.
* Their proposed decoupled positional encoding practically resolves the positional-mismatch issue that arises when cached KV is reused at new positions.
* While the proposed method achieves higher accuracy and lower latency, it inevitably involves an ever-growing memory footprint that may cause storage issues (a rough back-of-envelope estimate is sketched after this list). This is due to two design choices: 1) the KV cache is stored as floating-point tensors, which scale much faster than plaintext; 2) the memory grows linearly with no upper bound on its size.
* Another drawback of the KV-cache paradigm is limited generalization across models. Memory sharing becomes especially important in multi-agent systems, where one model needs to understand another model’s memory; since cached KV is model-specific, this limits the scope of the paper.
* There appears to be no experimental comparison with the KV-cache compression literature. I have listed several representative works below for reference.
1. _H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models_
2. _SnapKV: LLM Knows What You are Looking for Before Generation_
3. _A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression_
* Finally, the model sizes experimented with are limited and lie primarily around 7/8B. I believe the work would benefit from validation on larger-scale models, such as 32B dense or MoE models.
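As promised above, a rough back-of-envelope estimate of the storage concern (the layer/head/dim figures are the public LLaMA-3.1-8B configuration; fp16 storage and ~4 bytes per plaintext token are my assumptions):

```python
# Rough KV-cache footprint per token for an 8B-class model
# (LLaMA-3.1-8B config: 32 layers, 8 KV heads via GQA, head_dim 128;
#  fp16 storage assumed).
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(per_token_bytes)        # 131072 bytes, i.e. ~128 KiB per token
print(per_token_bytes / 4)    # vs. ~4 bytes per plaintext token: ~32768x larger
```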
* How would you differentiate your work from the KV-cache compression literature? |
Fully human-written |
|
KVCache-Centric Memory for LLM Agents |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes MemArt, a memory system that stores and retrieves past context directly in KV-cache space instead of plaintext. It introduces: (i) AABB-based key compression to index each KV block with min-max vectors; (ii) multi-token aggregation retrieval that scores blocks using normalized per-token relevance and then aggregates across tokens; and (iii) decoupled positional encoding that strips RoPE at storage time and re-embeds positions at reuse time to avoid positional mismatch.
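A minimal sketch of my reading of (iii), assuming a standard RoPE formulation and that keys are stored position-free; function and variable names are mine, not the paper's:

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Apply a standard RoPE rotation to x of shape (seq, dim) at the given positions."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Storage time: keep keys WITHOUT RoPE (position-free); values are unchanged.
# Reuse time: place the retrieved block at fresh positions inside the current
# window and only then apply RoPE, so no stale positional offset is carried over.
stored_keys = torch.randn(16, 128)        # one retrieved block, stored without RoPE
new_positions = torch.arange(0, 16)       # positions assigned at reuse time
keys_for_attention = rope_rotate(stored_keys, new_positions)
```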
MemArt’s KV-native retrieval aligns with the attention mechanism and removes prompt concatenation, which can avoid retrieval drift and preserve prefix-caching efficiency. The AABB compression is simple and allows a fast coarse filter before fine attention. The multi-token aggregation is well-motivated and the ablations (Softmax vs reciprocal-rank; Sum vs Max; block size) help isolate what matters. The decoupled positional encoding is clearly described and addresses long-context reuse failure modes.
1. Model coverage is limited for a 2025–2026 submission. Results are only on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct, with no newer families and no size sweep to show scaling trends.
2. Baseline breadth is narrow. The method is compared to plaintext memory systems (Mem0, Zep), but there is no head-to-head with cache-centric and dynamic sparse attention systems that also select KV blocks (for example Arkvale, InfLLM, Quest, NSA), even though they are discussed.
3. Scope of datasets is narrow. All main results are on LoCoMo; there is no test on other long-horizon agent traces.
Because MemArt stores and retrieves latent KV-cache tensors instead of text, the retrieved memory is not human-interpretable. This makes it difficult to verify what information is actually being recalled or whether retrieval errors occur. Can you provide any mechanism to improve interpretability — for example, storing lightweight metadata (token spans, summaries, or embeddings) alongside each KV block, or decoding retrieved KV tensors back into approximate text via the model’s unembedding layer? Additionally, can you report any qualitative analysis showing that the retrieved memories correspond to semantically relevant parts of the dialogue? Without such transparency, it is hard to assess whether MemArt retrieves correct information or merely benefits from implicit correlations. |
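To make the interpretability suggestion concrete, even a lightweight record like the following, kept next to each stored block, would help auditing (this structure is purely illustrative, not something proposed in the paper):

```python
from dataclasses import dataclass

import torch


@dataclass
class MemoryBlock:
    """A stored KV block plus lightweight, human-readable provenance."""
    keys: torch.Tensor            # (block_len, n_kv_heads, head_dim), position-free
    values: torch.Tensor          # (block_len, n_kv_heads, head_dim)
    token_span: tuple[int, int]   # offsets into the original dialogue
    source_text: str              # the plaintext the block was computed from
    summary: str = ""             # optional one-line summary for audit logs
```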
Heavily AI-edited |
|
KVCache-Centric Memory for LLM Agents |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
MemArt reframes agent memory as KV-cache-centric rather than plaintext. The paper shows how MemArt stores past turns as reusable KV blocks and retrieves them by computing attention in latent space, which avoids retrieval drift and preserves prefix-caching benefits.
The system comprises (a) AABB-based key compression for each fixed-size block, (b) multi-token aggregation retrieval that scores blocks against all query tokens, (c) decoupled positional encoding that re-embeds retrieved KV without stale RoPE offsets, and (d) a managed memory pool. Compression represents each block with coordinate-wise minima and maxima, enabling fast coarse filtering. Notably, the relevance for a single token is upper-bounded via the dot-product with those bounds. For multi-token prompts, scores are first normalized per token across all blocks and then aggregated to select top-k blocks in chronological order. Retrieved blocks are concatenated with the current KV and re-encoded with unified positions, ensuring coherent attention within the current window without exceeding native limits.
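For reference, the coarse filter reduces, per block, to an elementwise upper bound on the query-key dot product; below is a minimal sketch under my reading of the AABB description (variable names and shapes are mine):

```python
import numpy as np

def aabb_upper_bound(q, k_min, k_max):
    """Upper bound on q . k for any key k with k_min <= k <= k_max (elementwise).

    Per coordinate, the largest contribution is q_d * k_max_d when q_d >= 0 and
    q_d * k_min_d otherwise, so the bound is a single vectorized max and sum.
    """
    return np.maximum(q * k_min, q * k_max).sum()

# Build a block's AABB from its stored (position-free) keys, then filter coarsely.
block_keys = np.random.randn(16, 128)               # one KV block's key vectors
k_min, k_max = block_keys.min(axis=0), block_keys.max(axis=0)
q = np.random.randn(128)                            # a single query token
exact = block_keys @ q
assert aabb_upper_bound(q, k_min, k_max) >= exact.max()   # the bound always holds
```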
MemArt's design is quite interesting. It reframes memory to be KV-cache-centric with latent-space retrieval, decoupled positional encoding, and lightweight AABB key compression with multi-token aggregation. This yields a model-agnostic, plug-and-play system.
System-wise, memory-pool I/O can add non-trivial latency, and safe reuse critically depends on the decoupled positional re-embedding; without it, long-context behavior can degrade badly.
1. I am curious, what is the precision and recall trade-off of the AABB block filter on adversarial or highly paraphrased queries?
2. What is the end-to-end latency and memory-traffic breakdown (prefill, retrieval, re-embed), and how would specialized KV hardware change the bottlenecks?
3. How does MemArt compare head-to-head with KV pruning strategies like Keyformer and MorphKV under the same memory budget and latency constraints? |
Moderately AI-edited |