ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated | 1 (25%) | 4.00 | 4.00 | 4648 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 2.00 | 5.00 | 3042 |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 2 (50%) | 3.00 | 4.00 | 2042 |
| Total | 4 (100%) | 3.00 | 4.25 | 2944 |
Individual Reviews
FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management

Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper presents FlowKV, a training-free mechanism for managing KV caches in multi-turn conversational LLMs. Unlike standard eviction or compression strategies that repeatedly re-compress historical cache entries, FlowKV introduces a multi-turn isolation mechanism that preserves previously compressed states and only applies compression to the newly generated turn. The method is lightweight, compatible with any existing KV cache compressor, and aims to maintain contextual coherence and preference retention in long dialogues. Extensive experiments on the Multi-IF and PrefEval benchmarks show consistent improvements, e.g., boosting instruction-following rates by 20–40% and preference retention from ~10% to ~75% compared to baselines like SnapKV or ExpectedAttention. FlowKV introduces negligible runtime overhead and is compatible with both LLaMA-3.1-8B and Qwen-2.5-7B models.

Strengths:
1. The proposed isolation mechanism is intuitive and directly addresses the issue of cumulative compression loss in multi-turn LLM interactions.
2. It requires no retraining and can be combined with any existing KV compression method.
3. The figures are informative and greatly aid in understanding the proposed method and experimental results.

Weaknesses:
1. While the method is well-motivated, the theoretical section (Appendix D) remains descriptive rather than analytical. A more formal quantification of “information degradation under repeated compression” would strengthen the contribution.
2. The study primarily focuses on instruction-following and preference tasks. Additional experiments on open-domain dialogue or reasoning datasets (e.g., LongBench, the full SCBench set) would improve the generalization claims. In particular, the latency analysis is conducted with only 8K tokens, which is insufficient to reflect the performance characteristics of modern long-context models.
3. The core mechanism, isolating KV compression by excluding already-compressed states, is a relatively minor modification of prior methods such as ExpectedAttention. Essentially, it changes the compression scheduling from “compress the entire history each turn” to “compress only the new uncompressed portion” (see the sketch after this review), which may be seen as an incremental rather than a conceptual advance.
4. The experimental comparison omits several recent long-context optimization methods that are closely related, even if not multi-turn–specific, including PyramidKV, ArkVale, and PQCache. These techniques also address KV efficiency and contextual preservation and could provide stronger and fairer baselines for FlowKV’s evaluation.
5. The paper lacks ablations isolating each component of FlowKV (e.g., what if isolation occurs every n turns, or partial re-compression is allowed). Such analysis could clarify where most of the gains originate.
6. Some mathematical equations could be refined (e.g., in Eq. 7-9, I cannot find the definition of F). Minor typographical issues persist (e.g., “Mehtods” in the Figure 1 caption, “Futhermore” in Section 2.2).

Questions:
Please check the limitations mentioned above.

EditLens Prediction: Moderately AI-edited
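To make the scheduling difference flagged in weakness 3 concrete, here is a minimal sketch of the two schedules as they are described in the reviews. The code is hypothetical and not taken from the paper: `compress` is a placeholder standing in for any single-shot KV compressor (e.g., SnapKV or ExpectedAttention), and cache segments are modeled as plain Python lists.

```python
# Hypothetical sketch contrasting the two compression schedules.
# `compress` stands in for any KV compressor; segments are opaque lists
# of cached key-value entries.

def compress(segment, ratio=0.5):
    """Placeholder single-shot compressor: keep a fraction of entries."""
    keep = max(1, int(len(segment) * ratio))
    return segment[:keep]

def recompress_schedule(history, new_turn):
    """Baseline scheduling: the whole history is re-compressed every turn,
    so early turns are compressed again and again (cumulative loss)."""
    return compress(history + new_turn)

def isolated_schedule(history, new_turn):
    """Turn-isolated scheduling as described in the review: previously
    compressed segments are left untouched; only the new turn is compressed."""
    return history + compress(new_turn)

if __name__ == "__main__":
    isolated, naive = [], []
    for turn in (["t1a", "t1b"], ["t2a", "t2b"], ["t3a", "t3b"]):
        isolated = isolated_schedule(isolated, turn)   # each segment compressed once
        naive = recompress_schedule(naive, turn)       # early turns shrink repeatedly
    print(len(isolated), len(naive))
```

Over many turns, the first schedule shrinks early-turn entries again and again, while the second compresses each segment exactly once, which is the change to the management process the review characterizes as incremental.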
FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper investigates the degradation of multi-turn conversational coherence in large language models (LLMs) caused by repeated compression of historical key-value (KV) caches. The authors propose FlowKV, a "multi-turn isolation" mechanism that preserves previously compressed KV caches and only compresses new tokens in each conversation turn. The method is training-free, compatible with existing KV compression techniques, and shows performance improvements on the Multi-IF and PrefEval benchmarks.

Strengths:
- Addresses a real and important problem in multi-turn efficiency for LLMs, namely the recursive compression and cumulative information loss across dialogue turns.
- The proposed approach is simple, general, and easy to integrate with existing KV compression methods (e.g., SnapKV, ChunkKV, Expected Attention).

Weaknesses:
1. The claimed "multi-turn isolation mechanism" is essentially a straightforward and obvious engineering adaptation of existing frameworks such as `kvpress` to multi-turn settings. This is a natural and expected implementation choice when extending any prefilling compression method to multi-turn use. The paper does not introduce a new compression function or theoretical principle; instead, it modifies the scheduling of existing operations. Hence, the core novelty is minimal. A **deeper** exploration of how importance-based eviction schemes inherently handle (or fail to handle) the re-compression problem in multi-turn settings would strengthen the paper's positioning and clarify the novelty of the proposed turn-level isolation framework, especially given that FlowKV modifies the management process rather than the compression algorithm itself.
2. The baselines are originally designed for prefilling. When directly reused in multi-turn settings without appropriate adaptation, they naturally suffer from repeated compression. Comparing FlowKV against such unadapted baselines inflates the relative gain and makes the comparison less fair.
3. Only 3 runs are reported, with no error bars or significance tests. PrefEval relies solely on GPT-4o as judge, which may introduce subjectivity and bias. To strengthen the empirical evidence, the paper should include more objective and diagnostic benchmarks, such as "Needle in a Haystack" or long-context retrieval tasks.
4. As depicted in Fig. 3, FlowKV only compresses the system prompt, while keeping the query and response KV uncompressed within that turn. The paper could benefit from a clearer, more comprehensive discussion, since many previous works compress all of them dynamically.
5. The algorithmic specification is incomplete: there is no clear pseudocode or exact buffer layout to ensure “isolation” across turns, and Eq. (12) contradicts the narrative by writing F(C1’) rather than selectively compressing only the latest uncompressed segment.
6. It is ambiguous how “prompt-related” segments are defined and tracked in the KV buffers turn-by-turn (token boundaries, role tags, streaming generation). Technical details needed for integration are missing: positional encoding handling after compression, attention index remapping, cross-layer consistency, chunk boundaries with ChunkKV, and how the kvpress APIs are used to maintain per-turn segments.
7. Limited analysis isolating the claimed cause (recompression) beyond empirical gains; no ablation that compresses “all-but-last-turn” or “compress-every-N-turns” to validate the isolation hypothesis.

Questions:
- Please provide precise pseudocode and a memory layout schematic: how are per-turn KV segments stored, tagged, and selectively compressed without touching older segments?
- Eq. (12) suggests compressing C1’ wholesale. Should it be C2 = [F(KV(Q1 ⊕ R1)) ⊕ F(KV(Psys))] ⊕ KV(Q2) (written out as a display equation after this review)? Clarify the exact operation and which parts are recompressed vs preserved.
- How do you handle positional encodings and attention index remapping after compression for methods that reduce token count (e.g., SnapKV, ChunkKV)?
- How are “prompt-related” tokens identified when responses stream? Do you re-slice after generation or pre-allocate buffers per role?
- What hyperparameters (e.g., window sizes, head selection, chunk sizes) and kvpress settings are used per method, and how is fairness ensured across baselines?
- Can you add ablations comparing (a) compress-only-last-turn vs (b) compress-all-history vs (c) compress-every-N-turns, holding the compression ratio fixed, to directly test the recompression hypothesis?
- Please report statistical significance and judge agreement variability on PrefEval, and test an open-source judge for robustness.
- How does memory evolve over 6–10 turns, including fragmentation or overhead from maintaining multiple preserved segments?

EditLens Prediction: Fully AI-generated
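For readability, the alternative to Eq. (12) proposed in the second question above, written out as a display equation. This is the reviewer's suggested reading, not a formula quoted from the paper; F denotes the underlying compressor and ⊕ concatenation of cache segments, following the notation used in the review.

```latex
C_2 \;=\; \Big[\, F\big(\mathrm{KV}(Q_1 \oplus R_1)\big) \;\oplus\; F\big(\mathrm{KV}(P_{\mathrm{sys}})\big) \,\Big] \;\oplus\; \mathrm{KV}(Q_2)
```

Under this reading, only segments that have not yet been compressed are passed through F, while the newest segment KV(Q2) is left untouched until the following turn.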
FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces FlowKV, a training-free, multi-turn isolation mechanism for Key-Value (KV) cache management in large language models (LLMs). In conventional multi-turn conversations, existing cache compression or eviction strategies suffer from repeated re-compression of early context and hence information degradation and coherence loss; FlowKV instead isolates previously compressed caches and applies compression only to newly generated KV pairs at each turn, preventing cumulative information loss. The authors benchmark FlowKV on the Multi-IF (instruction following) and PrefEval (preference retention) datasets using LLaMA-3.1-8B and Qwen-2.5-7B, showing visible performance gains over the baselines.

Strengths:
S1. This paper tackles an important problem: efficient multi-round conversation.
S2. The approach proposed in this paper is simple and easy to understand.
S3. The experiments were conducted over a range of different baselines.

Weaknesses:
W1. The judge LLM, GPT-4o, is a legacy model, older than the evaluated Llama3.1 and Qwen2.5 models. I would suggest using a newer SOTA model as the judge.
W2. The baseline LLMs are somewhat outdated. Llama3.1 is fine, but Qwen2.5 should be replaced by Qwen3.
W3. Compressing each round individually seems to yield lower compression ratios than re-compressing the entire history, which might be less effective in terms of space savings and memory for conversations with very many rounds.

Questions:
Q1. Can you also evaluate the performance of FlowKV on long-turn (>5) conversation benchmarks, such as SCBench, DialogLM, Conversation Chronicles, or others?
Q2. Can you further explain how turn-specific compression can work together with all the compression techniques mentioned in the paper? For example, StreamingLLM proposes the attention sink; how would FlowKV find and work with the attention sinks in each round?
Q3. One important use case of KV cache compression is not only multi-turn conversation but also long-context or long-generation tasks. Can you discuss whether FlowKV can also be used in these scenarios?
Q4. Can you report the performance of FlowKV on larger models, e.g., at least a medium-sized 32B model?

EditLens Prediction: Fully human-written
FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This submission investigates the problem of KV caching in LLMs for multi-turn contexts. Based on the observation that SOTA approaches compress the earlier parts of the query-response history more often than the later parts, the authors propose a simple modification that ensures all parts are compressed only once. Experiments were performed on the Multi-IF dataset and PrefEval, using a LLaMA and a Qwen LLM.

Strengths:
The topic is timely due to the rise of Agentic AI. The observation that SOTA approaches compress the earlier parts of the query-response history more often than the later parts is fairly obvious but may not yet have been exploited in the literature. The presentation is overall clear, except for some questions listed below.

Weaknesses:
The proposed modification of the SOTA approach (in each step, compress only the parts that have not been compressed) is straightforward. Compressing each part only once seems to increase the required cache size, which needs to be discussed and experimentally evaluated (a rough estimate is sketched after this review). Experiments with other SOTA KV cache methods such as TOVA and KeyDiff would strengthen the argument that the approach of FlowKV generalizes well. The authors performed experiments for prompt length 8192 and output length 4096 (Table 3); experiments with much (10 times?) longer contexts are required to demonstrate the applicability of FlowKV in multi-turn contexts. The Theoretical Analysis in Appendix D does not add value to the paper, essentially just repeating the description of the proposed method.

Questions:
How does FlowKV affect the required cache size? That should be discussed and evaluated experimentally.
How does FlowKV perform for much longer context (prompt + output) lengths?
How does FlowKV perform with SOTA KV cache methods such as TOVA and KeyDiff?
How did you set the hyperparameters of the baselines?
Please provide complete definitions of the evaluation metrics.

EditLens Prediction: Fully human-written
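A back-of-the-envelope version of the cache-size concern raised in the weaknesses above, given only as an illustrative estimate under stated assumptions rather than a result from the paper: suppose turn t retains k_t entries after being compressed once and is never touched again, while re-compressing the whole history every turn can hold the cache to a fixed budget B. Then

```latex
M_{\text{isolated}}(T) \;\approx\; \sum_{t=1}^{T} k_t
\qquad \text{vs.} \qquad
M_{\text{recompress}}(T) \;\approx\; B ,
```

so the memory gap grows with the number of turns T, which is exactly what the review asks the authors to measure experimentally.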