|
Transporting Tokens: Optimal-Transport View of Parallel LLM Decoding |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a novel theoretical and practical framework for accelerating autoregressive decoding in large language models (LLMs) by reframing token generation as a transition between probability distributions governed by optimal transport (OT) theory.
1. The paper elegantly bridges optimal transport theory and autoregressive decoding, formalizing token generation as a structured evolution of probability distributions. The stability analysis of hidden-state transitions (e.g., bounded OT distances) provides a rigorous foundation for multi-step prediction.
2. SHAPE’s plug-and-play design requires no draft model, reducing deployment complexity.
3. The Tree Rejection Sampling algorithm dynamically selects optimal paths with minimal overhead, balancing parallelism and quality.
1. While comparisons with EAGLE and Medusa are included, recent methods such as Lookahead Decoding and Jacobi-iteration-based decoding are absent.
2. The speedup is limited compared to EAGLE-3.
3. The OT theory focuses on single-step transitions, but SHAPE uses $N=3$ steps. A theoretical analysis of error accumulation in multi-step predictions would strengthen the claims.
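For concreteness regarding point 3, here is the kind of multi-step statement that would help (a sketch of an elementary bound, not a result from the paper): if every single-step transition satisfies $W(p_{t+k}, p_{t+k+1}) \le \delta$ for a Wasserstein metric $W$, then the triangle inequality gives

$$ W(p_t, p_{t+N}) \;\le\; \sum_{k=0}^{N-1} W(p_{t+k}, p_{t+k+1}) \;\le\; N\delta, $$

so for $N=3$ the worst-case drift is at most three times the single-step bound. An analogous statement for the entropic case would need extra care, since entropic OT costs do not satisfy the triangle inequality exactly.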
1. How does SHAPE perform on models with >70B parameters or contexts >8K tokens? Are there memory or complexity bottlenecks?
2. How does SHAPE handle distribution shifts between training data (ShareGPT, THUCNews) and unseen domains? Are there robustness guarantees? |
Fully AI-generated |
|
Transporting Tokens: Optimal-Transport View of Parallel LLM Decoding |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper reframes autoregressive decoding as a distributional transition in probability space and argues that the strong temporal similarity of successive hidden states induces a stable (entropic) OT map between next-token distributions. Building on this, the authors propose SHAPE, a plug-in multi-token predictor: (i) a step-ahead hidden-state predictor (with residual + gating), optionally aligned via a low-dimensional OT layer solved by a few Sinkhorn iterations, and (ii) a tree rejection sampling scheme that accepts the longest valid prefix. Experiments on Qwen, Vicuna, LLaMA, and DeepSeek across Alpaca/WikiLingua/MT-Bench and reasoning tasks report up to 5.23× speedups with ~1.2% accuracy deltas.
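To make the reviewed components concrete, a minimal sketch of how I read component (i) and its Sinkhorn-based alignment is given below; the module layout, dimensions, and the choice of uniform marginals are my assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class StepAheadPredictor(nn.Module):
    """Gated residual predictor for the next hidden state (reviewer's sketch, not the authors' code)."""
    def __init__(self, d_model: int, d_ot: int = 64):
        super().__init__()
        self.delta = nn.Linear(d_model, d_model)  # residual update to h_t
        self.gate = nn.Linear(d_model, d_model)   # per-dimension gate in [0, 1]
        self.proj = nn.Linear(d_model, d_ot)      # low-dimensional space for the optional OT alignment

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(h_t))
        return h_t + g * self.delta(h_t)          # h_hat_{t+1} = h_t + g * Delta(h_t)

def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.1, iters: int = 10) -> torch.Tensor:
    """A few Sinkhorn iterations for entropic OT with uniform marginals (the standard algorithm)."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)                 # source marginal
    b = torch.full((m,), 1.0 / m)                 # target marginal
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(iters):                        # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]            # transport plan P = diag(u) K diag(v)
```

Whether the reported speedups already account for these Sinkhorn iterations is exactly the concern raised below.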
This work presents a method that goes beyond draft-model speculative decoding. The paper is mostly well written.
I appreciate the architecture diagrams and the concrete tree rejection sampling algorithm.
The experiments cover a wide range of models and tasks.
Section 2 first presents an empirical study that serves as the motivation for the paper. A hidden-state similarity lower bound $\tau$ is asserted to hold "for the vast majority of steps", but these claims lack rigorous support. From Figure 1, we also observe that $\tau$ can differ substantially across models and tasks. Section 2.2 then introduces optimal transport in a single long paragraph, and the presentation is rather casual. It is recommended to state the formulations in precise mathematical terms.
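As an example of the level of precision I have in mind (this is the standard discrete entropic OT problem, not notation taken from the submission): for distributions $\mu \in \Delta^{n}$, $\nu \in \Delta^{m}$ and a cost matrix $C \in \mathbb{R}^{n \times m}$,

$$ \mathrm{OT}_{\varepsilon}(\mu,\nu) \;=\; \min_{P \in \Pi(\mu,\nu)} \langle P, C \rangle + \varepsilon \sum_{i,j} P_{ij}\bigl(\log P_{ij} - 1\bigr), \qquad \Pi(\mu,\nu) = \{P \ge 0 : P\mathbf{1} = \mu,\ P^{\top}\mathbf{1} = \nu\}. $$

Stating which cost $C$ is used on hidden states, and how the similarity bound $\tau$ enters the argument, would make Section 2.2 much easier to evaluate.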
It is unclear to me how the speedup is calculated. Does it include the cost of solving the optimal transport problem with the Sinkhorn method and the overhead of tree rejection sampling?
I think modeling the hidden-state distribution with optimal transport is a bit ad hoc. Although the semantic similarity suggests a limited change in distribution between consecutive or closely placed hidden states, it remains unclear why optimal transport is a necessary model.
As claimed by the authors: "The observed consistency suggests that transitions between consecutive token distributions are both small and structured." What structure is reflected by the experiment in Figure 1? |
Fully human-written |
|
Transporting Tokens: Optimal-Transport View of Parallel LLM Decoding |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes SHAPE (Step-ahead Hidden-state Accelerated Prediction Engine), a draft-free framework for parallel LLM decoding based on an optimal transport (OT) view of token generation. The authors conceptualize autoregressive decoding as a distributional transition in probability space, where hidden states evolve smoothly and predictably. By learning lightweight OT-guided predictors over hidden-state residuals and using a tree-based rejection sampling mechanism, SHAPE enables multi-token prediction without modifying model weights. Experiments on various models (Qwen, Vicuna, LLaMA, DeepSeek) and tasks (text, code, math) show up to 5.23× speedup with minimal quality loss (<1.2%), establishing a theoretically grounded and practical approach to efficient LLM inference.
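For readers unfamiliar with the acceptance step, a minimal sketch of "accept the longest valid prefix" on a single drafted path follows; the verification rule shown is the standard speculative-sampling test, and treating it as SHAPE's exact criterion (as well as all names below) is my assumption:

```python
import torch

def accept_longest_prefix(draft_tokens, draft_probs, target_probs):
    """Keep the longest prefix of drafted tokens passing the standard
    speculative-sampling test: accept x_i if u < min(1, p_target(x_i) / p_draft(x_i))."""
    accepted = []
    for i, tok in enumerate(draft_tokens):        # draft_tokens: (N,), *_probs: (N, V)
        p = target_probs[i, tok]                  # target-model probability of the drafted token
        q = draft_probs[i, tok].clamp_min(1e-12)  # drafting probability, guarded against zero
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(int(tok))             # token survives verification
        else:
            break                                 # first rejection ends the accepted prefix
    return accepted
```

For a candidate tree, the same rule would be applied along each root-to-leaf path and the deepest surviving prefix kept; the cost of this verification is what the time-breakdown request below concerns.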
- This paper offers an explanation for the temporal smoothness of hidden-state evolution during autoregressive decoding, which is interesting and novel.
- Building on these findings, the authors design SHAPE, a method that leverages optimal transport theory. The formulation is interesting and is supported by theoretical proofs.
- The experiments are comprehensive, using multiple base LLMs on several datasets to demonstrate the effectiveness of the method.
- The clarity of this paper needs to be improved; some key experimental results are not clearly presented. For example, in Figure 1, what does the x-axis represent? On what dataset are the results for each setting obtained? Is the reported similarity an average?
- The method relies heavily on the observed temporal consistency of hidden states. However, this smoothness might not hold in tasks with abrupt semantic shifts (e.g., dialogue transitions, code completion, or multimodal reasoning) or at different decoding positions, which could lead to unstable or misaligned multi-step predictions. Note that this comment relates closely to the first one: the authors should demonstrate that this property generalizes.
- Please use the correct reference format; the current bracketed citation style negatively impacts the reading experience.
- It would be helpful to present a breakdown of decoding time into candidate generation and tree rejection sampling. Since the similarity decreases as $n$ becomes larger, the tree rejection sampling time may also increase correspondingly; readers would be curious about this trade-off.
- What is the decoding configuration in this paper? For example, what decoding temperature is used? Does SHAPE perform equally well when the temperature is adjusted? |
Fully human-written |