Speculative decoding

Speculative decoding is an inference acceleration technique for autoregressive generation models. It reduces latency by having a smaller draft model propose several tokens, which a larger target model then verifies in parallel.

Core Mechanism

  • Drafting: A computationally cheap model generates a sequence of candidate tokens.
  • Parallel Verification: The target model processes all candidates in a single forward pass, evaluating likelihoods simultaneously.
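  • Acceptance/Rejection: Each drafted token is accepted with probability min(1, p/q), where p and q are the probabilities the target and draft models assign to that token. At the first rejection, the remaining drafts are discarded and a replacement token is sampled from a target-derived residual distribution proportional to max(0, p - q).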
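  • Compute Trade-off: Each expensive target forward pass can yield several tokens instead of one, which cuts the number of sequential target calls and speeds up generation. Under the acceptance rule above, the output distribution matches that of the target model, so output quality is not compromised.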
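
The accept/reject step above can be sketched in a few lines. This is a minimal illustration, not any particular engine's implementation: the function name speculative_step is hypothetical, and plain NumPy probability arrays stand in for real draft and target model outputs.

```python
import numpy as np

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """One verification round of speculative sampling (illustrative sketch).

    draft_probs:  (k, V) array, draft distribution q at each drafted position
    target_probs: (k + 1, V) array, target distribution p at the same positions,
                  plus one extra row for the position after the last draft
    draft_tokens: k token ids that were sampled from the draft model
    Returns the tokens emitted this round (between 1 and k + 1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(int(tok))
            continue
        # First rejection: resample from the normalised residual max(0, p - q)
        # and discard every later draft.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(p), p=residual)))
        return out
    # Every draft was accepted: take one bonus token from the target's extra row.
    out.append(int(rng.choice(target_probs.shape[1], p=target_probs[-1])))
    return out

rng = np.random.default_rng(0)

# Draft and target agree exactly, so both drafts are accepted and a bonus token follows.
agree = np.array([[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]])
target = np.vstack([agree, [0.0, 0.0, 1.0]])
all_accepted = speculative_step(agree, target, [0, 1], rng)

# Target gives zero mass to the drafted token, so it is rejected and replaced.
rejected = speculative_step(np.array([[1.0, 0.0, 0.0]]),
                            np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
                            [0], rng)
```

In both toy calls the outcome does not depend on the random draws: identical distributions give an acceptance probability of 1, and a target probability of 0 forces a rejection.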

Key Variants

  • Multi-Token Prediction (MTP): Drafters emit multiple tokens per step, so more tokens can be accepted per verification pass, which raises the achievable speedup.
  • Self-speculative decoding: Uses the target model's own internal states or cached predictions for drafting, so no separate draft model is needed.
  • Early-exit speculative decoding: Drafts tokens from intermediate layers of the target model, using their output as a cheap approximation of the full model.
  • Stacked/N-gram Hybrid: Combines MTP with n-gram lookups to maximize draft acceptance rates; particularly effective in local inference engines such as llama.cpp.
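
The n-gram lookup mentioned in the last variant can be illustrated with a small model-free drafter. This is only a sketch of the general idea, not llama.cpp's implementation: the function ngram_draft and its parameters n and max_draft are hypothetical names chosen for this example.

```python
def ngram_draft(tokens, n=2, max_draft=4):
    """Propose draft tokens by matching the last n tokens against earlier context.

    Finds the most recent earlier occurrence of the trailing n-gram and returns
    up to max_draft tokens that followed it. Returns [] when there is no match.
    """
    if len(tokens) < n:
        return []
    key = tokens[-n:]
    # Scan backwards, starting just before the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + max_draft]
    return []

# The context ends in (1, 2), which also appeared at the start, followed by 3, 4, ...
draft = ngram_draft([1, 2, 3, 4, 1, 2], n=2, max_draft=2)
no_match = ngram_draft([1, 2, 3], n=2)
```

Drafts produced this way cost no model call at all, and the target still verifies them, so a poor guess only wastes part of one verification pass.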

Implementations & Resources