Multi-Token Prediction (MTP)

Multi-Token Prediction (MTP) is a technique in large language model (LLM) inference in which the model predicts several future tokens in one step instead of generating them autoregressively, one at a time. Because the candidate tokens are proposed and checked in parallel, this can reduce latency. It is often integrated with speculative decoding, which verifies the predictions efficiently.

Key Characteristics

  • Parallelism: Predicts a sequence of tokens in a single forward pass.
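  • Verification: Requires a target model to verify the proposed sequence. The target keeps the longest prefix of drafted tokens that passes an acceptance test, such as an exact match under greedy decoding or a probability-based criterion under sampling.
  • Efficiency: Raises throughput (tokens/second) when the acceptance rate is high. Rejected tokens are wasted work, so the gain shrinks as the acceptance rate falls.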
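
The verification and efficiency points above can be illustrated with a minimal sketch. It assumes greedy decoding, where a drafted token is accepted only if it equals the target model's own top choice at that position; sampling-based setups typically use a probabilistic acceptance test instead. The names greedy_verify and expected_tokens_per_step are hypothetical, and the second function assumes every drafted token is accepted independently with the same probability alpha, which is a simplification.

```python
from typing import List


def greedy_verify(draft: List[int], target: List[int]) -> List[int]:
    """Return the tokens emitted by one verification step (greedy case).

    draft:  k token ids proposed by the MTP heads (hypothetical input).
    target: k + 1 token ids the target model picks greedily, one for each
            drafted position plus one position past the end of the draft.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")
    emitted: List[int] = []
    for drafted, wanted in zip(draft, target):
        if drafted != wanted:
            # First mismatch: keep the target model's token and stop.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)
    # Every drafted token matched, so the extra target token is also kept.
    emitted.append(target[-1])
    return emitted


def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step for draft length k.

    Assumes each drafted token is accepted independently with probability
    alpha, giving the geometric sum 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

Under these assumptions, a step always emits at least one token (the target model's own), so a low acceptance rate costs the extra drafting work without adding tokens, while a high acceptance rate approaches k + 1 tokens per step.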

Integration with Speculative Decoding

MTP is frequently combined with other speculative decoding strategies to maximize inference speed:

  • N-gram Stacking: Combines MTP with simple n-gram lookups, which cheaply propose likely token sequences by matching previously seen patterns.
  • Implementation: Tools like llama.cpp support stacking these methods to leverage both neural prediction and heuristic speedups.
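
One plausible way to stack the two proposers is sketched below: try the cheap n-gram lookup first and fall back to the MTP draft only when the lookup finds nothing. This ordering is an assumption made for illustration, not a description of how llama.cpp or any other tool implements it, and ngram_propose, stacked_propose, and the mtp_propose callback are hypothetical names.

```python
from typing import Callable, List, Sequence


def ngram_propose(history: Sequence[int], n: int, k: int) -> List[int]:
    """Propose up to k tokens by n-gram lookup.

    Finds the most recent earlier occurrence of the last n tokens of
    history and returns the tokens that followed it.
    """
    if n <= 0 or k <= 0 or len(history) <= n:
        return []
    suffix = list(history[-n:])
    # Scan right to left, starting just before the suffix's own position.
    for start in range(len(history) - n - 1, -1, -1):
        if list(history[start:start + n]) == suffix:
            follow = start + n
            return list(history[follow:follow + k])
    return []


def stacked_propose(
    history: Sequence[int],
    mtp_propose: Callable[[Sequence[int], int], List[int]],
    n: int = 2,
    k: int = 4,
) -> List[int]:
    """Use the n-gram draft when one exists, otherwise the MTP draft."""
    draft = ngram_propose(history, n, k)
    if draft:
        return draft
    # No repeated pattern found: fall back to the neural MTP proposer.
    return mtp_propose(history, k)
```

Either draft would still go through the same verification step, so a wrong n-gram guess costs only the rejected tokens and does not change the final output.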