Prompt Caching

Prompt Caching is an optimization technique in Large Language Model (LLM) inference that stores the attention key-value (KV) states computed for previously processed tokens, so that later requests with an identical or overlapping prefix do not have to recompute them. This can substantially reduce latency and compute cost when the same context is loaded repeatedly.

Mechanism

  • KV State Reuse: Instead of re-running the Transformer forward pass for every token in the prompt, the system retrieves the pre-computed key and value tensors for the cached segment and runs the forward pass only on the tokens that follow it.
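  • Prefix Matching: Effective when queries share long common prefixes (e.g., system prompts, document contexts, or multi-turn conversation history). Because each token's KV state depends on every token before it, reuse is only valid up to the first position where the new prompt differs from the cached one.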
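  • Hardware Efficiency: Skipping the prefill computation for cached tokens frees GPU compute units and memory bandwidth, allowing higher throughput per unit of hardware. The trade-off is the memory needed to hold the cached states.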
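
The mechanism above can be illustrated with a minimal sketch. It is a toy model, not how any particular inference engine implements caching: the class name ToyPrefixCache, the trie layout, and the integer "state" that stands in for a real key/value tensor are all assumptions made for illustration. Each state is derived from the previous state and the current token, which mimics the way a token's KV state depends on everything before it.

```python
from typing import Dict, List, Tuple


class _Node:
    """One trie node: a token position with its (stand-in) KV state."""

    def __init__(self, state: int) -> None:
        self.state = state
        self.children: Dict[int, "_Node"] = {}


class ToyPrefixCache:
    """Hypothetical prefix cache keyed on token IDs (illustration only)."""

    def __init__(self) -> None:
        self._root = _Node(state=0)
        self.tokens_computed = 0  # how many tokens needed fresh computation

    def process(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (per-token states, number of tokens served from cache)."""
        node = self._root
        states: List[int] = []
        reused = 0
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                # Cache miss: "compute" a state that depends on the whole
                # prefix via the parent's state (placeholder for attention).
                self.tokens_computed += 1
                child = _Node((node.state * 31 + token) % 1_000_003)
                node.children[token] = child
            else:
                # Cache hit: the stored state is reused as-is.
                reused += 1
            states.append(child.state)
            node = child
        return states, reused


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4]            # shared prefix, e.g. a system prompt

_, reused_first = cache.process(system_prompt + [10, 11])
_, reused_second = cache.process(system_prompt + [20])

print(reused_first)            # 0: nothing was cached yet
print(reused_second)           # 4: the shared prefix is served from cache
print(cache.tokens_computed)   # 7: six tokens, then only the one new token
```

Once a lookup misses, the newly created node has no children, so every later token in that prompt is also computed fresh. This is why a prompt that differs in its first token gets no benefit even if the rest of it matches a cached request.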

Strategic Impact & Innovations