Prompt Caching
Prompt caching is an optimization technique in Large Language Model (LLM) inference that stores the intermediate state computed for previously processed prompt tokens (specifically their Key-Value Cache entries) so that later requests sharing the same prefix do not recompute it. This shortens the prompt-processing (prefill) phase, which reduces latency and computational cost whenever the same context is loaded repeatedly.
Mechanism
- KV State Reuse: Instead of re-running the Transformer forward pass for every token in the prompt, the system retrieves the pre-computed attention keys and values for cached segments and runs the forward pass only on the uncached remainder.
- Prefix Matching: Effective when queries share long common prefixes (e.g., system prompts, document contexts, or multi-turn conversation history).
- Hardware Efficiency: Skipping prefill for cached tokens reduces demand on GPU compute units, allowing higher throughput per unit of hardware; the trade-off is the memory or storage needed to hold the cached states.
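The reuse and prefix-matching steps above can be illustrated with a minimal sketch. It is a toy model, not any provider's implementation: the class `PrefixKVCache`, its `block_size` parameter, and the string placeholders standing in for KV tensors are all hypothetical, and real systems typically hash fixed-size token blocks rather than storing whole prefixes as keys.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixKVCache:
    """Toy prefix cache (hypothetical): maps a token prefix to a stand-in for its KV state."""
    block_size: int = 4
    store: dict = field(default_factory=dict)

    def longest_cached_prefix(self, tokens):
        """Return how many leading tokens already have cached KV state."""
        matched = 0
        # Only whole blocks are cached, so check each block boundary in order.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if tuple(tokens[:end]) in self.store:
                matched = end
            else:
                break
        return matched

    def prefill(self, tokens):
        """Simulate prefill and return (tokens_reused, tokens_computed)."""
        reused = self.longest_cached_prefix(tokens)
        # Only the uncached suffix would need a real forward pass.
        computed = len(tokens) - reused
        # Record every full block boundary so later requests can match it.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self.store.setdefault(tuple(tokens[:end]), f"kv[0:{end}]")
        return reused, computed


cache = PrefixKVCache(block_size=4)
system_prompt = list(range(8))  # 8 shared tokens, e.g. a system prompt

first = cache.prefill(system_prompt + [100, 101, 102])
second = cache.prefill(system_prompt + [200, 201, 202, 203, 204])
print(first, second)
```

The first request finds nothing cached and computes all 11 tokens; the second shares the 8-token prefix, so only its 5 new tokens are computed. Matching stops at the first block that differs, which is why the benefit depends on requests sharing a prefix rather than arbitrary overlapping text.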
Strategic Impact & Innovations
- Cost Reduction: Major driver for lowering API pricing by reducing the marginal cost of each inference call.
- DeepSeek Implementation:
  - Detailed in DeepSeek’s LLM Price Cuts: Prompt Caching and KV State Innovations.
  - Applies inference-optimization techniques to manage cached KV state without excessive memory overhead.
  - Enabled competitive pricing strategies despite industry-wide cost increases, demonstrating that efficiency gains can offset raw compute costs.
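To make the cost argument concrete, the following sketch computes the input cost of a request when part of the prompt is served from cache. The function `prompt_cost` and both prices are hypothetical placeholders chosen for illustration; they are not DeepSeek's or any other provider's actual rates.

```python
def prompt_cost(prompt_tokens, cached_tokens, miss_price, hit_price):
    """Input cost of one request; prices are per million tokens (hypothetical units)."""
    uncached_tokens = prompt_tokens - cached_tokens
    return (uncached_tokens * miss_price + cached_tokens * hit_price) / 1_000_000


# Assumed example values, not real pricing.
MISS_PRICE = 1.00  # price per million tokens that must be computed
HIT_PRICE = 0.10   # price per million tokens served from cache

without_cache = prompt_cost(10_000, 0, MISS_PRICE, HIT_PRICE)
with_cache = prompt_cost(10_000, 8_000, MISS_PRICE, HIT_PRICE)
savings = 1 - with_cache / without_cache
print(f"{without_cache:.4f} {with_cache:.4f} {savings:.0%}")
```

Under these assumed prices, a 10,000-token prompt with 8,000 cached tokens costs roughly 72% less in input charges than the same prompt with no cache hits. The saving scales with the cached share of the prompt, so workloads with long shared system prompts or documents benefit most.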
Related Concepts
- Key-Value Cache
- speculative-decoding
- inference-optimization
- context-window