
Prompt Caching & KV Cache Optimization

Prompt Caching is an optimization technique in Large Language Model (LLM) inference that stores the results of previously processed tokens (specifically their Key-Value (KV) cache states) so that later requests with identical or overlapping prefixes do not recompute them. This reduces latency, mainly the time to first token, and compute cost when the same context is loaded repeatedly.

Mechanism

KV State Reuse: Instead of re-running the Transformer forward pass for every token in the prompt, the system retrieves the pre-computed attention key and value tensors for cached segments and runs the forward pass only on the uncached remainder of the prompt.
Prefix Matching: Effective when queries share long common prefixes (e.g., system prompts, document contexts, or multi-turn conversation history).
Hardware Efficiency: Reduces GPU compute during the prefill phase by skipping redundant calculations, at the cost of the memory needed to hold the cached states.
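The prefix-matching idea above can be sketched in a few lines of Python. This is a toy model rather than any particular serving engine's implementation: the class `PrefixKVCache`, its methods, and the token ids are hypothetical, and the "KV state" is simulated with strings so that the amount of skipped work is easy to count. Production systems typically match prefixes at block granularity rather than by whole-sequence lookup.

```python
from typing import Dict, List, Tuple


class PrefixKVCache:
    """Toy prefix cache: maps a token prefix to simulated per-token KV entries."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], List[str]] = {}
        self.tokens_computed = 0  # counts simulated per-token forward work

    def _compute_kv(self, token: int, position: int) -> str:
        # Stand-in for computing the real key/value tensors of one token.
        self.tokens_computed += 1
        return f"kv({token}@{position})"

    def _longest_cached_prefix(self, tokens: List[int]) -> int:
        # Try the longest candidate prefix first, then progressively shorter ones.
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._store:
                return end
        return 0

    def prefill(self, tokens: List[int]) -> List[str]:
        hit = self._longest_cached_prefix(tokens)
        kv = list(self._store[tuple(tokens[:hit])]) if hit else []
        # Only the uncached suffix needs new computation.
        for pos in range(hit, len(tokens)):
            kv.append(self._compute_kv(tokens[pos], pos))
        self._store[tuple(tokens)] = kv
        return kv


cache = PrefixKVCache()
system_prompt = [101, 7, 7, 42, 9]        # made-up token ids for a shared prefix

cache.prefill(system_prompt)              # cold: all 5 tokens are computed
after_warmup = cache.tokens_computed

cache.prefill(system_prompt + [55, 56])   # warm: only the 2 new tokens are computed
after_first_query = cache.tokens_computed

cache.prefill(system_prompt + [60])       # warm: only the 1 new token is computed
after_second_query = cache.tokens_computed
```

In this sketch the two queries cost 3 token computations in total instead of 13, because the five-token shared prefix is served from the cache each time.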

VRAM Optimization and Paged Attention

Recent advancements focus on optimizing GPU memory (VRAM) utilization to address fragmentation and improve throughput at scale:

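VRAM Fragmentation: Traditional allocation reserves one contiguous region of memory per request for its KV cache. Space that is reserved but never filled, together with the gaps left between regions, is wasted, which limits the number of concurrent requests a GPU can handle.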
Paged Attention: Inspired by virtual memory management in operating systems, this technique decouples logical KV cache blocks from physical memory blocks, allowing non-contiguous allocation and reducing fragmentation.
Throughput Improvement: By managing VRAM efficiently, a system can fit more concurrent requests on the same GPU, which raises overall throughput without increasing per-request latency.
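The block-table idea behind Paged Attention can be illustrated with a small allocator sketch. Everything here is hypothetical and simplified for illustration: `PagedKVAllocator`, its methods, and the block sizes are assumptions, and only the bookkeeping is modelled, not the attention kernel that reads from the blocks.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy block-table allocator: logical KV blocks map to arbitrary physical blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size                       # tokens per physical block
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}       # seq id -> physical block ids
        self.lengths: Dict[str, int] = {}                  # seq id -> tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve a slot for one more token; return the physical block it lands in."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # The sequence has no block yet, or its last block is full.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        # Return every block of a finished sequence to the free pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)

# Two sequences grow in interleaved order, so their blocks interleave physically.
for seq in ["a", "b", "a", "b", "a"]:
    alloc.append_token(seq)
table_a = list(alloc.block_tables["a"])   # logical blocks 0 and 1 of sequence "a"

# When "b" finishes, its block is reused by a new sequence with no compaction.
alloc.free("b")
for _ in range(3):
    alloc.append_token("c")
table_c = list(alloc.block_tables["c"])
```

Because blocks are allocated on demand, the unused space in this sketch is bounded by the unfilled slots in each sequence's last block (at most `block_size - 1` tokens), and freed blocks can be reused immediately wherever they sit in physical memory.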

References

KV Cache and Paged Attention: Accelerating LLM Inference through VRAM Optimization (IBM Technology)

NemoClaw Knowledge Wiki
