Source video: https://www.youtube.com/watch?v=zt1jlTPCaps
DeepSeek Engram: Conditional Memory via Scalable Lookup
Paper Title: Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
1. The Core Problem: Wasteful Computation
The fundamental inefficiency in current Transformer architectures is that they do not distinguish between tasks requiring deep thought and tasks requiring simple recall.
- The Issue: When an LLM recalls a simple fact (e.g., “Paris is the capital of France”), it runs through the same dozens of transformer layers as it does for a complex reasoning task (e.g., “Why did the Roman Empire fall?”).
- The Reality: LLMs spend deep, expensive computation to simulate what is effectively a hash-table lookup.
- Previous Research: Geva et al. (2021) showed that Feed-Forward Network (FFN) layers already function as key-value memories (a toy illustration follows below). DeepSeek builds on this by asking: why simulate a hash table when we can just give the model one?
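To make the key-value view concrete, here is a toy sketch (my illustration, not code from the paper or the video): the FFN's first matrix acts as a bank of keys and the second as the matching values, with the activation playing the role of a soft lookup. The ReLU nonlinearity and the shapes are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def ffn_as_kv_memory(x, W_key, W_value):
    # Each row of W_key is a stored "key" pattern; how strongly the input
    # activates a key decides how much of the matching W_value row (the
    # stored "value") is mixed into the output.
    scores = F.relu(x @ W_key.T)   # (batch, n_memories): key activations
    return scores @ W_value        # (batch, hidden): weighted sum of values

# Toy sizes: hidden width 8, 32 memory slots.
x = torch.randn(1, 8)
W_key, W_value = torch.randn(32, 8), torch.randn(32, 8)
print(ffn_as_kv_memory(x, W_key, W_value).shape)  # torch.Size([1, 8])
```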
2. The Solution: Engram
DeepSeek introduces a new architecture called Engram to complement the existing Mixture-of-Experts (MoE) architecture. It creates two axes of sparsity:
- Computation Axis (MoE): For complex, compositional reasoning.
- Memory Axis (Engram): For static pattern recall.
Language Has Two Types of Dependencies:
- Compositional Reasoning: Dynamic, context-dependent, requires synthesis (e.g., “Why did the stock market crash?”). Handled by MoE.
- Static Patterns: Local, stereotyped, frequent (e.g., “Alexander the Great,” “By the way”). Handled by Engram (O(1) Lookup).
3. How Engram Works
The mechanism is a hybrid of neural processing and classic lookup tables; a minimal code sketch follows the list below:
- N-Gram Extraction: The model extracts 2-grams and 3-grams from the input text (e.g., “Alexander the Great”).
- Hashing: These n-grams are hashed to compute table indices.
- Massive Embedding Table: The indices address a table holding billions of parameters, from which memory embeddings are retrieved.
- Context-Aware Gating (crucial): since hash lookups can collide and words can be ambiguous (e.g., “Apple” the company vs. “apple” the fruit), a gate determines whether the retrieved memory fits the current hidden state.
  - If it fits: the gate opens and the memory flows through.
  - If it doesn’t: the gate suppresses the memory.
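Putting the four steps together, here is a minimal PyTorch sketch. All names (`EngramSketch`, `ngram_index`, the sigmoid gate, the table size, and the multiplicative hash) are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class EngramSketch(nn.Module):
    # Hash recent n-grams into a large embedding table, retrieve the stored
    # memories, and gate them against the current hidden state.
    def __init__(self, hidden, table_size=2**20):
        super().__init__()
        self.table_size = table_size
        self.table = nn.Embedding(table_size, hidden)  # the massive memory table
        self.gate = nn.Linear(2 * hidden, 1)           # context-aware gate

    def ngram_index(self, tokens, n):
        # Hash the trailing n token ids into a table index. A real system
        # would use a stronger hash; collisions are tolerated because the
        # gate can suppress a retrieved memory that does not fit the context.
        h = 0
        for t in tokens[-n:]:
            h = (h * 1000003 + int(t)) % self.table_size
        return h

    def forward(self, tokens, hidden_state):
        # Retrieve memories for the trailing 2-gram and 3-gram.
        idx = torch.tensor([self.ngram_index(tokens, n) for n in (2, 3)])
        memories = self.table(idx)                       # (2, hidden)
        # Score each memory against the current hidden state; the sigmoid
        # in (0, 1) opens or closes the gate per memory.
        paired = torch.cat([hidden_state.expand_as(memories), memories], dim=-1)
        g = torch.sigmoid(self.gate(paired))             # (2, 1)
        return hidden_state + (g * memories).sum(dim=0)  # add what the gate admits

engram = EngramSketch(hidden=8)
out = engram(tokens=[17, 942, 305], hidden_state=torch.randn(8))
```

The gate is what makes the memory conditional: a colliding or contextually wrong entry receives a gate value near zero and is effectively ignored.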
4. The U-Shaped Scaling Law
DeepSeek investigated how to split a fixed parameter budget between MoE and Engram. Validation loss traces a U-shaped curve as the split varies:
- 100% MoE: Suboptimal (wastes depth on static patterns).
- Too much Engram: Suboptimal (loses reasoning capability).
- The Sweet Spot: roughly 75-80% of parameters to MoE, 20-25% to Engram.
5. Results & Performance
Comparing Engram-27B vs. MoE-27B (same parameters, FLOPs, and training data):
- Knowledge Gains: MMLU (+3.0), Chinese Knowledge (+4.0).
- Reasoning Gains (Surprising): BBH (+5.0), ARC-Challenge (+3.7), MATH (+2.4).
Why did Reasoning improve more than Knowledge?
The paper suggests the model became effectively deeper.
- In standard models, the first ~6 layers are wasted on “Static Reconstruction” (figuring out that “Diana, Princess of Wales” is a single entity).
- In Engram, this resolution happens instantly via a single O(1) lookup.
- Result: All 30+ layers are now available for actual reasoning. CKA (Centered Kernel Alignment) analysis shows that Layer 5 of the Engram model produces representations equivalent to Layer 12 of the baseline MoE; a minimal CKA implementation is sketched below.
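CKA here refers to Centered Kernel Alignment (Kornblith et al., 2019), a standard measure of representational similarity between layers. A minimal linear-CKA sketch, assuming you have captured activation matrices for the two layers on the same token batch:

```python
import torch

def linear_cka(X, Y):
    # X: (n, d1) activations from one layer; Y: (n, d2) from another,
    # captured on the same inputs. Returns a similarity in [0, 1].
    X = X - X.mean(dim=0, keepdim=True)   # center each feature
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2          # ||Y^T X||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

# E.g., linear_cka(engram_layer5_acts, baseline_layer12_acts) near 1.0
# would indicate the two layers encode essentially the same representation.
```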
Long Context Improvement
- Multi-Query Needle-In-A-Haystack: 84.2 → 97.0 (baseline MoE vs. Engram).
- Because Engram handles local dependencies, the Attention mechanism is freed up to focus on long-range, global dependencies.
6. The Hardware Story (Economics)
This architecture is highly optimized for inference costs:
- Deterministic Lookup: Unlike MoE routing (which depends on hidden states computed at runtime), Engram lookup indices are known from the input tokens alone.
- Prefetching: While the GPU computes Layer 1, the CPU can already prefetch the Engram embeddings for Layer 2 (see the sketch after this list).
- Cheap Memory: The massive embedding table can live in Host RAM (cheap), not GPU VRAM (expensive).
- Throughput: Less than a 3% throughput penalty even with a 100B-parameter table held in system RAM.
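A hedged sketch of the prefetch pattern, assuming a CUDA device; `prefetch`, `run_layers`, and all sizes are hypothetical illustrations of the idea, not the paper's pipeline:

```python
import torch

# The embedding table lives in (cheap) pinned host RAM; copies to the GPU
# run on a side stream so they overlap layer computation.
table = torch.randn(2**16, 1024).pin_memory()  # scaled-down stand-in table
copy_stream = torch.cuda.Stream()

def prefetch(indices):
    # Gather the needed rows on the CPU and start an async host-to-device
    # copy. A production system would double-buffer a pinned staging area.
    rows = table.index_select(0, indices).pin_memory()
    with torch.cuda.stream(copy_stream):
        return rows.to("cuda", non_blocking=True)

def run_layers(hidden, per_layer_indices, layers):
    # Because all indices are known up front, each layer's memories are
    # already in flight while the previous layer is still computing.
    pending = prefetch(per_layer_indices[0])
    for i, layer in enumerate(layers):
        torch.cuda.current_stream().wait_stream(copy_stream)  # memories ready
        if i + 1 < len(layers):
            nxt = prefetch(per_layer_indices[i + 1])  # overlaps this layer
        hidden = layer(hidden) + pending              # fuse retrieved memory
        pending = nxt if i + 1 < len(layers) else None
    return hidden
```

The point is that no lookup ever waits on the GPU: every index is computable from the raw tokens, so host-to-device copies can be enqueued a layer ahead of the compute that consumes them.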
7. Limitations
- Hash Collisions: Mitigated by gating, but not eliminated.
- Static Embeddings: retrieved memories are fixed vectors; they do not adapt to context during inference the way attention outputs do.
- Not RAG: This is not retrieval-augmented generation; it does not connect to an external live database (internet/files). The lookup table is fixed after training.
- Limited Order: Currently uses only 2-grams and 3-grams; longer patterns might be missed.
Summary: The Bigger Picture
Engram represents a shift toward Separation of Concerns in AI architecture, mirroring human cognition:
- System 1 (Engram): Fast, automatic pattern matching.
- System 2 (MoE): Slow, deliberate reasoning.
Core Insight: “Using a calculator to store phone numbers works, but it is wasteful. Engram gives LLMs an address book.”