Source video: https://www.youtube.com/watch?v=zt1jlTPCaps
DeepSeek Engram: Conditional Memory via Scalable Lookup
Paper Title: Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
1. The Core Problem: Wasteful Computation
The fundamental inefficiency in current Transformer architectures is that they do not distinguish between tasks requiring deep thought and tasks requiring simple recall.
- The Issue: When an LLM recalls a simple fact (e.g., “Paris is the capital of France”), it runs through the same dozens of transformer layers as it does for a complex reasoning task (e.g., “Why did the Roman Empire fall?”).
- The Reality: LLMs spend deep, expensive computation to simulate what is effectively a hash-table lookup.
- Previous Research: Geva et al. (2021) showed that Feed-Forward Network (FFN) layers already function as key-value memories (a toy illustration follows below). DeepSeek builds on this by asking: why simulate a hash table when we can just give the model one?
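To make the key-value view concrete, here is a toy sketch (my illustration, not code from the paper or the video): the FFN's first matrix acts as a bank of keys and the second as the matching values, with the activation playing the role of a soft lookup. The ReLU nonlinearity and the shapes are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def ffn_as_kv_memory(x, W_key, W_value):
    # Each row of W_key is a stored "key" pattern; how strongly the input
    # activates a key decides how much of the matching W_value row (the
    # stored "value") is mixed into the output.
    scores = F.relu(x @ W_key.T)   # (batch, n_memories): key activations
    return scores @ W_value        # (batch, hidden): weighted sum of values

# Toy sizes: hidden width 8, 32 memory slots.
x = torch.randn(1, 8)
W_key, W_value = torch.randn(32, 8), torch.randn(32, 8)
print(ffn_as_kv_memory(x, W_key, W_value).shape)  # torch.Size([1, 8])
```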
2. The Solution: Engram
DeepSeek introduces a new architecture called Engram to complement the existing Mixture-of-Experts (MoE) architecture. It creates two axes of sparsity:
- Computation Axis (MoE): For complex, compositional reasoning.
- Memory Axis (Engram): For static pattern recall.
Language Has Two Types of Dependencies:
- Compositional Reasoning: Dynamic, context-dependent, requires synthesis (e.g., “Why did the stock market crash?”). Handled by MoE.
- Static Patterns: Local, stereotyped, frequent (e.g., “Alexander the Great,” “By the way”). Handled by Engram (O(1) Lookup).
3. How Engram Works
The mechanism is a hybrid of neural processing and classic lookup tables; a minimal code sketch follows the list below:
- N-Gram Extraction: The model extracts 2-grams and 3-grams from the input text (e.g., “Alexander the Great”).
- Hashing: These n-grams are hashed to compute table indices.
- Massive Embedding Table: The indices address a table holding billions of parameters, from which memory embeddings are retrieved.
- Context-Aware Gating (crucial): since hash lookups can collide and words can be ambiguous (e.g., “Apple” the company vs. “apple” the fruit), a gate determines whether the retrieved memory fits the current hidden state.
  - If it fits: the gate opens and the memory flows through.
  - If it doesn’t: the gate suppresses the memory.
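Putting the four steps together, here is a minimal PyTorch sketch. All names (`EngramSketch`, `ngram_index`, the sigmoid gate, the table size, and the multiplicative hash) are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class EngramSketch(nn.Module):
    # Hash recent n-grams into a large embedding table, retrieve the stored
    # memories, and gate them against the current hidden state.
    def __init__(self, hidden, table_size=2**20):
        super().__init__()
        self.table_size = table_size
        self.table = nn.Embedding(table_size, hidden)  # the massive memory table
        self.gate = nn.Linear(2 * hidden, 1)           # context-aware gate

    def ngram_index(self, tokens, n):
        # Hash the trailing n token ids into a table index. A real system
        # would use a stronger hash; collisions are tolerated because the
        # gate can suppress a retrieved memory that does not fit the context.
        h = 0
        for t in tokens[-n:]:
            h = (h * 1000003 + int(t)) % self.table_size
        return h

    def forward(self, tokens, hidden_state):
        # Retrieve memories for the trailing 2-gram and 3-gram.
        idx = torch.tensor([self.ngram_index(tokens, n) for n in (2, 3)])
        memories = self.table(idx)                       # (2, hidden)
        # Score each memory against the current hidden state; the sigmoid
        # in (0, 1) opens or closes the gate per memory.
        paired = torch.cat([hidden_state.expand_as(memories), memories], dim=-1)
        g = torch.sigmoid(self.gate(paired))             # (2, 1)
        return hidden_state + (g * memories).sum(dim=0)  # add what the gate admits

engram = EngramSketch(hidden=8)
out = engram(tokens=[17, 942, 305], hidden_state=torch.randn(8))
```

The gate is what makes the memory conditional: a colliding or contextually wrong entry receives a gate value near zero and is effectively ignored.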
4. The U-Shaped Scaling Law
DeepSeek investigated how to split a fixed parameter budget between MoE and Engram. Validation loss traces a U-shaped curve as the split varies:
- 100% MoE: Suboptimal (wastes depth on static patterns).
- Too much Engram: Suboptimal (loses reasoning capability).
- The Sweet Spot: roughly 75-80% of parameters to MoE, 20-25% to Engram.
5. Results & Performance
Comparing Engram-27B vs. MoE-27B (same parameters, FLOPs, and training data):
- Knowledge Gains: MMLU (+3.0), Chinese Knowledge (+4.0).
- Reasoning Gains (Surprising): BBH (+5.0), ARC-Challenge (+3.7), MATH (+2.4).
Why did Reasoning improve more than Knowledge?
The paper suggests the model became effectively deeper.
- In standard models, the first ~6 layers are wasted on “Static Reconstruction” (figuring out that “Diana, Princess of Wales” is a single entity).
- In Engram, this resolution happens instantly via a single O(1) lookup.
- Result: All 30+ layers are now available for actual reasoning. CKA (Centered Kernel Alignment) analysis shows that Layer 5 of the Engram model produces representations equivalent to Layer 12 of the baseline MoE; a minimal CKA implementation is sketched below.
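CKA here refers to Centered Kernel Alignment (Kornblith et al., 2019), a standard measure of representational similarity between layers. A minimal linear-CKA sketch, assuming you have captured activation matrices for the two layers on the same token batch:

```python
import torch

def linear_cka(X, Y):
    # X: (n, d1) activations from one layer; Y: (n, d2) from another,
    # captured on the same inputs. Returns a similarity in [0, 1].
    X = X - X.mean(dim=0, keepdim=True)   # center each feature
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2          # ||Y^T X||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

# E.g., linear_cka(engram_layer5_acts, baseline_layer12_acts) near 1.0
# would indicate the two layers encode essentially the same representation.
```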
Long Context Improvement
- Multi-Query Needle-In-A-Haystack: 84.2 → 97.0 (baseline MoE vs. Engram).
- Because Engram handles local dependencies, the Attention mechanism is freed up to focus on long-range, global dependencies.
6. The Hardware Story (Economics)
This architecture is highly optimized for inference costs:
- Deterministic Lookup: Unlike MoE routing (which depends on hidden states computed at runtime), Engram lookup indices are known from the input tokens alone.
- Prefetching: While the GPU computes Layer 1, the CPU can already prefetch the Engram embeddings for Layer 2 (see the sketch after this list).
- Cheap Memory: The massive embedding table can live in Host RAM (cheap), not GPU VRAM (expensive).
- Throughput: Less than a 3% throughput penalty even with a 100B-parameter table held in system RAM.
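A hedged sketch of the prefetch pattern, assuming a CUDA device; `prefetch`, `run_layers`, and all sizes are hypothetical illustrations of the idea, not the paper's pipeline:

```python
import torch

# The embedding table lives in (cheap) pinned host RAM; copies to the GPU
# run on a side stream so they overlap layer computation.
table = torch.randn(2**16, 1024).pin_memory()  # scaled-down stand-in table
copy_stream = torch.cuda.Stream()

def prefetch(indices):
    # Gather the needed rows on the CPU and start an async host-to-device
    # copy. A production system would double-buffer a pinned staging area.
    rows = table.index_select(0, indices).pin_memory()
    with torch.cuda.stream(copy_stream):
        return rows.to("cuda", non_blocking=True)

def run_layers(hidden, per_layer_indices, layers):
    # Because all indices are known up front, each layer's memories are
    # already in flight while the previous layer is still computing.
    pending = prefetch(per_layer_indices[0])
    for i, layer in enumerate(layers):
        torch.cuda.current_stream().wait_stream(copy_stream)  # memories ready
        if i + 1 < len(layers):
            nxt = prefetch(per_layer_indices[i + 1])  # overlaps this layer
        hidden = layer(hidden) + pending              # fuse retrieved memory
        pending = nxt if i + 1 < len(layers) else None
    return hidden
```

The point is that no lookup ever waits on the GPU: every index is computable from the raw tokens, so host-to-device copies can be enqueued a layer ahead of the compute that consumes them.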
7. Limitations
- Hash Collisions: Mitigated by gating, but not eliminated.
- Static Embeddings: retrieved memories are fixed vectors; they do not adapt to context during inference the way attention outputs do.
- Not RAG: This is not retrieval-augmented generation; it does not connect to an external live database (internet/files). The lookup table is fixed after training.
- Limited Order: Currently uses only 2-grams and 3-grams; longer patterns might be missed.
Summary: The Bigger Picture
Engram represents a shift toward Separation of Concerns in AI architecture, mirroring human cognition:
- System 1 (Engram): Fast, automatic pattern matching.
- System 2 (MoE): Slow, deliberate reasoning.
Core Insight: “Using a calculator to store phone numbers works, but it is wasteful. Engram gives LLMs an address book.”