Demystifying AI: Transformer Training on a 1979 PDP-11
Clip title: EXPOSED: The Dirty Little Secret of AI (On a 1979 PDP-11)
Author / channel: Dave’s Garage
URL: https://www.youtube.com/watch?v=OUE3FSIk46g
Summary
The video, presented by Dave, aims to demystify the training process of a neural network by running a transformer on a vintage 1979 PDP-11/44. Unlike modern cloud clusters with thousands of GPUs, this system has a single CPU running at roughly 6 MHz and a mere 64 KB of RAM (though later upgraded to 4 MB). The core idea, Dave argues, is not magical or new; it is the scale of modern computational power that makes it appear so. By using this vintage “big iron,” the video intends to strip away the hype and showcase the essential machinery of a neural network as it learns.
The project, dubbed “ATTN/11 - Paper Tape Is All You Need,” is a single-layer, single-head transformer written in raw PDP-11 assembly language by Damian Bourré. Its modest goal is to learn how to reverse a sequence of eight digits (e.g., 12345678 to 87654321). This seemingly simple task is non-trivial because the model cannot merely memorize patterns; it must learn a structural rule based on position, not content. Dave explains the concept of self-attention using an analogy of resolving ambiguous words like “bank” in a sentence (“Mary went down to the bank to get some cash”). Transformers, he clarifies, dynamically weigh different parts of the input to resolve meaning, a capability that revolutionized natural language processing by allowing models to understand relationships between distant tokens.
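To make the task and the attention mechanism concrete, here is a minimal sketch in Python/NumPy of the reversal dataset and one single-head self-attention pass. The sequence length (8), vocabulary size (10), and model dimension (16) follow the video; the weight initialization and everything else here are illustrative assumptions, not the PDP-11 assembly.

```python
# Minimal sketch (not the video's code): the digit-reversal task and one
# single-head self-attention pass. Dimensions follow the video's setup;
# the random weights are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, VOCAB, D_MODEL = 8, 10, 16

def make_example():
    """One training pair: a random 8-digit sequence and its reversal."""
    x = rng.integers(0, VOCAB, size=SEQ_LEN)
    return x, x[::-1]

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative parameters: token + positional embeddings and one attention head.
tok_emb = rng.normal(0, 0.1, (VOCAB, D_MODEL))
pos_emb = rng.normal(0, 0.1, (SEQ_LEN, D_MODEL))
W_q = rng.normal(0, 0.1, (D_MODEL, D_MODEL))
W_k = rng.normal(0, 0.1, (D_MODEL, D_MODEL))
W_v = rng.normal(0, 0.1, (D_MODEL, D_MODEL))

def self_attention(x):
    """Single-head self-attention over one 8-digit input sequence."""
    h = tok_emb[x] + pos_emb                       # (8, 16): content + position
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    weights = softmax(q @ k.T / np.sqrt(D_MODEL))  # (8, 8) attention weights
    return weights @ v                             # each position mixes in the others

x, y = make_example()
print("input :", x)
print("target:", y)
print("attention output shape:", self_attention(x).shape)
```

In a full model, this attention output would be mapped to digit predictions and scored against the reversed target to produce the loss that drives training.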
The training process is likened to “training a dog”: the machine makes a guess, measures how wrong it was (the loss), nudges a pile of numbers (the weights) in the right direction, and repeats. This loop, driven by backpropagation, is the clever part of modern AI. The PDP-11 transformer’s architecture is remarkably lean: a single layer, a single attention head, a model dimension of 16, a sequence length of 8, a 10-digit vocabulary, and only 1,216 parameters. To achieve reasonable performance on the vintage hardware, the arithmetic was custom-tailored using a fixed-point representation. Training that took hours in Fortran converged to 100% accuracy in about 3.5 minutes on the PDP-11/44 once rewritten in assembly.
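Because the machine has no practical floating point here, the “nudge the weights” step has to run in fixed point. Below is a minimal sketch of what that can look like, assuming a 16-bit Q8.8 format and an illustrative learning rate; the actual format and update rule used in ATTN/11 are not spelled out in the summary.

```python
# Minimal sketch of fixed-point arithmetic and one weight "nudge", assuming a
# Q8.8 format (8 integer bits, 8 fractional bits). The format and learning
# rate are illustrative assumptions, not details taken from the video.
FRAC_BITS = 8
ONE = 1 << FRAC_BITS          # 1.0 in Q8.8 is 256

def to_fx(x: float) -> int:
    """Encode a float as a Q8.8 integer."""
    return int(round(x * ONE))

def fx_mul(a: int, b: int) -> int:
    """Multiply two Q8.8 numbers; the product needs one rescaling shift."""
    return (a * b) >> FRAC_BITS

def sgd_step(weight: int, grad: int, lr: int) -> int:
    """One 'nudge': move the weight against the gradient, all in fixed point."""
    return weight - fx_mul(lr, grad)

w = to_fx(0.75)               # weight 0.75  -> 192
g = to_fx(0.5)                # gradient 0.5 -> 128
lr = to_fx(0.1)               # learning rate 0.1 -> ~26
w_new = sgd_step(w, g, lr)
print(w_new / ONE)            # ~0.70 after one update
```

The rescaling shift after every multiply is what keeps products within 16-bit range, which is exactly the kind of bookkeeping the assembly version has to get right at each step.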
Ultimately, the video concludes that AI training, at its core, is a process of repeated error correction applied to adjustable numbers in memory: a brute-force optimization that computers have always excelled at. This demystifies the “magic” of AI, highlighting that the underlying mathematics is frugal and that the intelligence emerges from countless tiny adjustments. The project underscores the importance of efficiency and creative engineering under hardware constraints, concerns that are increasingly relevant even in the modern AI landscape. It reminds us that a computer is a machine with specific strengths and weaknesses, not a wish-granting device, and that understanding these fundamental realities can lead to profound insights and innovative solutions.
Related Concepts
- Transformer training — Wikipedia
- Neural network training — Wikipedia
- Transformer architecture — Wikipedia
- Computational scaling — Wikipedia
- CPU — Wikipedia
- RAM — Wikipedia
- Legacy computing — Wikipedia
- Self-attention — Wikipedia
- Backpropagation — Wikipedia
- Natural language processing — Wikipedia
- Fixed-point representation — Wikipedia
- Assembly language — Wikipedia
- Loss function — Wikipedia
- Model parameters — Wikipedia
- Optimization — Wikipedia
- Error correction — Wikipedia
- Weights — Wikipedia
- Model dimensions — Wikipedia
- Sequence length — Wikipedia
- Vocabulary — Wikipedia
- Brute-force optimization — Wikipedia
- Single-layer transformer — Wikipedia
- Single-head attention — Wikipedia