4-bit quantisation
A technique that reduces the numerical precision of machine learning model parameters to 4 bits each, cutting memory footprint roughly 8× versus 32-bit floats and lowering computational cost while largely preserving model performance.
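A minimal sketch of the basic idea, using symmetric linear quantisation to signed 4-bit integers (illustrative only; function names and the single whole-tensor scale are my own simplification, not the specific scheme discussed in the video):

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Symmetric linear quantisation of a float32 tensor to signed 4-bit codes."""
    qmax = 7                                  # max magnitude representable in signed 4-bit
    scale = np.abs(weights).max() / qmax      # one scale for the whole tensor (simplification)
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the 4-bit codes."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(1024).astype(np.float32)
    q, scale = quantize_4bit(w)
    w_hat = dequantize_4bit(q, scale)
    # Two 4-bit codes pack into one byte: 1024 params -> 512 bytes vs 4096 bytes in FP32.
    print("max abs error:", np.abs(w - w_hat).max())
    print("FP32 bytes:", w.nbytes, "| packed 4-bit bytes:", len(q) // 2)
```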
- Julia Turc’s video discusses the evolution of training LLMs with reduced precision, particularly the shift toward 4-bit floating-point (FP4) training for cost efficiency
- Training LLMs incurs extreme costs: Stanford estimated Google’s Gemini Ultra (2023) at ~$78 million (Sam Altman claimed higher figures), and 2025 training costs are expected to rise further
- Enables training and inference with reduced hardware requirements compared to full-precision (32-bit) models
- Addresses the key challenge of maintaining model accuracy as precision drops, using more sophisticated quantisation algorithms (see the sketch below)
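One common accuracy-preserving trick is block-wise (group-wise) quantisation, where each small block of weights gets its own scale so outliers in one block do not inflate the error everywhere else. The sketch below is an assumption about how such a scheme can be implemented, not the exact algorithm from the video; block size 64 and the helper names are illustrative:

```python
import numpy as np

def quantize_blockwise_4bit(weights: np.ndarray, block_size: int = 64):
    """Block-wise symmetric 4-bit quantisation: one scale per block of weights."""
    flat = weights.reshape(-1, block_size)                  # assumes size is a multiple of block_size
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7    # 7 = max signed 4-bit magnitude
    scales = np.where(scales == 0, 1.0, scales)             # avoid division by zero
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_blockwise_4bit(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Expand 4-bit codes back to floats using each block's scale."""
    return (q.astype(np.float32) * scales).reshape(shape)

if __name__ == "__main__":
    w = np.random.randn(4096, 64).astype(np.float32)
    q, s = quantize_blockwise_4bit(w)
    w_hat = dequantize_blockwise_4bit(q, s, w.shape)
    # Storage: 4 bits per weight plus one scale per 64-weight block.
    print("mean abs error:", np.abs(w - w_hat).mean())
```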
2026 04 14 How does 4bit quantisation work