4-bit quantisation

A technique that reduces the numerical precision of a machine learning model's parameters to 4 bits each, substantially lowering memory footprint and compute cost while largely preserving model performance.

  • Julia Turc’s video discusses the evolution of training LLMs with reduced precision, particularly the shift toward 4-bit floating-point (FP4) training for cost efficiency
  • Training LLMs is extremely expensive: Stanford estimated the training of Google’s Gemini Ultra (2023) at ~$78 million (Sam Altman has claimed higher figures), and 2025 training costs are expected to rise further
  • Enables training and inference with reduced hardware requirements compared to full-precision (32-bit) models
  • Addresses the key challenge of maintaining model accuracy as precision drops, which requires advanced quantisation algorithms
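To make the memory saving concrete: a 7B-parameter model needs ~28 GB at 32 bits per weight but only ~3.5 GB at 4 bits (plus a small overhead for scales). A minimal sketch of the core idea, using symmetric absmax integer quantisation rather than any specific FP4 format (real schemes such as NF4 or MXFP4 use per-block scales and non-uniform code points; the function names here are illustrative):

```python
def quantize_int4(weights):
    """Map floats to 4-bit signed codes in [-8, 7] via an absmax scale.

    Toy per-tensor version; production quantisers work per block
    (e.g. 32-64 weights per scale) to limit error from outliers.
    """
    scale = max(abs(w) for w in weights) / 7  # largest magnitude maps to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int4(codes, scale):
    """Recover approximate floats from codes and the stored scale."""
    return [c * scale for c in codes]

weights = [0.42, -1.30, 0.07, 0.91]
codes, scale = quantize_int4(weights)
approx = dequantize_int4(codes, scale)
# Each reconstructed weight is within half a quantisation step of the original.
```

Each code fits in 4 bits, so two weights pack into one byte; the accuracy challenge in the bullet above is exactly that this rounding error accumulates across billions of weights, which is what the more advanced algorithms try to control.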

2026 04 14 How does 4bit quantisation work