Quantisation

Quantisation is a technique in machine learning that reduces the precision of the numerical values used in neural networks, typically by representing weights and activations with fewer bits than standard floating-point formats. In AI agents, quantisation lets models run more efficiently on resource-constrained hardware by reducing memory requirements and computational cost while keeping accuracy within acceptable limits. Rather than storing weights as 32-bit floating-point numbers, quantised models use lower-precision representations, such as 8-bit or 4-bit integers.
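
The storage saving follows directly from the bit width. The sketch below is only back-of-the-envelope arithmetic: the function name weight_memory_bytes and the 7-billion-parameter model size are assumptions chosen for illustration, and the figure ignores scale factors and other metadata that a real quantised format stores alongside the weights.

```python
def weight_memory_bytes(num_params: int, bits_per_weight: int) -> float:
    """Bytes needed for the weights alone (no scale factors or metadata)."""
    return num_params * bits_per_weight / 8


# Hypothetical model size, chosen only for illustration.
NUM_PARAMS = 7_000_000_000

for bits in (32, 8, 4):
    gigabytes = weight_memory_bytes(NUM_PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit weights: {gigabytes:.1f} GB")
```

Under these assumptions the weights occupy 28.0 GB at 32 bits, 7.0 GB at 8 bits, and 3.5 GB at 4 bits.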

4-bit Quantisation

4-bit quantisation represents model weights and activations using only 4 bits per value, so each value takes one of 16 levels. Relative to 32-bit storage, this reduces the memory footprint by up to roughly 8 times (slightly less once the scale factors stored alongside the integers are counted) and can significantly accelerate inference, making large models feasible for deployment on edge devices and mobile hardware. 4-bit quantisation typically maps the original range of floating-point values onto a small integer range, then applies the inverse scaling during computation to recover an approximation of the original values. The rounding error introduced by this mapping is the source of the accuracy loss.
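
The mapping and its inverse can be sketched in a few lines. This is a minimal illustration of one simple scheme (a single scale and offset for the whole list of values), not the method of any particular library. The names quantise_4bit, dequantise_4bit, and pack_nibbles and the sample weights are hypothetical; practical formats usually use a separate scale per small group of weights.

```python
from typing import List, Tuple

LEVELS = 15  # 4 bits give 16 codes (0..15), so 15 steps span the range


def quantise_4bit(values: List[float]) -> Tuple[List[int], float, float]:
    """Map floats onto integer codes 0..15; return (codes, scale, offset)."""
    lo, hi = min(values), max(values)
    # Guard against a zero-width range (all values equal).
    scale = (hi - lo) / LEVELS if hi > lo else 1.0
    codes = [min(LEVELS, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantise_4bit(codes: List[int], scale: float, offset: float) -> List[float]:
    """Apply the inverse scaling to get approximate floats back."""
    return [c * scale + offset for c in codes]


def pack_nibbles(codes: List[int]) -> bytes:
    """Store two 4-bit codes per byte (odd-length input is padded with 0)."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


weights = [-0.62, -0.10, 0.0, 0.27, 0.88]  # illustrative values only
codes, scale, offset = quantise_4bit(weights)
restored = dequantise_4bit(codes, scale, offset)
packed = pack_nibbles(codes)

print("codes:     ", codes)
print("restored:  ", [round(r, 3) for r in restored])
print("bytes used:", len(packed))
```

Each restored value lies within half a quantisation step (scale / 2) of the original, and the five codes fit in three bytes because two codes share each byte.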

Trade-offs and Practical Considerations

The primary trade-off in 4-bit quantisation is between model compression and accuracy loss. Aggressive quantisation can degrade model quality, but the choice of method matters: post-training quantisation converts an already trained model, while quantisation-aware training simulates the reduced precision during training so that the model can adapt to it. Applied carefully, either approach can keep accuracy degradation to acceptable levels. The effectiveness of 4-bit quantisation depends on model architecture, the specific layers being quantised, and the downstream task. For many AI agent applications, the reduced latency and memory footprint justify a modest reduction in accuracy.
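
One rough way to see the compression side of this trade-off is to compare the round-trip error at different bit widths. The sketch below assumes synthetic Gaussian values standing in for weights and a hypothetical helper, roundtrip_error. Error in the weights is only a proxy: the real accuracy impact has to be measured on the downstream task.

```python
import random
from typing import List


def roundtrip_error(values: List[float], bits: int) -> float:
    """Mean absolute error after quantising to `bits` bits and dequantising."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    total = 0.0
    for v in values:
        code = min(levels, max(0, round((v - lo) / scale)))
        total += abs(v - (code * scale + lo))
    return total / len(values)


# Synthetic stand-in for a weight tensor (an assumption, not real model data).
random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(10_000)]

for bits in (8, 4):
    print(f"{bits}-bit mean absolute error: {roundtrip_error(weights, bits):.6f}")
```

Halving the bit width from 8 to 4 halves the storage but makes each quantisation step 17 times coarser (255 steps versus 15), which is why the error grows much faster than the memory saving.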