Model Size

The physical and computational footprint of a machine learning model, primarily determined by the number of parameters (e.g., 7B, 70B) and their precision (e.g., 32-bit, 8-bit). Larger models require more storage, memory, and computational resources for training and inference.

Key implications:

  • Storage: A 70B model at 16-bit precision (e.g., NVIDIA’s Llama 3.1 Nemotron 70B) requires ~150GB (≈30 files × 5GB each); full 32-bit precision would need roughly double that (~280GB).
  • Hardware demands: High memory bandwidth and VRAM needed for inference, limiting deployment on consumer hardware.
  • Trade-offs: Larger size often correlates with better performance but increases latency and cost.
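The storage figures above follow directly from parameter count × bytes per parameter. A minimal sketch (the function name is illustrative, not from the source):

```python
def model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# A 70B-parameter model at different precisions:
print(model_size_gb(70e9, 32))  # → 280.0 (full 32-bit precision)
print(model_size_gb(70e9, 16))  # → 140.0 (the note's ~150GB figure includes rounding/overhead)
print(model_size_gb(70e9, 8))   # → 70.0
```

Real checkpoints add a small amount of metadata and file overhead on top of the raw parameter bytes, which is why the note's on-disk figure (~150GB) is slightly above the 16-bit raw size.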

Quantisation as an optimisation technique:

  • Quantisation reduces parameter precision (e.g., 32-bit → 8-bit), shrinking storage needs by ~75% (e.g., a 70B model from ~280GB at 32-bit to ~70GB at 8-bit).
  • Enables deployment of large models on edge devices and reduces inference latency.
  • Source: Adam Lucek - quantisation of LLM (2026-04-14 video).
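The precision reduction described above can be sketched with simple symmetric absmax quantisation, assuming NumPy; this is a toy illustration, not the method from the source video (production LLM quantisers typically work per-channel or per-group):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 by scaling the max |weight| to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

print(q.nbytes / w.nbytes)        # → 0.25, i.e. the ~75% storage reduction
print(np.max(np.abs(w - w_hat)))  # small reconstruction error, bounded by scale/2
```

The storage saving is exact (1 byte vs 4 bytes per weight); the cost is the rounding error introduced when weights are snapped to the 256 representable int8 levels.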

Source Notes

2026 04 14 Adam Lucek quantisation of LLM