VRAM

Video RAM (VRAM) is the dedicated memory on a GPU used to store model weights, activations, and intermediate data during inference and training. Its capacity directly limits the size of models that can be executed on a single GPU, especially for resource-intensive tasks like large-language-model (LLM) deployment.

  • VRAM Constraints in LLMs: Full-precision (32-bit) LLMs like NVIDIA’s Llama 3.1 Nemotron 70B (70.6 billion parameters) require ~30GB+ of VRAM (e.g., 30+ files at ~5GB each), exceeding most consumer GPUs.
  • Quantization as a VRAM Optimization: model-efficiency reduces model parameter precision (e.g., to 8-bit or 4-bit), slashing VRAM requirements by 2–4× while maintaining acceptable accuracy. This enables deployment of large models on hardware with limited VRAM.
  • Small LLMs for Local Inference: For running well-instructed small LLMs on a 48GB VRAM NVIDIA GPU, quantized versions of Llama 3.1 70B, Gemma 2 27B, Qwen 2 72B, and Mistral Large are viable options. These models, when properly quantized, can effectively run on a 48GB VRAM GPU.