tags:
    • vram
    • gpu
    • machine-learning
    • quantization
    • llm
    • video-ram
    • gpu-memory
    • model-compression
    • llm-inference
updated: 2026-04-14
group: open-systems-local-models
aliases:
    • Video RAM
    • GPU memory
summary: "VRAM is the dedicated memory on a GPU used to store model weights, activations, and intermediate data during inference and training."
backlinks:
    • 2026-04-14 Best small LLM for local inference for instruction following

VRAM

Video RAM (VRAM) is the dedicated memory on a GPU used to store model weights, activations, and intermediate data during inference and training. Its capacity directly limits the size of models that can be executed on a single GPU, especially for resource-intensive tasks like large-language-model (LLM) deployment.
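The dominant term in that footprint is usually the weights themselves, which makes a back-of-envelope estimate easy: parameter count times bytes per parameter. A minimal sketch (the helper name is my own; it ignores activations, KV cache, and runtime overhead):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in gigabytes (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# A 70.6B-parameter model (the size discussed below):
fp32_gb = weight_memory_gb(70.6e9, 4)  # 32-bit floats -> ~282 GB
fp16_gb = weight_memory_gb(70.6e9, 2)  # 16-bit floats -> ~141 GB
```

Even before counting activations, these numbers show why precision, not compute, is often the first wall you hit on a single GPU.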

  • VRAM Constraints in LLMs: Large LLMs such as NVIDIA’s Llama 3.1 Nemotron 70B (70.6 billion parameters) need roughly 280 GB of VRAM at full precision (32-bit), and still ~140 GB at 16-bit (the checkpoint ships as ~30 shards of ~5 GB each), far exceeding any consumer GPU.
  • Quantization as a VRAM Optimization: Quantization reduces the precision of model parameters (e.g., from 16-bit to 8-bit or 4-bit), cutting VRAM requirements by 2–4× while maintaining acceptable accuracy. This enables deployment of large models on hardware with limited VRAM.
  • Small LLMs for Local Inference: On a single 48GB VRAM NVIDIA GPU, quantized versions of instruction-tuned models such as Llama 3.1 70B, Gemma 2 27B, and Qwen 2 72B are viable; Mistral Large (123B parameters) is a tighter fit and generally needs more aggressive (≤3-bit) quantization to squeeze under 48GB.
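The fit checks behind these bullets can be sketched as simple arithmetic. The helper names and the 20% overhead allowance (for KV cache, activations, and runtime buffers) are illustrative assumptions, not measured values:

```python
def quantized_weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Weight footprint in GB (1e9 bytes) at a given quantization bit width."""
    return n_params * bits_per_weight / 8 / 1e9

def fits_in_vram(n_params: float, bits: int, vram_gb: float,
                 overhead: float = 0.2) -> bool:
    """Rough fit check: weights plus an assumed fractional overhead
    (KV cache, activations, buffers) must stay under total VRAM."""
    return quantized_weight_gb(n_params, bits) * (1 + overhead) <= vram_gb

# A 70.6B-parameter model on a 48 GB GPU at common bit widths:
for bits in (16, 8, 4):
    gb = quantized_weight_gb(70.6e9, bits)
    print(f"{bits}-bit: {gb:.1f} GB weights, fits: {fits_in_vram(70.6e9, bits, 48.0)}")
```

Under these assumptions only the 4-bit variant (~35 GB of weights) fits a 48 GB card, which matches the note's recommendation of quantized 70B-class models for local inference.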