- “vram”
- “gpu”
- “machine-learning”
- “quantization”
- “llm”
- “video-ram”
- “gpu-memory”
- “model-compression”
- “llm-inference”
updated: 2026-04-14
group: open-systems-local-models
aliases:
- “Video RAM”
- “GPU memory”
summary: “VRAM is the dedicated memory on a GPU used to store model weights, activations, and intermediate data during inference and training.”
backlinks:
- 2026-04-14 Best small LLM for local inference for instruction following
VRAM
Video RAM (VRAM) is the dedicated memory on a GPU used to store model weights, activations, and intermediate data during inference and training. Its capacity directly limits the size of models that can be executed on a single GPU, especially for resource-intensive tasks like large-language-model (LLM) deployment.
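As a rough rule of thumb, the weight footprint is the parameter count times the bytes per parameter, plus some headroom for activations, the KV cache, and framework buffers. The helper below is a minimal sketch of that estimate; the 20% overhead factor and the 70.6B parameter count are illustrative assumptions, not measured values.

```python
def estimate_weight_vram_gb(num_params: float, bits_per_param: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB: parameters x bytes/parameter, plus ~20%
    assumed headroom for activations, KV cache, and framework buffers."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param * overhead / 1e9

# Illustrative estimates for a 70.6B-parameter model at different precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{estimate_weight_vram_gb(70.6e9, bits):.0f} GB")
# 32-bit: ~339 GB, 16-bit: ~169 GB, 8-bit: ~85 GB, 4-bit: ~42 GB (with 20% overhead)
```

Under these assumptions, only the 4-bit estimate (~42GB) brings a 70B-class model within reach of a single 48GB card, which is what the bullets below rely on.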
- VRAM Constraints in LLMs: Large models like NVIDIA’s Llama 3.1 Nemotron 70B (70.6 billion parameters) require roughly 140GB of VRAM for the weights alone at 16-bit precision (the checkpoint ships as 30+ files at ~5GB each), and about twice that at full 32-bit precision, far exceeding any consumer GPU.
- Quantization as a VRAM Optimization: Quantization reduces model parameter precision (e.g., from 16-bit to 8-bit or 4-bit), cutting VRAM requirements by roughly 2–4× while maintaining acceptable accuracy. This enables deployment of large models on hardware with limited VRAM.
- Reference: [Adam Lucek - quantization]
- Small LLMs for Local Inference: For running instruction-following LLMs on a single 48GB NVIDIA GPU, quantized versions of Llama 3.1 70B, Gemma 2 27B, Qwen 2 72B, and Mistral Large are viable options (see the loading sketch after this list).
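A minimal sketch of loading one of these models with 4-bit quantization via Hugging Face Transformers and bitsandbytes, assuming a single 48GB GPU and that the checkpoint has already been downloaded with access granted; the model ID meta-llama/Llama-3.1-70B-Instruct is used only as an example, and exact settings may differ per model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Example checkpoint; swap for Gemma 2 27B, Qwen 2 72B, etc.
model_id = "meta-llama/Llama-3.1-70B-Instruct"

# 4-bit NF4 quantization: weights stored in 4 bits, compute done in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate; places layers on the available GPU(s)
)

prompt = "Explain why quantization reduces VRAM usage."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With NF4 weights, a 70B checkpoint occupies on the order of 35–40GB, leaving headroom for activations and the KV cache on a 48GB card; 8-bit weights (~70GB) would not fit on a single such GPU.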