Model Weights
Parameters learned during training that define a model’s behavior. They are stored in files (e.g., .bin, .pt) and loaded for inference. Size grows with parameter count and numeric precision and directly impacts computational requirements (e.g., 20B parameters at 16-bit precision ≈ 40 GB of storage).
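The storage figure above follows from simple arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming raw weights only (no file headers, metadata, or runtime buffers) and 1 GB = 10^9 bytes; the function name `weight_storage_gb` is hypothetical:

```python
def weight_storage_gb(num_params: int, bits_per_param: int = 16) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 10**9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Compare a 20B-parameter model at several common precisions.
for bits in (32, 16, 8, 4):
    size = weight_storage_gb(20_000_000_000, bits)
    print(f"{bits:>2}-bit: {size:.0f} GB")
```

At 16 bits this reproduces the 40 GB estimate; halving the precision halves the footprint, which is the motivation for quantization.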
Key Characteristics
- Open-weight models (e.g., [[concepts/gpt-oss-20b|gpt-oss-20b]]) publicly share weights while keeping training code proprietary
- Local deployment requires downloading weights (e.g., via Hugging Face Hub)
- Inference runs locally against the downloaded weights, with no cloud dependency; loading and running them involves an inference engine, memory mapping, and performance optimization rather than simply executing a file (see Technical Overview of LLM Inference: Loading, Memory, and Quantization)
- Quantization reduces weight precision to decrease memory footprint and accelerate loading, enabling deployment of larger models on hardware with limited resources
- Memory management during weight loading requires specialized tensor allocation and mapping strategies to ensure efficient hardware utilization and minimize latency
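
The quantization and memory-mapping points can be illustrated together with a toy example. This is a sketch under stated assumptions, not how any particular inference engine works: it uses symmetric per-tensor int8 quantization, a made-up file layout (a 4-byte float scale followed by int8 values), and the hypothetical helper names `quantize_int8` and `dequantize`. Real engines use their own formats and usually quantize per block or per channel.

```python
import mmap
import os
import struct
import tempfile


def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 values plus one scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0, 0.75, -0.125]
q, scale = quantize_int8(weights)

with tempfile.TemporaryDirectory() as tmp:
    # Assumed toy layout: little-endian float32 scale, then the int8 payload.
    path = os.path.join(tmp, "weights.q8")
    with open(path, "wb") as f:
        f.write(struct.pack("<f", scale))
        f.write(struct.pack(f"<{len(q)}b", *q))
    file_size = os.path.getsize(path)

    # Memory-map the file read-only: the OS pages data in on access,
    # so only the bytes actually touched need to be read from disk.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (stored_scale,) = struct.unpack_from("<f", mm, 0)
            first_three = struct.unpack_from("<3b", mm, 4)
            restored = dequantize(first_three, stored_scale)

print("bytes on disk:", file_size, "vs float32:", 4 * len(weights))
print("restored:", restored)
```

Each weight takes one byte instead of four, at the cost of a rounding error bounded by roughly half the scale. The mapped read touches only the header and the first three values, which is the property that lets engines open very large weight files without copying them fully into memory first.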