Local LLM serving
The practice of deploying large language models (LLMs) on local, private hardware rather than through cloud-based APIs. Primary drivers include ai-security, reduced latency, and offline capability.
Core Technologies
- Inference Engines:
- vLLM: High-throughput serving engine utilizing PagedAttention for efficient memory management.
- llama.cpp: Optimized for local deployment across CPU and GPU backends through model-efficiency techniques such as quantization.
- Ollama: Simplified orchestration layer (built on llama.cpp) for downloading and running models locally; see the serving sketch after this list.
- Model Architectures:
- SmolLM family: Lightweight models from Hugging Face designed for efficient edge and on-device inference.
- Llama family: Industry-standard open-weights models from Meta.
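A minimal serving sketch, assuming a local Ollama or vLLM instance exposing an OpenAI-compatible endpoint; the port, model tag, and API key below are placeholder assumptions, not values taken from this note:

```python
from openai import OpenAI

# Assumption: an Ollama server is running on its default local port
# (a vLLM OpenAI-compatible server would typically listen on port 8000).
client = OpenAI(
    base_url="http://localhost:11434/v1",  # local endpoint, not a cloud API
    api_key="not-needed-locally",          # placeholder; local servers ignore it
)

# Assumption: a small model (e.g. a SmolLM or Llama variant) has already been pulled.
response = client.chat.completions.create(
    model="llama3.2",  # hypothetical local model tag
    messages=[{"role": "user", "content": "Summarize why local LLM serving matters."}],
)
print(response.choices[0].message.content)
```

Because both engines speak the same OpenAI-compatible protocol, the same client code can target either backend by changing only the base URL and model tag.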
Technical Fundamentals
- Execution Complexity: LLMs are not simple executable files; inference requires loading multi-gigabyte weight files into memory and managing them throughout execution.
- Optimization Drivers: Efficient serving relies heavily on memory mapping of weight files and careful optimization during both the loading and execution phases; see the sketch below.
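A minimal sketch of why memory mapping helps at load time, assuming a hypothetical raw float16 weights file (real engines such as llama.cpp apply the same idea to their own formats): the OS pages tensor data in lazily instead of copying the whole file into RAM up front.

```python
import mmap
import numpy as np

WEIGHTS_PATH = "model-weights.bin"  # hypothetical raw float16 tensor dump

with open(WEIGHTS_PATH, "rb") as f:
    # Map the file read-only: no bytes are copied into RAM until touched.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Zero-copy view of the mapped bytes as float16 weights.
weights = np.frombuffer(mapped, dtype=np.float16)

# Only the pages backing this slice are actually faulted in from disk,
# so a multi-gigabyte checkpoint can become usable almost immediately.
first_block = weights[:4096]
print(first_block.shape, first_block.dtype)
```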
Related Notes
- 2026 04 22 LLM Inference Engines Memory Mapping and Performance Optimization