Core Library
The Core Library refers to the foundational software stack enabling local Large Language Model (LLM) inference, management, and orchestration. It encompasses inference engines, model formats, and routing logic required to deploy AI workloads offline or on-premise.
Inference Engines & Runtimes
- llama.cpp: The primary C/C++ implementation for efficient GGUF model inference. Supports CPU and GPU offloading.
- Router Mode: A recent addition allowing native, hot-swappable switching between multiple loaded LLMs without restarting the server. This feature abstracts model management, enabling instant context switching for different tasks (e.g., coding vs. creative writing).
- See detailed analysis: llama.cpp Router Mode: Native Hot-Swappable-Local-LLM-Switching
Model Formats
- GGUF: The standard format for
llama.cppmodels, supporting metadata, tensor splits, and quantization schemes. - GGML: Older tensor format, largely superseded by GGUF.
Architecture & Components
- Server: HTTP/REST API interface for interacting with the running model(s).
- Backend: Handles the actual computation (CPU threads, CUDA/Vulkan/Metal GPU acceleration).
- Frontend/UI: Interfaces like open-webui or text-generation-webui that consume the API.
Key Concepts
- Quantization: Reducing model precision (e.g., Q4_K_M, Q8_0) to fit within VRAM/RAM constraints.
- Context Window: The maximum sequence length the model can process, often limited by RAM availability.
- KV Cache: Stores key-value pairs of processed tokens to speed up subsequent generation steps.
Integration Notes
- Ensure the Core Library is updated regularly to leverage performance improvements and new features like Router Mode.
- Cross-reference with Local LLM Deployment Strategy for hardware requirements.