Core Library

The Core Library refers to the foundational software stack enabling local Large Language Model (LLM) inference, management, and orchestration. It encompasses inference engines, model formats, and routing logic required to deploy AI workloads offline or on-premise.

Inference Engines & Runtimes

Model Formats

  • GGUF: The standard format for llama.cpp models, supporting metadata, tensor splits, and quantization schemes.
  • GGML: Older tensor format, largely superseded by GGUF.

Architecture & Components

  • Server: HTTP/REST API interface for interacting with the running model(s).
  • Backend: Handles the actual computation (CPU threads, CUDA/Vulkan/Metal GPU acceleration).
  • Frontend/UI: Interfaces like open-webui or text-generation-webui that consume the API.

Key Concepts

  • Quantization: Reducing model precision (e.g., Q4_K_M, Q8_0) to fit within VRAM/RAM constraints.
  • Context Window: The maximum sequence length the model can process, often limited by RAM availability.
  • KV Cache: Stores key-value pairs of processed tokens to speed up subsequent generation steps.

Integration Notes