🗂️ AI & Agents · View mindmap

Core Library

The Core Library refers to the foundational software stack enabling local Large Language Model (LLM) inference, management, and orchestration. It encompasses inference engines, model formats, and routing logic required to deploy AI workloads offline or on-premise.

Inference Engines & Runtimes

llama.cpp: The primary C/C++ implementation for efficient GGUF model inference. Supports CPU and GPU offloading.
- Router Mode: A recent addition allowing native, hot-swappable switching between multiple loaded LLMs without restarting the server. This feature abstracts model management, enabling instant context switching for different tasks (e.g., coding vs. creative writing).
- See detailed analysis: llama.cpp Router Mode: Native Hot-Swappable-Local-LLM-Switching

Model Formats

GGUF: The standard format for llama.cpp models, supporting metadata, tensor splits, and quantization schemes.
GGML: Older tensor format, largely superseded by GGUF.

Architecture & Components

Server: HTTP/REST API interface for interacting with the running model(s).
Backend: Handles the actual computation (CPU threads, CUDA/Vulkan/Metal GPU acceleration).
Frontend/UI: Interfaces like open-webui or text-generation-webui that consume the API.

Key Concepts

Quantization: Reducing model precision (e.g., Q4_K_M, Q8_0) to fit within VRAM/RAM constraints.
Context Window: The maximum sequence length the model can process, often limited by RAM availability.
KV Cache: Stores key-value pairs of processed tokens to speed up subsequent generation steps.

Integration Notes

Ensure the Core Library is updated regularly to leverage performance improvements and new features like Router Mode.
Cross-reference with Local LLM Deployment Strategy for hardware requirements.

NemoClaw Knowledge Wiki

Explorer

core-library

Core Library

Inference Engines & Runtimes

Model Formats

Architecture & Components

Key Concepts

Integration Notes

Graph View

Table of Contents

Backlinks