Generated: 2026-05-26 · API: Gemini 2.5 Flash · Modes: Summary
Tiiny AI Pocket Lab: Running Large Language Models Locally and Privately
Clip title: This Shouldn’t Be Able to Run 120B Locally
Author / channel: Alex Ziskind
URL: https://www.youtube.com/watch?v=RkzCAaIV_cQ
Summary
The video introduces the Tiiny AI Pocket Lab, a compact device designed to run large language models (LLMs) locally and privately, without the large and expensive GPU hardware such models usually need. The presenter notes the trend toward ever bigger GPUs and servers for AI, then reveals this pocket-sized device, which is claimed to handle models of up to 120 billion parameters, a claim he sets out to verify.
The Tiiny AI Pocket Lab weighs just 305 grams. It pairs an ARM v9.2 CPU with a Neural Processing Unit (NPU) rated at 30 INT8 TOPS, 80GB of LPDDR5X memory, and a 1TB PCIe 4.0 SSD. Models are stored and run on the device itself rather than on the host computer’s limited resources. The device connects over USB-C to a host computer (the demo uses a MacBook Neo with only 8GB of RAM), where its TiinyOS software provides the user interface. On its own, the host MacBook could only run a 4-billion-parameter model at 9 tokens/second. The Tiiny device ran the GPT-OSS-120B model (which typically requires 60-80GB of VRAM) locally at a decoding speed of 18.86 tokens/second, without stressing the MacBook’s memory.
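The memory figures above can be sanity-checked with simple arithmetic. The sketch below estimates the memory needed for model weights alone at a few bit widths. The bit widths are illustrative assumptions (the summary does not state which quantization the demo used), and the estimate ignores the KV cache, activations, and runtime overhead, so real requirements are higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billions: float, bits_per_weight: float,
         device_memory_gb: float, overhead_gb: float = 0.0) -> bool:
    """True if the weights plus an assumed overhead fit in device memory."""
    needed = weight_memory_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= device_memory_gb


# Assumed bit widths; 80 GB is the device memory quoted in the summary.
for bits in (16, 8, 4):
    gb = weight_memory_gb(120, bits)
    print(f"120B at {bits}-bit: {gb:.0f} GB, fits in 80 GB: {fits(120, bits, 80)}")
```

Under these assumptions, only the 4-bit case (about 60 GB of weights) fits in 80GB, which is consistent with the 60-80GB range quoted for GPT-OSS-120B.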
Beyond simple chat, TiinyOS offers an “Agent Store” of pre-built AI applications, including ChatMemo (an AI assistant), Presenton, RAGFlow, SD Web UI (for Stable Diffusion), and TiinyBot. It also provides an SDK and a command-line interface, so developers can call models from Python or directly from the terminal. Models are downloaded straight to the Tiiny device over Wi-Fi (an internet connection is required for the initial download) and then run completely offline, which keeps data private. The dashboard tracks token usage, which helps developers estimate costs if they later deploy the same workload to a cloud-based service. The device handles several model types, including coding models (such as Qwen3-Coder-30B, integrated into VS Code) and text-to-image models, although it manages its resources by loading and unloading models as needed.
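As a rough illustration of how tracked token usage could feed a cloud cost estimate, here is a minimal sketch. The `Usage` type, the token counts, and the per-million-token prices are all hypothetical placeholders, not TiinyOS dashboard fields or any real provider's pricing.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical token counts, e.g. copied by hand from a usage dashboard."""
    input_tokens: int
    output_tokens: int


def estimate_cost_usd(usage: Usage,
                      input_price_per_million: float,
                      output_price_per_million: float) -> float:
    """Estimate the cost of the same workload on a per-token-priced service."""
    total = (usage.input_tokens * input_price_per_million
             + usage.output_tokens * output_price_per_million)
    return total / 1_000_000


# Made-up workload and made-up prices, for illustration only.
usage = Usage(input_tokens=2_000_000, output_tokens=500_000)
cost = estimate_cost_usd(usage, input_price_per_million=0.50,
                         output_price_per_million=1.50)
print(f"Estimated cloud cost: ${cost:.2f}")
```

Input and output tokens are priced separately here because many hosted services charge different rates for each; substitute the actual rates of whichever service is being considered.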
The underlying technology, PowerInfer, is available on GitHub. It is a CPU/GPU LLM inference engine that manages which parts of a model are active, keeping frequently used parts “hot” and less common ones “asleep” to improve performance and keep power consumption low. The Tiiny AI Pocket Lab is not intended to replace high-end GPU rigs, but it brings private, local AI to less capable laptops and mini PCs, which makes it a compelling option for developers and users who want portable, on-device AI. It is currently available through a Kickstarter campaign and is a step toward making large language models accessible for personal and mobile use.
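The hot/asleep idea can be pictured with a toy example. The sketch below ranks neurons by how often they activated and keeps the most frequent ones in a limited "hot" budget. It is a simplified illustration of the general idea only, not PowerInfer's actual algorithm or API; the function names and activation counts are made up.

```python
def split_hot_cold(activation_counts: list[int], hot_budget: int):
    """Split neuron indices into (hot, cold) by activation frequency.

    The hot list holds the hot_budget most frequently activated neurons;
    all remaining neurons are cold.
    """
    ranked = sorted(range(len(activation_counts)),
                    key=lambda i: activation_counts[i], reverse=True)
    hot = sorted(ranked[:hot_budget])
    cold = sorted(ranked[hot_budget:])
    return hot, cold


def hot_coverage(activation_counts: list[int], hot: list[int]) -> float:
    """Fraction of all observed activations that land on hot neurons."""
    total = sum(activation_counts)
    if total == 0:
        return 0.0
    return sum(activation_counts[i] for i in hot) / total


# Made-up activation counts for eight neurons (index = neuron id).
counts = [900, 15, 640, 3, 480, 22, 7, 310]
hot, cold = split_hot_cold(counts, hot_budget=3)
print("hot:", hot, "cold:", cold)
print(f"hot neurons cover {hot_coverage(counts, hot):.0%} of activations")
```

When activations are skewed like this, a small hot set covers most of the work, which is why keeping only that set in fast memory can pay off.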
Video Description & Links
Description
I paired a tiny AI box with the MacBook Neo—and it seriously changed what I thought was possible with local AI. Tiiny box: https://tiiny.ai
👀 My favorite external drive (dependable): https://amzn.to/3Os9Wi3
👀 Thunderbolt 4 dock: https://amzn.to/3yVRicC
⚡ Other gear I use: https://www.amazon.com/shop/alexziskind
🎥 Related Videos 🎥
- 🧬🐍 Mac Studio CLUSTER vs M3 Ultra 🤯 - https://youtu.be/d8yS-2OyJhw
- 🧳🧰 Mini PC portable setup - https://youtu.be/4RYmsrarOSw
- 🍎💻 Dev setup on Mac - https://youtu.be/KiKUN4i1SeU
- 💸🧠 Cheap mini runs a 70B LLM 🤯 - https://youtu.be/xyKEQjUzfAk
- 🧪🔥 RAM torture test on Mac - https://youtu.be/l3zIwPgan7M
- 🍏⚡ FREE Local LLMs on Apple Silicon | FAST! - https://youtu.be/bp2eev21Qfo
- 🧠📉 REALITY vs Apple’s Memory Claims | vs RTX4090m - https://youtu.be/fdvzQAWXU7A
- ⚡💥 Thunderbolt 5 BREAKS Apple’s Upcharge - https://youtu.be/nHqrvxcRc7o
- 🧠🚀 INSANE Machine Learning on Neural Engine - https://youtu.be/Y2FOUg_jo7k
- 🧱🖥️ Mac Mini Cluster - https://youtu.be/GBR6pHZ68Ho
- 🛠️ Developer productivity Playlist - https://www.youtube.com/playlist?list=PLPwbI_iIX3aQCRdFGM7j4TY_7STfv2aXX
— — — — — — — — —
❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺
Click here to subscribe: https://www.youtube.com/@AZisk?sub_confirmation=1
Join this channel to get access to perks: https://www.youtube.com/channel/UCajiMK_CY9icRhLepS8_3ug/join
— — — — — — — — —
📱LET’S CONNECT ON SOCIAL MEDIA
ALEX ON TWITTER: https://twitter.com/digitalix
— — — — — — — — —
Tags
software developer, programmer, software development, programming, developer, developer tests, m3 chip, machine learning, llm, m3max, m3 machine learning, m3 ai, webui, openui, open webui, local ai, local chatgpt, chatgpt, ipx, gmktec, nuc, beelink, mini pc, m4 pro, mac mini, apple, apple mini, mini, m4 mini, m3 ultra, mac studio, gtr9, gtr9 pro, strix halo, ryzen, AI Max+ 395, ollama, comfy ui, tiiny, tiiny ai, tiiny pocket, tiiny pocket lab, pocket lab, macbook, macbook neo
URLs
- https://tiiny.ai
- https://amzn.to/3Os9Wi3
- https://amzn.to/3yVRicC
- https://www.amazon.com/shop/alexziskind
- https://youtu.be/d8yS-2OyJhw
- https://youtu.be/4RYmsrarOSw
- https://youtu.be/KiKUN4i1SeU
- https://youtu.be/xyKEQjUzfAk
- https://youtu.be/l3zIwPgan7M
- https://youtu.be/bp2eev21Qfo
- https://youtu.be/fdvzQAWXU7A
- https://youtu.be/nHqrvxcRc7o
- https://youtu.be/Y2FOUg_jo7k
- https://youtu.be/GBR6pHZ68Ho
- https://www.youtube.com/playlist?list=PLPwbI_iIX3aQCRdFGM7j4TY_7STfv2aXX
- https://www.youtube.com/@AZisk?sub_confirmation=1
- https://www.youtube.com/channel/UCajiMK_CY9icRhLepS8_3ug/join
- https://twitter.com/digitalix
Related Concepts
- Tiiny AI Pocket Lab
- Large Language Models (LLMs)
- Local and Private Computing
- GPU Hardware
- Local LLM Inference
- Neural Processing Unit (NPU)
- LPDDR5X Memory
- PowerInfer Engine
- Retrieval-Augmented Generation (RAG)
- Stable Diffusion
- Offload Computing
- ARM v9.2 Architecture
- Privacy-Preserving AI
- Model Quantization (INT8)
- Edge AI Hardware
- SDK Development
- Token Decoding Speed
- Offline Processing
- AI Agent Store
- USB-C Interface