JeredBlu running an LLM locally
https://www.youtube.com/watch?v=Ar0Or9U0pCs
This video by JeredBlu provides an in-depth look at OpenAI’s recently released open-weight language model, gpt-oss (specifically the gpt-oss-20b variant).
Here’s a detailed summary of the video’s content:
1. Introduction to **gpt-oss** (0:00)
- OpenAI released gpt-oss-120b and gpt-oss-20b, which are “open-weight” language models, meaning their weights are publicly available, though not necessarily the full source code for training (the speaker acknowledges this distinction).
- The models are designed to deliver “strong real-world performance at low cost” and are available under the Apache 2.0 license.
- Key Benefit: Users can run these models for free, locally on their computers, without an internet connection, and crucially, without hitting API rate limits.
2. Motivation for Local Models & MCP Servers (0:28)
- The speaker highlights his personal reliance on cloud-based LLMs like Claude for productivity tasks involving integrations with services like Notion, Gmail, and Bright Data, integrations he collectively refers to as “MCP servers.”
- He sees local open-weight models like gpt-oss as a way to offload simpler, mundane tasks, preserving cloud usage for more complex or compute-intensive operations.
- Being able to run models privately also offers significant privacy advantages, as data isn’t sent to external servers unless explicitly desired.
3. Understanding Limitations: Performance & Context Window (0:47)
- Performance: gpt-oss is not equivalent to larger, more powerful frontier models like GPT-4o or Claude Sonnet; local machines have limited computing power compared to cloud infrastructure.
- Context Window (Crucial Point): This is the primary technical limitation for local models. The model’s inherent context window may be very large (e.g., gpt-oss-20b supports up to 131,072 tokens), but the computer’s available memory dictates the actual usable context window.
- Impact of Tools: Every enabled tool consumes tokens in the context window from the very beginning of a chat, even if only through its description. When tool calls are made, they consume even more tokens, rapidly filling the available context. This is a common issue with local open-source models, especially when many tools are enabled; the sketch below makes the cost concrete.
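To put a rough number on that tool-token cost, here is a minimal sketch (mine, not the video’s) that estimates the context consumed by a single tool description, using the tiktoken library with the o200k_base encoding as a stand-in for gpt-oss’s tokenizer; the scrape_page schema is a hypothetical example.

```python
import json
import tiktoken

# o200k_base is an approximation; gpt-oss ships its own tokenizer variant.
enc = tiktoken.get_encoding("o200k_base")

# Hypothetical MCP tool schema, roughly what gets serialized into the
# prompt for every enabled tool, on every turn of the chat.
scrape_page = {
    "name": "scrape_page",
    "description": "Fetch a URL and return its main text content as markdown.",
    "parameters": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "Page to scrape"},
        },
        "required": ["url"],
    },
}

tokens = len(enc.encode(json.dumps(scrape_page)))
print(f"~{tokens} tokens for one tool description")
# A few dozen tools of this size can eat a large fraction of a
# 4096-token window before the user has typed anything.
```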
4. Running **gpt-oss** Locally on Mac (2:29)
- The speaker notes two primary ways to run open-source models on Mac: Ollama and LM Studio. Both are applications that download and serve models locally.
- Switch to LM Studio (2:51): While he used Ollama for two years, he switched to LM Studio specifically because it offered better dynamic control over the context window and improved MCP server integration.
- LM Studio Setup (3:15): Download and install LM Studio from lmstudio.ai. Navigate to the “Discover” tab to search for and download models (e.g., openai/gpt-oss-20b).
- Manual Model Load Parameters (3:58): The speaker strongly recommends toggling on “Manually choose model load parameters” when loading a model. Ollama typically defaults to a 2,000-token context window; LM Studio defaults to 4,096 tokens. Users can raise this manually based on their computer’s unified memory: his MacBook Pro M4 with 36GB of unified memory can comfortably run at 32,768 tokens.
- MCP Server Integration (4:10): LM Studio allows users to install standard mcp.json-compatible plugins (like mcp/bright-data, mcp/basic-memory, mcp/context7, mcp/sequential-thinking). Users can easily toggle individual tools within these plugins on or off to manage context consumption; a sample config follows this list.
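For reference, LM Studio’s mcp.json uses the same “mcpServers” notation popularized by Cursor and Claude Desktop. A minimal sketch enabling one server; the entry shown is an illustrative assumption, not the speaker’s actual configuration:

```json
{
  "mcpServers": {
    "sequential-thinking": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-sequential-thinking"]
    }
  }
}
```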
5. Demonstration and Insights (5:07)
- Initial Test (4,096-token context): A simple “Hey, tell me about yourself” prompt fills 11% of the context window. A complex prompt asking the bright-data MCP to scrape a website quickly causes the context to jump to 74.8% full just from the tool descriptions and initial thoughts. The tool then fails, entering a “failure loop” and blowing the context window out to 1434.8% full. This vividly illustrates the rapid context consumption when tools are involved.
- Adjusted Test (32,768-token context): After increasing the context length, the same complex prompt fills only 29.2% of the context, and the tool successfully scrapes the requested information. This demonstrates the importance of matching the context window to the local hardware.
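Beyond the chat UI, LM Studio can also expose the loaded model through its OpenAI-compatible local server (default port 1234), so experiments like these can be scripted. A minimal sketch, assuming the server is enabled and gpt-oss-20b is loaded; the exact model identifier may differ on your machine:

```python
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the API key is ignored.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # assumption: use the identifier LM Studio shows
    messages=[{"role": "user", "content": "Hey, tell me about yourself."}],
)
print(resp.choices[0].message.content)
```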
6. Conclusion and Future Outlook (7:25)
- Benefits: Running gpt-oss locally is free, private, and independent of internet access.
- User Responsibility: Users need to adjust their usage of context and MCPs, and understand the memory limitations of their hardware.
- Optimization: The speaker expects continuous optimization of these models and platforms to improve performance and reduce memory footprint.
- Strategic Usage: He plans to offload simpler cloud-based MCP tasks to his local gpt-oss setup, saving his paid cloud API usage for more intensive tasks.
- Safety/Guardrails (8:16): OpenAI also launched a “red-teaming” challenge for gpt-oss. In his testing, the speaker found the model to be “quite prudent” and unwilling to perform certain actions (e.g., LinkedIn scraping) due to its safety protocols, even when direct tool calls were attempted. This shows that despite being open-weight, it retains some of OpenAI’s safety measures.
- Recommendation: He recommends LM Studio as the best platform for testing these models and encourages users to experiment with their computer’s specifications to find the optimal model and context window settings.