New Qwen agentic local LLM
https://www.youtube.com/watch?v=IaqzrByS8yA
This video provides a comprehensive guide to installing and testing the Qwen3-Coder-Flash model locally, with a special focus on its agentic coding and tool use capabilities. The presenter, Fahd Mirza, walks through the entire process, from setting up the environment to demonstrating the model’s advanced functionalities.
Qwen3-Coder-Flash Model Overview
The video begins with an introduction to the Qwen3-Coder-Flash model. Mirza mentions that he has previously covered the model’s architecture and benchmarks in detail in other videos, so here he provides only a brief overview. He highlights that, in his opinion, Qwen3-Coder is one of the best open-source, open-weight models available, particularly in the sub-30-billion-parameter range. He also emphasizes the model’s improved tool use functionality, which is the primary focus of this video.
Local Installation and Setup
To install the model locally, Mirza uses vLLM, a fast, low-latency inference engine. He provides a step-by-step guide on how to get vLLM installed and running. For the graphical user interface (GUI), he opts for Open WebUI but notes that users can choose any interface they prefer. The installation is performed on an Ubuntu system with an NVIDIA H100 GPU with 80GB of VRAM. The video shows the process of downloading the model, which consists of 16 shards, each 4GB in size. The model loading takes just under 90 seconds and consumes around 57GB of VRAM.
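The setup described above might look like the following shell commands. Note the exact Hugging Face repo ID for Qwen3-Coder-Flash and the Open WebUI invocation are assumptions on my part; check the vLLM and Open WebUI docs for your environment:

```shell
# Install vLLM and Open WebUI (a recent Python and an NVIDIA GPU assumed)
pip install vllm open-webui

# Download and serve the model (repo ID assumed; the video reports
# ~57GB of VRAM in use once the 16 shards are loaded)
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct

# In a second terminal, start the Open WebUI front end and point it
# at vLLM's OpenAI-compatible endpoint (default: http://localhost:8000/v1)
open-webui serve
```

Any OpenAI-compatible GUI can stand in for Open WebUI here, as Mirza notes.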
Testing and Demonstration
The video includes two main testing phases:
- Coding Problem: Mirza first tests the model’s coding capabilities by asking it to create a self-contained HTML page featuring an animation of a neural network. The model successfully generates the code, which, when opened in a browser, displays an animation of a “Living Neural Mind.”
- Agentic Coding and Tool Use: The second and more in-depth test focuses on the model’s agentic coding abilities. Mirza explains that agentic coding allows the AI model to act as an intelligent agent that can autonomously use tool functions to solve problems. To enable this, the model needs to be served with the `--enable-auto-tool-choice` and `--tool-call-parser hermes` flags in vLLM.
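Concretely, the serve command with tool calling enabled might look like this (the model repo ID is an assumption; the two flags are real vLLM options):

```shell
# Serve the model with automatic tool choice and the Hermes-format
# tool-call parser, as described in the video
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```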
He demonstrates this with a practical example where he asks the model to compare the weather comfort between New York and London for an upcoming trip. The model is given access to two functions: get_weather and calculate_comfort_index. The model intelligently understands the user’s request, identifies the need to use these functions, and creates the appropriate function calls with the correct parameters (city names). The code then executes these function calls and returns the results to the user.
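The client side of this loop can be sketched as follows. The tool names `get_weather` and `calculate_comfort_index` come from the video, but their bodies, the weather values, and the comfort formula below are invented placeholders; a real demo would fetch live data and send the results back to the model for a final answer:

```python
import json

# Hypothetical stand-ins for the two tools the model is given access to.
def get_weather(city: str) -> dict:
    # Placeholder values; a real implementation would call a weather API.
    fake_data = {
        "New York": {"temp_c": 22, "humidity": 55},
        "London": {"temp_c": 16, "humidity": 80},
    }
    return fake_data[city]

def calculate_comfort_index(temp_c: float, humidity: float) -> float:
    # Toy formula: penalize deviation from 21°C and humidity above 40%.
    return round(100 - abs(temp_c - 21) * 2 - max(0, humidity - 40) * 0.5, 1)

# Dispatcher: map a tool call emitted by the model onto a local function.
TOOLS = {"get_weather": get_weather,
         "calculate_comfort_index": calculate_comfort_index}

def execute_tool_call(call: dict):
    # `call` follows the OpenAI-style shape vLLM returns:
    # {"name": <tool name>, "arguments": <JSON string of parameters>}
    fn = TOOLS[call["name"]]
    return fn(**json.loads(call["arguments"]))

# Example: a call as the model might emit it for the London half of the trip.
call = {"name": "get_weather", "arguments": '{"city": "London"}'}
print(execute_tool_call(call))  # {'temp_c': 16, 'humidity': 80}
```

In the actual demo, the results of each call are appended to the conversation and sent back to the model, which then compares the two cities for the user.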
Sponsors and Additional Information
The video also features a sponsorship from Eigengent AI, a multi-agent workforce platform, and Massed Compute, a cloud computing service that offers affordable GPU and VM rentals. Mirza provides a discount code for Massed Compute in the video description. In conclusion, the video serves as a valuable resource for anyone interested in exploring the capabilities of the Qwen3-Coder-Flash model, particularly its advanced agentic coding features. It provides a clear and practical demonstration of how to install, set up, and test the model locally, making it accessible to a wide range of users.