Matthew Berman https://www.youtube.com/watch?v=9t-BAjzBWj8 Here is a detailed summary of the video tutorial on setting up and running local Reinforcement Learning (RL) using Nvidia and Unsloth.
Tutorial: Running Reinforcement Learning Locally to Master 2048
Presenter: Matthew Berman (in partnership with Nvidia)
Goal: To teach an AI model (GPT-OSS) to master the game 2048 using Reinforcement Learning on a home gaming PC.
1. Introduction to the Concept
- The Power of RL: Reinforcement Learning is the technology behind AI surpassing humans in Chess, Go, League of Legends, and autonomous driving.
- The Shift: Previously, RL required massive, expensive compute clusters. Thanks to optimizations (Unsloth) and consumer hardware (Nvidia RTX GPUs), this can now be done locally.
- Specific Technique: The tutorial uses Reinforcement Learning with Verifiable Rewards.
2. Prerequisites & Hardware
- Hardware: An Nvidia RTX GPU.
- Note: The presenter uses an RTX 5090, but emphasizes that any recent Nvidia architecture will work (though speeds may vary).
- Software Model: GPT-OSS (OpenAI’s open-weight model).
- Optimization Library: Unsloth (An open-source library that optimizes fine-tuning and training speed).
- Operating System: Windows, utilizing WSL (Windows Subsystem for Linux).
3. Installation Guide (Step-by-Step)
The presenter recommends using WSL (Ubuntu) as the most straightforward installation method.
A. Drivers and System Setup
- Update Drivers: Ensure Nvidia GPU drivers are up to date via the Nvidia app or website.
- Install CUDA Toolkit: Download and install the CUDA Toolkit 13.1 (Linux/WSL version) from the Nvidia website.
- Install WSL (PowerShell): Open PowerShell and run `wsl.exe --install --distribution Ubuntu-24.04`
- Load Ubuntu: `wsl.exe -d Ubuntu-24.04`
- Verify GPU Connection: Inside the Linux terminal, run `nvidia-smi` to confirm the GPU is recognized.
B. Python Environment Setup
- Update Packages: `sudo apt update`
- Install Python & Pip: Install Python 3, pip, and venv (virtual environment tools) with `sudo apt install python3 python3-pip python3-venv -y`
- Create Virtual Environment:
  - Create: `python3 -m venv unsloth_env_rl`
  - Activate: `source unsloth_env_rl/bin/activate`
C. Install Libraries
- Install Torch: Install PyTorch and Torchvision (e.g., `pip install torch torchvision`).
- Install Unsloth: `pip install unsloth`
- Install Jupyter: `pip install jupyter`
- Launch: Run `jupyter notebook` to start the local server.
4. The Reinforcement Learning Process (Jupyter Notebook)
The presenter downloads a specific notebook from the Unsloth website titled “GPT-OSS (20B) - Auto win 2048 game.”
A. Loading the Model
- The notebook uses `FastLanguageModel` from Unsloth to load `gpt-oss-20b`.
- LoRA (Low-Rank Adaptation): This technique makes training efficient. It adds only 1-5% extra trainable weights to the model for fine-tuning, reducing memory usage by over 60% while retaining accuracy.
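The "1-5%" figure falls out of simple arithmetic: LoRA freezes each full weight matrix and trains only two thin low-rank factors alongside it. A back-of-the-envelope sketch (the hidden size and rank below are illustrative assumptions, not gpt-oss-20b's real shapes):

```python
# Back-of-the-envelope LoRA overhead for a single weight matrix.
# Hidden size and rank below are illustrative assumptions, not gpt-oss-20b's shapes.

def lora_extra_params(d_in, d_out, rank):
    """LoRA freezes W (d_out x d_in) and trains two thin factors:
    A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

d = 4096      # a typical transformer hidden size (assumption)
rank = 32     # a common LoRA rank (assumption)

base = d * d
extra = lora_extra_params(d, d, rank)
print(f"{extra:,} extra params = {100 * extra / base:.2f}% of the base matrix")
```

The exact percentage depends on the rank chosen and which layers get adapters, but it stays a small fraction of the base weights, which is why fine-tuning fits on a single RTX card.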
B. The Environment (The Game)
- The notebook contains Python code for the game 2048 (noted as being written by GPT-5).
- Testing: The presenter runs code blocks to verify the game logic works (moving tiles with W, A, S, D keys).
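To make the environment concrete, here is a minimal sketch of 2048's core merge rule — my own simplified version, not the GPT-5-written game code shipped in the notebook:

```python
# Minimal sketch of 2048's "move left" merge rule -- a simplified illustration,
# not the notebook's actual game implementation.

def merge_row_left(row):
    """Slide non-zero tiles left, merging equal neighbors once per move."""
    tiles = [t for t in row if t != 0]               # drop the gaps
    merged, i = [], 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)              # equal neighbors fuse
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))   # pad back to width

# A full "A" move applies this to every row; W/S/D are rotations of the same rule.
print(merge_row_left([2, 2, 4, 0]))  # → [4, 4, 0, 0]
```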
C. The RL Strategy
- The Prompt: The model is prompted to “Create a new short 2048 strategy using only native Python code” based on the current board state.
- The Loop:
- The Model generates a Python function (a strategy).
- The System extracts and executes that code against the game.
- Reward Functions:
- Function Works: Did the model write valid Python?
- No Cheating: Did the model try to manipulate the board illegally?
- Strategy Succeeds: Did the strategy actually win the game or score points?
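The three checks above can be sketched as one scoring function. Everything here — the function names, the -1 penalty, the naive regex-based "no cheating" check, and passing the game score in as a parameter — is an illustrative assumption, not the notebook's actual reward code:

```python
import re

# Illustrative sketch of the three reward checks; the notebook's real reward
# functions, penalty values, and cheat detection are more involved.

def extract_code(completion):
    """Pull the first fenced Python block out of a model completion."""
    m = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    return m.group(1) if m else None

def score_completion(completion, game_score):
    """Collapse the three checks into one scalar reward."""
    code = extract_code(completion)
    if code is None:
        return -1.0                                  # no code produced
    try:
        compile(code, "<strategy>", "exec")          # 1. valid Python?
    except SyntaxError:
        return -1.0
    if re.search(r"\bboard\s*(\[[^\]]+\]\s*)*=(?!=)", code):
        return -1.0                                  # 2. naive cheat check: no writes to the board
    return 1.0 + game_score / 100.0                  # 3. reward actual points/wins

demo = "```python\ndef strategy(board):\n    return 'left'\n```"
print(score_completion(demo, game_score=150))  # → 2.5
```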
D. Training (GRPO)
- The notebook uses GRPO (Group Relative Policy Optimization).
- The notebook is configured for roughly 1,000 training steps, though fewer are needed in practice.
- Feedback Loop: If a strategy yields a high score/win, the model is rewarded. If it fails (syntax error or game over), it is penalized (e.g., Reward score: -1).
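GRPO's core idea in miniature: rather than training a separate value model, each completion's advantage is its reward normalized against the group of completions sampled for the same prompt. The sketch below is a simplification of that one step; the real trl/Unsloth implementation adds clipping, KL regularization, and token-level details:

```python
# Core idea of GRPO (Group Relative Policy Optimization), simplified:
# advantages are computed relative to the group of completions sampled
# for the same prompt, with no learned value model.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the group's mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four strategies sampled for one board: two failed (-1), one scored, one won.
rewards = [-1.0, -1.0, 2.5, 10.5]
print([round(a, 2) for a in group_relative_advantages(rewards)])
```

Samples with above-average rewards get positive advantages and are reinforced; the failures get negative advantages and are pushed away from.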
5. Results
- Before Training: The model produces generic strategies (e.g., “Always move left”) which fail immediately or time out.
- During Training:
- The GPU ramps up (inference running at approx. 60% load).
- The model iterates through failures.
- After Training (84 Iterations):
- The reward score jumps to 10.5.
- The game output shows the board achieving the 2048 tile.
- The model has successfully learned a Python coding strategy to solve the game based on board states.
6. Conclusion & Takeaways
- Time Commitment: The entire setup and training took approximately 6 hours.
- Significance:
- This demonstrates that advanced model training is no longer exclusive to massive tech labs.
- Users can “Reinforcement Learn” models for custom tasks (financial analysis, personalized assistants, complex gaming) entirely offline.
- Privacy & Control: Running this locally ensures data privacy and allows for high levels of customization.
- Resources: All code, links, and commands are provided in the video description/documentation.