https://www.youtube.com/watch?v=pTaSDVz0gok

This video provides a comprehensive guide to fine-tuning Large Language Models (LLMs) in Python, specifically for deployment with Ollama.

1. What is Fine-Tuning?
Fine-tuning takes a pre-trained language model and adapts it to excel at a more specific task.
- Analogy: It’s akin to training an experienced chef on your restaurant’s particular recipes rather than teaching someone to cook from zero.
- How it works: Instead of training from scratch, you start with a powerful pre-trained model (like ChatGPT or Claude) that already understands human language. You then feed it examples of your specific use case (e.g., customer service conversations, legal documents, medical records). The model adjusts its existing knowledge to perform better in that specific domain.
- Fine-Tuning vs. Parameter Tuning: Parameter Tuning: Adjusting settings like temperature or “Top_K” to change how the model performs (like adjusting a car radio). Fine-Tuning: Teaching the model new, specific skills (like teaching a car to drive in a completely different neighborhood).
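To make the "car radio" side of that analogy concrete, here is a minimal, self-contained sketch of what temperature and Top-K do at inference time, applied to a toy logit distribution (pure illustration, not tied to any particular model or library):

```python
import math

def apply_sampling_params(logits: dict, temperature: float = 1.0, top_k: int = 2) -> dict:
    """Apply Top-K filtering and temperature scaling to token logits,
    returning the resulting probability distribution."""
    # Top-K: keep only the top_k highest-scoring candidate tokens.
    kept = dict(sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = {tok: score / temperature for tok, score in kept.items()}
    z = sum(math.exp(s) for s in scaled.values())
    return {tok: math.exp(s) / z for tok, s in scaled.items()}

probs = apply_sampling_params({"cat": 2.0, "dog": 1.0, "fish": 0.1},
                              temperature=0.5, top_k=2)
print(probs)  # only "cat" and "dog" survive; low temperature makes "cat" dominate
```

Nothing about the model's weights changes here, which is exactly why these knobs cannot teach the model new skills the way fine-tuning can.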
2. When to Fine-Tune: Fine-tuning is beneficial in three main scenarios:
- Consistent Formatting or Style: When prompting alone can’t achieve the desired consistent output (e.g., a specific JSON schema, a particular writing style).
- Domain-Specific Data: When you have a lot of data unique to your domain that the pre-trained model hasn’t naturally encountered (e.g., advanced medical records, internal customer service logs).
- Cost Reduction: Using a smaller, specialized fine-tuned model can be more cost-effective than relying on a massive, general-purpose LLM for specific tasks.
- Key Advantage: Requires significantly less data and compute power compared to training from scratch (thousands of examples and minutes/hours of training vs. millions of examples and months).
- Caveat: Fine-tuning can make models worse at general tasks while improving their specialized capabilities.
3. Practical Steps: Fine-Tuning with Unsloth in Google Colaboratory

A. Gathering Data
- Importance: High-quality data is paramount for effective fine-tuning.
- Format: The video uses a JSON dataset where each entry has an “input” (e.g., HTML snippet) and an “output” (e.g., structured JSON extracted product information). This format is flexible for various tasks.
- Tool: The video uses Unsloth, an open-source library known for its speed in fine-tuning LLMs.
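A minimal sketch of what one record in such a dataset might look like (the `input`/`output` field names follow the video's description; the HTML snippet and product values here are made up for illustration):

```python
import json

# Hypothetical example entry mirroring the described dataset layout:
# each record pairs an HTML snippet ("input") with the structured
# product data the model should learn to extract ("output").
example_entry = {
    "input": "<div class='product'><h2>Wireless Mouse</h2>"
             "<span class='price'>$24.99</span></div>",
    "output": json.dumps({"name": "Wireless Mouse", "price": "$24.99"}),
}

# A full dataset file would simply be a JSON array of such records.
dataset = [example_entry]
print(json.dumps(dataset, indent=2))
```

Keeping every record in this same input/output shape is what lets a single formatting function (see section C) turn the whole dataset into training strings.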
B. Setting up the Environment (Google Colab)
- Why Colab: Recommended for its free access to GPUs (such as the T4) for model training, which is crucial because local training can be very time-consuming without powerful hardware.
- Steps:
  - Connect to Runtime: Open the provided Colab notebook and connect to a T4 GPU runtime.
  - Upload Data: Upload your `json_extraction_dataset_500.json` file to the Colab environment.
  - Install Dependencies: Run `!pip install unsloth trl peft accelerate bitsandbytes` to install the necessary libraries. Restart the session if prompted.
C. Fine-Tuning the Model
- Load Model & Tokenizer: Choose a base model (e.g., `unsloth/Phi-3-mini-4k-instruct-bnb-4bit`). Load the model and its tokenizer using `FastLanguageModel.from_pretrained`.
- Preprocess Data: Define a `format_prompt` function to structure each input and output into a single string (e.g., `### Input: {input}\n### Output: {json_output}<|end_of_text|>`). Apply this function to your dataset to create `formatted_data`, then convert it into a `Dataset` object using `Dataset.from_dict({"text": formatted_data})`.
- Add LoRA Adapters: Apply LoRA (Low-Rank Adaptation) adapters to the model using `FastLanguageModel.get_peft_model()`. This enables efficient fine-tuning by adding small, trainable layers. The video sets `r=64` (LoRA rank, a capacity/memory tradeoff) and `lora_alpha=128` (scaling factor).
- Train the Model: Initialize the `SFTTrainer` from `trl`, passing the model, tokenizer, and training dataset. Configure `TrainingArguments` (e.g., `per_device_train_batch_size`, `gradient_accumulation_steps`, `num_train_epochs`, `learning_rate`, `output_dir`). Execute `trainer_stats = trainer.train()`. This step’s duration depends on dataset size and GPU power (e.g., about 10 minutes for 500 examples on a T4 GPU).
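The preprocessing step can be sketched in plain Python (the `### Input:`/`### Output:` layout and end-of-text terminator follow the video; the helper name `format_prompt` matches the one mentioned above, and the sample record is made up for illustration):

```python
def format_prompt(example: dict) -> str:
    """Flatten one dataset record into the single training string the
    SFTTrainer will see: input, expected output, end-of-text marker."""
    return (
        f"### Input: {example['input']}\n"
        f"### Output: {example['output']}<|end_of_text|>"
    )

# Applied over the whole dataset, this produces the "text" column
# passed to Dataset.from_dict({"text": formatted_data}).
formatted_data = [format_prompt(ex) for ex in [
    {"input": "<div>Mouse $24.99</div>",
     "output": '{"name": "Mouse", "price": "$24.99"}'},
]]
print(formatted_data[0])
```

Because the model is trained on exactly this layout, the same markers can later be used at inference time to locate the generated JSON in the decoded output.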
D. Testing the Fine-Tuned Model
- Inference Mode: Prepare the model for inference using `FastLanguageModel.for_inference(model)`.
- Test Prompt: Create a test prompt in the specified message format (e.g., `{"role": "user", "content": "Extract the product information: <div>...</div>"}`).
- Generate Response: Use the tokenizer to prepare the inputs, then call `model.generate()` to get the output.
- Decode & Print: Decode the output to see the model’s response. The video demonstrates that the fine-tuned model successfully extracts the information in the desired JSON format.
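Since the model was trained on the `### Input:`/`### Output:` layout, its decoded generation can be post-processed with plain string handling. A hedged sketch (the `decoded` string below is a made-up stand-in for a real `tokenizer.batch_decode(model.generate(...))` result):

```python
import json

def extract_json_output(decoded: str) -> dict:
    """Pull the JSON that follows the '### Output:' marker out of a
    decoded generation and parse it, stripping the end-of-text token."""
    _, _, tail = decoded.partition("### Output:")
    tail = tail.replace("<|end_of_text|>", "").strip()
    return json.loads(tail)

# Stand-in for a decoded model generation:
decoded = ('### Input: <div>Mouse $24.99</div>\n'
           '### Output: {"name": "Mouse", "price": "$24.99"}<|end_of_text|>')
print(extract_json_output(decoded))
```

Parsing the output with `json.loads` rather than eyeballing it is also a quick sanity check that the fine-tune really does emit valid JSON.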
E. Deploying the Model with Ollama
- Save Model in GGUF Format: Save the fine-tuned model in the GGUF format, which is compatible with Ollama, using `model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="q4_k_m")`. This process can take a significant amount of time (e.g., 10-20 minutes).
- Download Model to Local Machine: After saving, the `.gguf` file will be available in your Colab environment. Use `from google.colab import files` and `files.download(gguf_file)` to download it to your local computer.
- Create a Modelfile for Ollama: Create a new directory (e.g., `ollama-test`) on your local machine and place the downloaded `.gguf` file inside it. Inside `ollama-test`, create a new file named `Modelfile` (no extension). Edit the `Modelfile` with the following structure (adjusting the `.gguf` filename and template as needed):

```
# Your downloaded model file
FROM ./unsloth.Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|user|>"

TEMPLATE """<|user|>
{{ .Prompt }}<|assistant|>
"""

SYSTEM """You are a helpful AI assistant."""
```
- Add Model to Ollama: Navigate to your `ollama-test` directory in your terminal. Run `ollama create html-model -f Modelfile` (replace `html-model` with your desired model name).
- Run and Test Locally: Verify the model is listed using `ollama list`. Run your fine-tuned model locally using `ollama run html-model`. Paste your test prompt; the model should respond with structured output based on its fine-tuning.
This process enables you to create and deploy custom, specialized LLMs for your specific tasks directly on your local machine using Ollama.