Adam Lucek RAG embedding model fine tuning
https://www.youtube.com/watch?v=v28Pu7hsJ0s

This detailed summary outlines the key concepts, methodology, and results presented in the video “Fine Tuning Embedding Models for Retrieval on Domain Specific Data.” The video, presented by Adam Lucek, focuses on optimizing Retrieval Augmented Generation (RAG) pipelines by fine-tuning embedding models for domain-specific data.

1. The Importance of Embedding Models in RAG
- Embedding models are crucial for RAG pipelines as they convert unstructured data (images, documents, audio) into dense vector representations.
- These vectors enable semantic similarity retrieval, which is the backbone of effective RAG.
- The accuracy of the retrieval step directly impacts the quality of answers generated by the Large Language Model (LLM), since the LLM’s generation is augmented by the retrieved context. An inaccurate retrieval step can lead to erroneous LLM outputs. (A minimal retrieval sketch follows this list.)
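To make the retrieval step concrete, here is a minimal sketch of embedding-based semantic search with the Sentence Transformers library; the model name and example texts are illustrative choices, not taken from the video.

```python
# Minimal semantic retrieval sketch: embed chunks and a query, then rank
# by cosine similarity. Model and texts are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any Sentence Transformer works here

# Dense vector representations of the knowledge base chunks.
corpus = [
    "The court held that the contract was unenforceable.",
    "The defendant's motion for summary judgment was denied.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed the user query into the same vector space.
query_embedding = model.encode("Which ruling voided the agreement?", convert_to_tensor=True)

# Cosine similarity scores: higher means more semantically relevant.
scores = util.cos_sim(query_embedding, corpus_embeddings)
best = scores.argmax().item()
print(corpus[best])
```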
2. The Problem with Generalized Embedding Models
- While powerful, generalized off-the-shelf embedding models (like OpenAI’s or other open-source models) often lose performance when applied to domain-specific or niche content.
- They may fail to retrieve the most relevant or useful documents from an end-user perspective, leading to suboptimal RAG performance.
3. The Solution: Full Fine-Tuning
- This video specifically explores full fine-tuning of open-source embedding models on a custom knowledge base. (This differs from previous work by the presenter on linear adapters for query embeddings).
- The goal is to significantly boost retrieved document quality with minimal data preparation using the Sentence Transformers library.
4. Dataset Preparation (**legal-rag-positives-synthetic**)
- Objective: Improve embedding model performance on unseen queries against an existing knowledge base. The method is not primarily aimed at generalizing to entirely new documents outside of the knowledge base.
- Data Structure: Embedding model training requires specific dataset structures. The video focuses on positive pairs: two related texts (here, a query and its corresponding relevant text chunk). Other common structures mentioned include triplets (anchor, positive, negative) and pairs with a similarity score.
- Specific Dataset: A synthetic dataset of Q&A chunk pairs derived from legal documents (court opinions mentioning AI, sourced from CourtListener’s public API). Knowledge Base: Consists of 10 legal court case opinions. Synthetic Pairs: GPT-4o was used to generate ~6,500 question-answer pairs (where the “answer” is a text chunk from the legal documents that can fully answer the question). A generation sketch follows this list.
- Data Pre-processing: Loaded from Hugging Face Hub. Columns renamed: `question` to `anchor`, `text` to `positive`. A simple `id` column was added. Data shuffled and split into a 90/10 train/test set (a pre-processing sketch also follows this list).
- Important Note on Negatives: The method used here (Multiple Negatives Ranking Loss) works with positive pairs and implicitly samples other documents in the batch as negatives. Hard negatives (very similar but irrelevant documents) can further boost performance if available.
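As a rough illustration of the synthetic generation step, here is a hypothetical sketch of asking GPT-4o for a question that a given chunk fully answers; the prompt wording and helper function are assumptions, not the video’s actual code.

```python
# Hypothetical sketch of synthetic (question, chunk) pair generation with
# GPT-4o; the prompt and function name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def question_for_chunk(chunk: str) -> str:
    """Ask GPT-4o for one question that the given chunk fully answers."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Write one specific question that the provided "
                           "legal text fully answers. Return only the question.",
            },
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content.strip()

# Each (generated question, source chunk) pair becomes one training row.
```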
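And a minimal sketch of the pre-processing itself with the `datasets` library; the `AdamLucek/` namespace on the dataset ID and the seed are assumptions.

```python
# Sketch of the pre-processing described above; the "AdamLucek/" namespace
# is assumed (only the dataset name appears in the summary).
from datasets import load_dataset

dataset = load_dataset("AdamLucek/legal-rag-positives-synthetic", split="train")

# Rename to the (anchor, positive) schema Sentence Transformers expects.
dataset = dataset.rename_column("question", "anchor")
dataset = dataset.rename_column("text", "positive")

# Add a simple integer id column, then shuffle and make a 90/10 split.
dataset = dataset.add_column("id", list(range(len(dataset))))
split = dataset.train_test_split(test_size=0.1, shuffle=True, seed=42)
train_dataset, test_dataset = split["train"], split["test"]
```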
5. Base Model Evaluation & Matryoshka Dimensions
- Candidate Model: `nomic-ai/modernbert-embed-base`. This is a Sentence Transformer model that outputs 768-dimensional dense vectors.
- Matryoshka Representation Learning (MRL): This technique trains models to encode information at different granularities within the same embedding vector. Higher-level information is packed into earlier dimensions, and finer details into later dimensions. Benefit: allows flexible truncation of the embedding to different sizes (e.g., 768, 512, 256, 128, 64 dimensions) while maintaining comparable accuracy, which significantly improves vector storage and retrieval speeds.
- Evaluation Setup (**InformationRetrievalEvaluator**): This evaluator requires three key data structures: a corpus dictionary mapping IDs to text chunks, a queries dictionary mapping query IDs to questions, and a `relevant_docs` dictionary specifying which corpus documents are relevant for each query (using `global_chunk_id` to handle multiple questions referring to the same chunk). An evaluation sketch follows this list.
- Evaluation Metrics:
  - Accuracy@k: Measures whether at least one relevant document appears in the top-k results.
  - NDCG@k (Normalized Discounted Cumulative Gain): Captures both the presence and the position of relevant documents in ranked results, valuing higher-ranked relevant documents more. This is considered the “true north star” for optimization (its standard form is given after this list).
  - Precision@k and Recall@k: Complementary metrics evaluating retrieval effectiveness. Precision@k is the fraction of relevant documents among the top-k results; Recall@k is the fraction of all relevant documents found within the top-k results.
  - Mean Reciprocal Rank (MRR@k): Focuses on the position of the first relevant document.
  - Mean Average Precision (MAP@k): Provides a comprehensive single-score assessment of ranking quality, incorporating both precision at each relevant document position and total recall.
- Base Model Performance (Pre-Fine-Tuning): The baseline `ndcg@10` at 768 dimensions was approximately 0.4435. Performance generally dropped with truncation (e.g., 0.2682 at 64 dimensions).
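Since NDCG@10 is the optimization target throughout, it may help to recall its standard binary-relevance form (a textbook definition, not stated in the video):

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

Here $rel_i$ is 1 if the document at rank $i$ is relevant and 0 otherwise, and IDCG@k is the DCG of the ideal ordering, so NDCG@k always lies in [0, 1].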
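Below is a hedged sketch of how such an evaluation loop might look with the Sentence Transformers `InformationRetrievalEvaluator`; the construction of the three dictionaries from the dataset is assumed, and the metrics-dict return format requires a recent (v3+) sentence-transformers release.

```python
# Hedged sketch of the multi-dimension evaluation described above. The
# contents of corpus/queries/relevant_docs are placeholders; only the
# structure is shown.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = {"chunk_0": "…full text chunk…"}   # global chunk id -> chunk text
queries = {"q_0": "…question…"}             # query id -> question text
relevant_docs = {"q_0": {"chunk_0"}}        # query id -> set of relevant chunk ids

# One evaluator pass per Matryoshka dimension: load the model truncated
# to that size, then score retrieval quality over the test queries.
for dim in [768, 512, 256, 128, 64]:
    model = SentenceTransformer("nomic-ai/modernbert-embed-base", truncate_dim=dim)
    evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
    )
    metrics = evaluator(model)
    print(dim, metrics[f"dim_{dim}_cosine_ndcg@10"])
```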
6. Training the Model
- Model Loading: The `modernbert-embed-base` model was loaded with Scaled Dot Product Attention (SDPA) for GPU efficiency.
- Loss Function: `MultipleNegativesRankingLoss` was chosen, as it is effective for retrieval scenarios with anchor/positive pairs, creating negative examples from the other samples in each batch.
- MRL Integration: The `MatryoshkaLoss` wrapper was used to apply the `MultipleNegativesRankingLoss` across all specified Matryoshka dimensions (768, 512, 256, 128, 64), ensuring the model learns useful representations at truncated sizes.
- Training Arguments (**SentenceTransformerTrainingArguments**):
  - `num_train_epochs=4`
  - `per_device_train_batch_size=32` (global batch size 512 with `gradient_accumulation_steps=16`)
  - `per_device_eval_batch_size=16`
  - `warmup_ratio=0.1`
  - `learning_rate=2e-5`
  - Cosine learning rate scheduler, fused AdamW optimizer
  - BF16 precision (for GPU efficiency)
  - `BatchSamplers.NO_DUPLICATES` to avoid duplicate negative samples
  - Evaluated after each epoch, saving the best model based on `eval_dim_128_cosine_ndcg@10`
- Training Duration: The training process took approximately 6 minutes. (A condensed training sketch follows this list.)
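The following sketch mirrors the arguments listed above using the sentence-transformers v3+ trainer API, continuing from the dataset and evaluator sketches in earlier sections; the output directory name is an assumption.

```python
# Condensed training sketch; argument values mirror the summary above.
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# Load with PyTorch's scaled-dot-product attention for GPU efficiency.
model = SentenceTransformer(
    "nomic-ai/modernbert-embed-base",
    model_kwargs={"attn_implementation": "sdpa"},
)

# In-batch negatives from MNRL, applied at every Matryoshka dimension.
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="modernbert-legal-mrl",           # assumed name
    num_train_epochs=4,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=16,              # 32 * 16 = 512 global batch
    per_device_eval_batch_size=16,
    warmup_ratio=0.1,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,   # no duplicate in-batch negatives
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    # Keep only the (anchor, positive) columns that MNRL expects.
    train_dataset=train_dataset.select_columns(["anchor", "positive"]),
    loss=loss,
    evaluator=evaluator,   # InformationRetrievalEvaluator from section 5
)
trainer.train()
```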
7. Evaluating the Fine-Tuned Model
- The fine-tuned model was evaluated using the same `InformationRetrievalEvaluator` and metrics as the base model.
- Results (Fine-Tuned vs. Base Comparison):
  - NDCG@10: Increased by 48.5% at 768 dimensions and by 59.4% at 64 dimensions. Notably, the fine-tuned 64-dimensional model (0.4275) performed almost as well as the base 768-dimensional model (0.4435).
  - MRR@10: Increases ranging from 52.7% to 67.5% across dimensions.
  - MAP@100: Improvements ranging from 46.6% to 60.4% across dimensions.
- Conclusion on Generalization: The impressive results indicate that the fine-tuning successfully generalized to unseen queries across the existing knowledge base. Further testing would be needed to understand its generalization to unseen documents outside of the knowledge base.
8. Using the Fine-Tuned Model
- The resulting fine-tuned model is available on Hugging Face Hub (`AdamLucek/ModernBERT-embed-base-legal-MRL`).
- It can be easily loaded using `SentenceTransformer`, and its `truncate_dim` argument allows loading different MRL dimensions (e.g., `truncate_dim=256` for the 256-dimensional representation), as shown below.
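A minimal usage sketch (the query text is illustrative):

```python
# Load the published model truncated to 256 Matryoshka dimensions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AdamLucek/ModernBERT-embed-base-legal-MRL", truncate_dim=256)

embedding = model.encode("What did the court decide about AI-generated evidence?")  # illustrative query
print(embedding.shape)  # (256,)
```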
The video concludes by reiterating the power and relative ease of fine-tuning embedding models for domain-specific retrieval, highlighting the significant performance gains achievable.