RAG re-ranking with pruning - channel Prompt Engineering
https://www.youtube.com/watch?v=TvWhDZGzJiI

This video introduces a context engineering technique called Provence that aims to substantially reduce hallucination in Retrieval-Augmented Generation (RAG) systems by efficiently pruning irrelevant information from retrieved contexts before they are fed to the Large Language Model (LLM). Here's a breakdown of the key points:
- The "Garbage In, Garbage Out" Problem in RAG: Whether you use an advanced or a standard RAG system, the quality of the context provided to the LLM is crucial: bad context leads to bad outputs (hallucinations). Compare the two pipelines (sketched in code below):
  - Vanilla RAG: user query → index → documents (top-k chunks) → LLM → response. The main issue here is noise in the retrieved chunks.
  - RAG with re-ranker: user query → index → documents (top-n chunks) → re-ranker (selects the top-k relevant chunks) → LLM → response. Re-rankers improve chunk relevance, but they still pass entire chunks: even within highly relevant chunks, many sentences may be irrelevant to the specific user query, introducing noise.
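A minimal sketch of the two data flows, assuming hypothetical retrieve, rerank, and ask_llm helpers (the names and placeholder bodies are illustrative, not from the video):

```python
# Sketch of the two pipelines the video contrasts. retrieve(), rerank(),
# and ask_llm() are hypothetical stand-ins for a vector store, a
# cross-encoder re-ranker, and an LLM client; only the data flow matters.
from typing import List

def retrieve(query: str, n: int) -> List[str]:
    return [f"chunk {i}" for i in range(n)]  # placeholder index lookup

def rerank(query: str, chunks: List[str], k: int) -> List[str]:
    return chunks[:k]  # placeholder: a real re-ranker scores (query, chunk) pairs

def ask_llm(query: str, context: str) -> str:
    return "answer"  # placeholder LLM call

def vanilla_rag(query: str) -> str:
    chunks = retrieve(query, n=10)              # top-k chunks straight from the index
    return ask_llm(query, "\n\n".join(chunks))  # noisy: whole chunks go to the LLM

def rag_with_reranker(query: str) -> str:
    candidates = retrieve(query, n=20)          # over-fetch top-n candidates
    chunks = rerank(query, candidates, k=10)    # keep only the most relevant chunks
    return ask_llm(query, "\n\n".join(chunks))  # still whole chunks: intra-chunk noise remains
```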
- Practical Demonstration of the Problem: The speaker uses a local RAG system (localGPT) to query a technical report (DeepSeek-V3) about its training cost. Although the correct cost is present in the document, the initial RAG output (even with hybrid search, 20 retrieved chunks, and re-ranking down to 10 chunks) returns only the GPU hours, not the dollar amount. The retrieved chunks are large (512 tokens each), and even the "relevant" ones contain extraneous information such as table data and details about training mechanisms, burying the specific cost figure. The LLM struggles to extract the precise number amid this noise.
- Introducing Provence: Context Pruning for RAG: Provence comes from a January 2025 paper titled "Provence: efficient and robust context pruning for retrieval-augmented generation."
  - Core idea: Instead of just filtering chunks, Provence works at the sentence level. It takes a retrieved chunk and the user query and removes only the sentences that are irrelevant to the query, while still preserving local context (it does not process sentences in isolation, so sentences that depend on their neighbors remain intelligible). A toy illustration of sentence-level pruning follows this list.
  - Sentence-level relevance: It identifies and removes only the truly irrelevant sentences within a chunk.
  - Context preservation: It encodes all sentences together using a cross-encoder architecture, allowing it to understand the relationships between sentences and avoid pruning out necessary contextual information.
  - Automatic relevance detection: Provence automatically determines how many sentences are relevant, removing the need to set this as a manual hyperparameter.
  - Performance: Provence consistently outperforms other approaches (including standalone re-rankers) across various benchmarks, maintaining performance with little to no drop while achieving significant context compression (e.g., the 80% compression rate in the video's example means a 512-token chunk shrinks to roughly 100 tokens). Fewer tokens sent to the LLM means faster inference and lower cost.
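As a point of contrast, here is a deliberately naive sentence-level pruner built on a generic public cross-encoder from the sentence-transformers library. It is not Provence: it scores each (query, sentence) pair independently and needs a manual threshold, which are exactly the two limitations Provence removes, but it makes the pruning idea concrete:

```python
# Naive sentence-level pruning with a generic cross-encoder.
# Unlike Provence, this scores each sentence in isolation (losing
# inter-sentence context) and relies on a hand-picked threshold.
from sentence_transformers import CrossEncoder

scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def prune_chunk(query: str, chunk: str, threshold: float = 0.0) -> str:
    # Crude sentence splitting on periods; fine for a toy example.
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    scores = scorer.predict([(query, s) for s in sentences])
    kept = [s for s, score in zip(sentences, scores) if score > threshold]
    return ". ".join(kept) + ("." if kept else "")
```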
- Implementation and Usage: The Provence model (naver/provence-reranker-debertav3-v1) is available on Hugging Face, and the speaker provides a Google Colab notebook demonstrating its usage. You load the model, provide the full context and the question, and Provence returns a pruned_context (a much smaller, cleaner text snippet) along with a reranking_score and a compression_rate. A usage sketch follows below.
  - Integration into RAG Pipeline: Provence can be integrated as an additional step after the initial retriever and re-ranker, or it can potentially replace the re-ranker entirely, directly taking the initially retrieved chunks and pruning them before passing them to the LLM.
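A minimal usage sketch based on the model-card walkthrough in the video; the process call and the output keys follow the video's description, and the question/context values are placeholders:

```python
# Load Provence from Hugging Face. trust_remote_code=True is needed
# because the pruning logic ships as custom code with the checkpoint.
from transformers import AutoModel

provence = AutoModel.from_pretrained(
    "naver/provence-reranker-debertav3-v1",
    trust_remote_code=True,
)

question = "How much did DeepSeek-V3 cost to train?"  # placeholder query
context = "<full retrieved chunk goes here>"          # placeholder context

out = provence.process(question, context)
print(out["pruned_context"])    # the chunk with irrelevant sentences removed
print(out["reranking_score"])   # relevance score for the (query, chunk) pair
print(out["compression_rate"])  # how much of the original context was pruned
```

And a sketch of the integration described above, reusing the hypothetical retrieve and ask_llm helpers from the earlier pipeline sketch (here Provence replaces the re-ranker entirely):

```python
def rag_with_provence(query: str) -> str:
    candidates = retrieve(query, n=20)  # initial retrieval, no separate re-ranker
    pruned = [provence.process(query, c)["pruned_context"] for c in candidates]
    pruned = [p for p in pruned if p.strip()]  # drop chunks pruned down to nothing
    return ask_llm(query, "\n\n".join(pruned))
```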
- Limitations: The primary limitation highlighted is the license: Provence is currently licensed under CC BY-NC 4.0, which means it cannot be used for commercial purposes. The speaker hopes the community will train an Apache 2.0- or MIT-licensed version in the future.
In conclusion, Provence offers a promising approach to refining context for RAG systems, leading to more accurate, less hallucination-prone, and more efficient responses by intelligently reducing noise at the sentence level.

Links: localGPT repo used in the demo: https://github.com/PromtEngineer/localGPT
Colab notebook: https://colab.research.google.com/drive/1sMVAivJ1pn-7iNnByEPF4aUlQPCt_s39?usp=sharing