04 RAG Advanced Techniques and Optimization

RAG Advanced Techniques and Optimization Overview

RAG (Retrieval Augmented Generation) involves three core steps: Indexing, Retrieval, and Generation. Optimizing these steps is crucial for building robust RAG applications.

Optimization Dimensions:

Solid Foundation: Knowledge Base Processing: Focuses on improving the quality and usability of the knowledge base.
Precise Radar: Efficient Retrieval: Enhances how relevant information is found within the knowledge base.
Global View: GraphRAG: Utilizes knowledge graphs for a more structured and interconnected understanding of information.
Intelligent Decision-making: Agentic RAG: Integrates RAG into intelligent agents for complex tasks.

Solid Foundation: Knowledge Base Processing

This section details methods to build and maintain a high-quality knowledge base.

Scenario 1: Knowledge Base Question Generation & Retrieval Optimization

To improve retrieval accuracy when user queries have low similarity to knowledge chunks, AI can generate potential questions for each chunk. These generated questions serve as an additional index for retrieval.

Core Concepts:

Automatic Question Generation: LLMs generate diverse questions (direct, indirect, comparative, conditional) for each knowledge chunk.
Dual Retrieval Index: Build BM25 indexes based on both original content and generated questions.
Retrieval Evaluation: Compare the accuracy of original content-based retrieval vs. question-based retrieval.
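
The dual-index idea can be sketched in a few lines. This is a minimal illustration, not the original implementation: `TinyBM25` is a hand-rolled stand-in for a BM25 library, and the chunks, generated questions, and helper names (`tokenize`, `top_owner`) are hypothetical. In practice the questions would come from the LLM.

```python
import math
from collections import Counter

class TinyBM25:
    """Minimal Okapi BM25 over pre-tokenized documents (illustration only)."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.corpus = corpus
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in corpus) / len(corpus)
        self.tf = [Counter(d) for d in corpus]
        df = Counter(term for d in corpus for term in set(d))
        n = len(corpus)
        # Smoothed IDF that stays positive for every term
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def scores(self, query_tokens):
        out = []
        for tf, doc in zip(self.tf, self.corpus):
            norm = self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            out.append(sum(
                self.idf.get(t, 0.0) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in query_tokens
            ))
        return out

def tokenize(text):
    # Naive whitespace tokenizer; a real system would use a proper tokenizer
    return text.lower().replace("?", "").replace(".", "").split()

# Hypothetical chunks and hand-written stand-ins for LLM-generated questions
chunks = {
    "c1": "Refunds are processed within seven business days after approval.",
    "c2": "Premium members receive free shipping on all orders.",
}
generated_questions = {
    "c1": ["How long until I get my money back?", "When will a returned item be reimbursed?"],
    "c2": ["Do I have to pay for delivery?", "Is delivery free for premium accounts?"],
}

chunk_ids = list(chunks)
content_index = TinyBM25([tokenize(chunks[c]) for c in chunk_ids])

# Question index: one entry per generated question, mapped back to its chunk
question_owner = [c for c in chunk_ids for _ in generated_questions[c]]
question_index = TinyBM25([tokenize(q) for c in chunk_ids for q in generated_questions[c]])

def top_owner(index, owners, query):
    """Return the chunk ID of the best-scoring entry, or None if nothing matches."""
    scores = index.scores(tokenize(query))
    best = max(range(len(scores)), key=scores.__getitem__)
    return owners[best] if scores[best] > 0 else None

query = "Will I pay for delivery?"
print("content index :", top_owner(content_index, chunk_ids, query))
print("question index:", top_owner(question_index, question_owner, query))
```

In this toy case the query shares no terms with either chunk, so the content index finds nothing, while the question index matches a generated question and maps it back to its chunk.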

Example LLM Prompt for Diverse Questions:

JSON

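{
  "questions": [
    {
      "question": "question text",
      "question_type": "question type",
      "difficulty": "difficulty level",
      "perspective": "questioning angle",
      "is_answerable": "whether the given knowledge can answer this question",
      "answer": "answer based on the given knowledge"
    }
  ]
}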

Evaluation Results (Example):

BM25 Original Retrieval Accuracy: 66.7%
BM25 Question Retrieval Accuracy: 100.0%
Indicates significant improvement with question-based retrieval for certain queries.
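
A top-1 accuracy comparison of this kind could be computed roughly as follows. The sketch assumes each test case pairs a query with the ID of the chunk that should rank first; the lookup tables are toy stand-ins for the two retrievers and are constructed only to illustrate the calculation, not taken from the original experiment.

```python
def top1_accuracy(retrieve, test_cases):
    """Fraction of queries whose top-ranked chunk ID equals the expected ID."""
    hits = sum(1 for query, expected in test_cases if retrieve(query) == expected)
    return hits / len(test_cases)

# Toy stand-ins: each "retriever" is a dict lookup from query to top chunk ID
content_lookup = {"q1": "c1", "q2": "c3", "q3": "c3"}   # misses q2
question_lookup = {"q1": "c1", "q2": "c2", "q3": "c3"}  # all correct
test_cases = [("q1", "c1"), ("q2", "c2"), ("q3", "c3")]

content_acc = top1_accuracy(content_lookup.get, test_cases)
question_acc = top1_accuracy(question_lookup.get, test_cases)
print(f"content: {content_acc:.1%}, question: {question_acc:.1%}")
```

With only a handful of test queries, a single miss moves the figure by a large amount, so a comparison like this is more convincing on a larger test set.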

Scenario 2: Conversational Knowledge Precipitation

This focuses on extracting and consolidating valuable knowledge from daily user-AI conversations to enrich the knowledge base.

Core Functions:

extract_knowledge_from_conversation(): Uses an LLM to extract structured knowledge (facts, needs, questions, procedures, precautions) from a single conversation, including confidence, source, keywords, category, summary, and user intent.
batch_extract_knowledge(): Processes multiple conversations.
merge_similar_knowledge(): Uses an LLM to combine similar knowledge points into a more comprehensive entry.

Example LLM Prompt for Knowledge Extraction:

JSON

{
  "extracted_knowledge": [
    {
      "knowledge_type": "knowledge type (fact/need/question/procedure/precaution)",
      "content": "knowledge content",
      "confidence": "confidence (0-1)",
      "source": "source (user/AI/conversation)",
      "keywords": ["keyword 1", "keyword 2"],
      "category": "category"
    }
  ],
  "conversation_summary": "conversation summary",
  "user_intent": "user intent"
}

Knowledge Merging Process:

Filter out temporary knowledge types (e.g., needs and questions).
Group remaining knowledge by type.
Use an LLM to merge multiple knowledge points within each type, preserving important information, eliminating redundancy, and improving accuracy.
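
The three steps above might look like this in code. It is a sketch under assumptions: the type labels are English placeholders, and `merge_with_llm` is a hypothetical stub that concatenates contents where the real system would call the LLM with the merge prompt.

```python
from collections import defaultdict

# Assumed labels for the temporary types that are dropped before merging
TEMPORARY_TYPES = {"need", "question"}

def merge_with_llm(knowledge_type, items):
    """Placeholder for the LLM merge call; here it only joins unique contents."""
    contents = list(dict.fromkeys(item["content"] for item in items))
    return {
        "knowledge_type": knowledge_type,
        "content": " ".join(contents),
        # The merged entry takes the highest confidence in the group
        "confidence": max(item["confidence"] for item in items),
        "frequency": len(items),
    }

def merge_similar_knowledge(items):
    # Steps 1 and 2: drop temporary types, then group the rest by type
    grouped = defaultdict(list)
    for item in items:
        if item["knowledge_type"] not in TEMPORARY_TYPES:
            grouped[item["knowledge_type"]].append(item)
    # Step 3: merge each group into a single entry
    return [merge_with_llm(ktype, group) for ktype, group in grouped.items()]

items = [
    {"knowledge_type": "fact", "content": "The store opens at 9am.", "confidence": 0.8},
    {"knowledge_type": "fact", "content": "The store closes at 6pm.", "confidence": 0.9},
    {"knowledge_type": "need", "content": "User wants a refund.", "confidence": 0.7},
    {"knowledge_type": "procedure", "content": "Submit the form, then wait for an email.", "confidence": 0.6},
]
merged = merge_similar_knowledge(items)
for entry in merged:
    print(entry)
```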

Example LLM Prompt for Knowledge Merging:

Python

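prompt = f"""
You are a professional knowledge curation expert. Merge the following knowledge points of type {knowledge_type} into a single, more complete and accurate knowledge point.
### Merge requirements:
1. Keep all important information and avoid information loss
2. Remove duplicated content and consolidate similar statements
3. Improve the accuracy and completeness of the content
4. Keep the logic clear and the structure reasonable
5. Set the merged confidence to the highest value among all knowledge points
### Knowledge points to merge:
{chr(10).join(knowledge_contents)}
### Return JSON in this format:
{{
"knowledge_type": "{knowledge_type}",
"content": "merged knowledge content",
"confidence": highest confidence value,
"keywords": ["merged keyword list"],
"category": "merged category",
"sources": ["all sources"],
"frequency": {len(knowledge_group)}
}}
### Merged result:
"""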

Scenario 3: Knowledge Base Health Check

Regularly checking the knowledge base health ensures quality and reliability by identifying missing, outdated, or conflicting information.

Core Functions (LLM-based):

Completeness Check: Assesses if the KB covers major user query needs.
Timeliness Check: Identifies content that is outdated or needs updating.
Consistency Check: Finds conflicts and contradictions.
Comprehensive Score: Provides a quantitative health score and improvement suggestions.

Example LLM Prompt for Missing Knowledge Check:

JSON

{
  "missing_knowledge": [
    {
      "query": "test query",
      "missing_aspect": "missing knowledge aspect",
      "importance": "importance (high/medium/low)",
      "suggested_content": "suggested knowledge content",
      "category": "knowledge category"
    }
  ],
  "coverage_score": "coverage score (0-1)",
  "completeness_analysis": "completeness analysis"
}

Similar JSON output structures are used for outdated_knowledge and conflicting_knowledge, each with specific criteria and suggested actions.

Scenario 4: Knowledge Base Version Management & Performance Comparison

This involves managing different versions of the knowledge base, supporting regression testing, pre-launch acceptance, and comparing performance to select the optimal version.

Key Modules:

Text Vectorization: Converts text to high-dimensional vectors (e.g., 1024-dim) using an embedding API.

Python
def get_text_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-v4",
        input=text,
        dimensions=1024,
    )
    return response.data[0].embedding

Vector Index Building: Creates FAISS indexes for efficient similarity search.

Python
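# IndexFlatL2 does not support add_with_ids directly, so wrap it in an IndexIDMap
text_index = faiss.IndexIDMap(faiss.IndexFlatL2(1024))
text_index.add_with_ids(vectors, ids)  # vectors: float32 array (n, 1024); ids: int64 array (n,)
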
Version Difference Detection: Identifies added, removed, or modified chunks between versions using ID mapping and set operations.
Vector Retrieval: Retrieves relevant chunks based on query vector similarity using FAISS.
Performance Evaluation: Quantifies retrieval performance (accuracy, response time) against a test set.
Performance Comparison: Compares metrics between two versions to recommend the best.
Regression Testing: Ensures new versions don't break existing functionality by validating against historical test cases.
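
Version difference detection with ID mapping and set operations can be sketched as below. The function name `diff_versions` and the dict-of-chunks layout (chunk ID mapped to chunk text) are assumptions made for illustration.

```python
def diff_versions(old_chunks, new_chunks):
    """Compare two knowledge base versions given as {chunk_id: text} dicts."""
    old_ids, new_ids = set(old_chunks), set(new_chunks)
    return {
        # IDs present only in the new version
        "added": sorted(new_ids - old_ids),
        # IDs present only in the old version
        "removed": sorted(old_ids - new_ids),
        # IDs present in both versions whose text changed
        "modified": sorted(i for i in old_ids & new_ids if old_chunks[i] != new_chunks[i]),
    }

v1 = {"c1": "Opening hours: 9-17", "c2": "Refund window: 7 days", "c3": "Legacy policy"}
v2 = {"c1": "Opening hours: 9-17", "c2": "Refund window: 14 days", "c4": "New loyalty program"}
changes = diff_versions(v1, v2)
print(changes)
```

Only added and modified chunks need to be re-embedded when the index for the new version is built, which keeps version upgrades cheap.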

Precise Radar: Efficient Retrieval

This section focuses on strategies to enhance the quality of retrieved documents.

Query Expansion (MultiQuery)

Concept: Uses an LLM to rewrite a user query into multiple semantically similar variants. This increases the diversity of retrieval and helps capture different ways users might ask the same question.

Example LLM Prompt for Multi-Query Generation:

Python

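prompt = f"""You are an AI assistant responsible for generating search queries from several different perspectives.
Given a user question, generate {num_queries} different but related queries to help retrieve more comprehensive information.
Each query should express the same information need from a different angle.
Original question: {query}
Output exactly {num_queries} queries, one per line, with no numbering or other content:"""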
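
Even with a "no numbering" instruction, model output often contains bullets, numbers, or blank lines, so the response usually needs light cleanup. The helper below is a sketch: `parse_queries` is a hypothetical name, and keeping the original query at the head of the list is a design assumption, not something the prompt itself requires.

```python
import re

def parse_queries(llm_output, original_query, num_queries=3):
    """Turn a one-query-per-line LLM response into a clean, de-duplicated list."""
    queries = [original_query]  # keep the original query as a fallback
    for line in llm_output.splitlines():
        # Strip leading numbering ("1.", "2)") or bullet characters
        cleaned = re.sub(r"^\s*(?:\d+[.)、]|[-*•])\s*", "", line).strip()
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries[: num_queries + 1]

raw = "1. How do refunds work?\n\n- What is the refund policy?\nHow do refunds work?\n"
queries = parse_queries(raw, "refund rules")
print(queries)
```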

Index Expansion (Hybrid Search: BM25 + Vector)

Combines the strengths of discrete (keyword-based) and continuous (vector-based) indexing.

Discrete Index Expansion:

Keyword Extraction: Extracts important keywords (e.g., using TF-IDF, TextRank) to supplement vector retrieval.
Entity Recognition: Identifies named entities (people, locations, organizations) for precise matching.

Hybrid Indexing:

BM25 Retrieval: Excellent for exact matches of specialized terms.
Vector Retrieval: Understands synonyms and semantic relationships.
Fusion:
1. BM25 and vector retrieval run in parallel.
2. Scores from both methods are normalized to [0, 1].
3. Weighted fusion combines the scores: Score_hybrid = α × Score_vector + (1 - α) × Score_BM25.
4. The α parameter balances the weights (e.g., α = 0.5 weights both methods equally).

Hybrid Search Core Logic:

Python

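# BM25 retrieval: score every chunk against the tokenized query
tokenized_query = tokenize_chinese(query)  # e.g., ["客户", "经理", "投诉", "扣", "分"]
bm25_scores = self.bm25.get_scores(tokenized_query)
max_bm25 = max(bm25_scores)
# Guard against division by zero when no query term matches any chunk
bm25_scores_normalized = [s / max_bm25 if max_bm25 > 0 else 0.0 for s in bm25_scores]

# Vector retrieval: returns (doc, distance) pairs; a smaller distance means more similar
vector_results = self.vectorstore.similarity_search_with_score(query, k=len(self.chunks))
max_distance = max(distance for _, distance in vector_results) or 1.0
# Assumption: each doc stores its position in self.chunks under metadata["chunk_index"]
vector_scores = {
    doc.metadata["chunk_index"]: 1 - (distance / max_distance)
    for doc, distance in vector_results
}

# Score fusion: alpha weights the vector score, (1 - alpha) the BM25 score
combined_scores = {}
for idx in range(len(self.chunks)):  # every chunk has a BM25 score
    vector_score = vector_scores.get(idx, 0)
    bm25_score = bm25_scores_normalized[idx]
    combined_scores[idx] = alpha * vector_score + (1 - alpha) * bm25_score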
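
To see the fusion formula in isolation, here is a self-contained sketch with toy numbers. It swaps in min-max normalization (the snippet above divides by the maximum) and assumes the vector store returns distances, so they are negated before normalizing. The function names and scores are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; returns zeros if all scores are identical."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25_scores, vector_distances, alpha=0.5):
    """Score_hybrid = alpha * Score_vector + (1 - alpha) * Score_BM25."""
    bm25_norm = min_max(bm25_scores)
    # Smaller distance means more similar, so negate before normalizing
    vector_norm = min_max([-d for d in vector_distances])
    return [alpha * v + (1 - alpha) * b for v, b in zip(vector_norm, bm25_norm)]

def rank(scores):
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

bm25 = [2.0, 0.0, 1.0]       # doc 0 is the best keyword match
distances = [0.9, 0.1, 0.3]  # doc 1 is the closest vector

for alpha in (0.0, 0.5, 1.0):
    combined = hybrid_scores(bm25, distances, alpha)
    print(alpha, [round(c, 3) for c in combined], rank(combined))
```

With `alpha=0.0` the ranking follows BM25 alone, with `alpha=1.0` it follows the vector scores alone, and at `alpha=0.5` the document that is second-best in both lists rises to the top.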

Rerank Model Usage

Reranking optimizes the initial retrieval results by re-sorting them based on a more granular relevance score, improving the final output's accuracy.

BGE-Rerank:
- Open-source (BAAI/bge-reranker-large), local deployment.
- Transformer-based Cross-Encoder, calculates direct query-document relevance.
- Good for Chinese tasks, data privacy.
- Scores are unnormalized logits (e.g., 3.0-10.0 for high relevance).

BGE-Rerank Example:

Python

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
model.eval()

pairs = [['what is panda?', 'The giant panda is a bear species endemic to China.']]
# Disable gradient tracking for inference
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, max_length=512, return_tensors='pt')
    scores = model(**inputs).logits.view(-1).float()
print(scores) # Example output: tensor([4.9538])
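
Because these scores are raw logits, they are not directly comparable with a 0-1 score from another system. One common option, shown here as a sketch, is to pass each logit through a sigmoid; this keeps the ranking unchanged and only rescales the values. The logits other than 4.9538 are made up for illustration.

```python
import math

def sigmoid(logit):
    """Map an unbounded logit to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-logit))

raw_scores = [4.9538, -2.0, 0.0]  # first value from the example above, rest hypothetical
normalized = [sigmoid(s) for s in raw_scores]
print([round(s, 4) for s in normalized])
```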

Cohere Rerank:
- Commercial API service, cloud-based.
- Proprietary deep learning model, excellent for multi-lingual tasks.
- Easy integration, good for improving Hit Rate and MRR.
- Scores are normalized (e.g., 0-1).

Integration with Reranking:

Stage 1: Coarse-grained Retrieval: Generate multi-query variants, then use hybrid search to retrieve initial_k candidate documents.
Stage 2: Fine-grained Reranking: Pass candidate documents and the original query to the reranker, which sorts and returns the final_k most relevant documents.

Python

from typing import List, Optional

import torch

# Assumption: Document is a LangChain-style object exposing .page_content

class Reranker:
    def __init__(self, model_name="BAAI/bge-reranker-base", cache_dir="./models"):
        # ModelScope download, tokenizer and model loading
        # ...
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

    def rerank(self, query: str, documents: List["Document"], top_k: Optional[int] = None):
        if not documents:
            return []
        # Score every (query, document) pair with the cross-encoder
        pairs = [[query, doc.page_content] for doc in documents]
        with torch.no_grad():
            inputs = self.tokenizer(pairs, padding=True, truncation=True, max_length=512, return_tensors="pt").to(self.device)
            scores = self.model(**inputs).logits.squeeze(-1).cpu().tolist()
        # Sort by score, highest first; top_k=None keeps every document
        scored_docs = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in scored_docs[:top_k]]

def hybrid_multi_query_search_with_rerank(query, hybrid_retriever, reranker, llm, initial_k=10, final_k=4):
    queries = generate_multi_queries(query, llm)
    candidate_docs = []
    for q in queries:
        docs = hybrid_retriever.search(q, k=initial_k)
        candidate_docs.extend(docs)
    candidate_docs = deduplicate(candidate_docs)
    reranked_docs = reranker.rerank(query, candidate_docs, top_k=final_k)
    return reranked_docs
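
The pipeline above calls a `deduplicate` helper that is not shown. One plausible version, sketched here under the assumption that two candidates are duplicates when their text is identical, keeps the first occurrence and preserves order. The `Document` dataclass is a stand-in for the LangChain-style document type.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Stand-in for a LangChain-style document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def deduplicate(docs):
    """Drop documents whose text was already seen, keeping first occurrences in order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

candidates = [
    Document("Refunds take seven days."),
    Document("Shipping is free for members."),
    Document("Refunds take seven days. "),  # same text returned by another query variant
]
unique_docs = deduplicate(candidates)
print([d.page_content for d in unique_docs])
```

Deduplicating before reranking matters because each query variant tends to return overlapping results, and the cross-encoder is the most expensive step in the pipeline.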

Bi-directional Rewriting (Query2Doc, Doc2Query)

Query2Doc: Expands a short user query into a longer, more descriptive document, which can then be used for more effective vector search.
Doc2Query: Generates a set of potential queries for a given document. This creates an inverse index that can improve recall for semantically diverse queries.
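
Both directions can be sketched with the LLM call abstracted behind a function argument. Everything here is illustrative: `query2doc_expand`, `build_doc2query_index`, and the two `fake_*` functions are hypothetical names, and the fake functions return canned text where a real system would call a model.

```python
def query2doc_expand(query, generate_pseudo_doc):
    """Query2Doc: append an LLM-written pseudo-document to the original query."""
    pseudo_doc = generate_pseudo_doc(query)
    return f"{query} {pseudo_doc}"

def build_doc2query_index(documents, generate_queries):
    """Doc2Query: map each generated query back to the documents it was generated from."""
    index = {}
    for doc_id, text in documents.items():
        for generated in generate_queries(text):
            index.setdefault(generated, []).append(doc_id)
    return index

# Hypothetical stand-ins for LLM calls
def fake_pseudo_doc(query):
    return "Annual leave must be requested through the HR portal."

def fake_queries(text):
    return [f"What does this say about {text.split()[0].lower()}?"]

expanded = query2doc_expand("leave policy", fake_pseudo_doc)
inverse_index = build_doc2query_index({"d1": "Refunds take seven days."}, fake_queries)
print(expanded)
print(inverse_index)
```

The expanded query is what gets embedded or tokenized for search, while the inverse index is searched alongside (or instead of) the document text.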

Small-to-Big Indexing Strategy

Concept: Indexes small-scale content (summaries, key sentences, paragraphs) and links them to their corresponding large-scale content (full documents).
Mechanism: Users query the small-scale index for quick relevance matching, then the system retrieves the full context from the linked large-scale content for detailed answer generation. This improves both efficiency and contextual coherence.
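
A minimal sketch of the small-to-big mechanism follows. It indexes individual sentences and returns the parent document of the best match. Word overlap stands in for vector similarity purely to keep the example dependency-free, and the documents and function names are hypothetical.

```python
import re

def build_small_index(documents):
    """Split each document into sentences and remember which document each came from."""
    small_index = []
    for doc_id, text in documents.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                small_index.append((sentence, doc_id))
    return small_index

def overlap(a, b):
    """Number of shared lowercase word tokens (stand-in for embedding similarity)."""
    return len(set(re.findall(r"\w+", a.lower())) & set(re.findall(r"\w+", b.lower())))

def retrieve_parent(query, small_index, documents):
    """Match against small units, then return the linked full document as context."""
    sentence, doc_id = max(small_index, key=lambda item: overlap(query, item[0]))
    return sentence, documents[doc_id]

documents = {
    "d1": "Chamomile tea is calming. It is often taken before sleep.",
    "d2": "Green tea contains caffeine. It can improve alertness.",
}
small_index = build_small_index(documents)
matched, context = retrieve_parent("Which tea contains caffeine?", small_index, documents)
print("matched:", matched)
print("context:", context)
```

The match is made on one precise sentence, but the generator receives the whole parent document, including the follow-up sentence that the query alone would not have retrieved.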

Global View: GraphRAG

GraphRAG is a structured, hierarchical RAG method that uses knowledge graphs to significantly improve query and answer performance, especially for complex information or when deep semantic understanding is needed. It addresses limitations of baseline RAG where information points are scattered or a macro understanding is missing.

GraphRAG Process:

Indexing:

TextUnits: Segment raw text into analyzable units.
Extract Graph: Use LLMs to extract entities, relationships, and claims from TextUnits.
Community Detection & Summarization: Apply hierarchical clustering (e.g., the Leiden algorithm) to the knowledge graph to form communities, then generate summaries for each community level from the bottom up.

Querying: Utilize the structured graph and community summaries to enhance LLM context.

GraphRAG Indexing Data Flow Stages:

Combine TextUnits: Convert documents into TextUnits (text blocks for extraction).
Knowledge Graph Extraction: Analyze TextUnits to extract entities, relationships, and claims using LLMs. Merge similar entities/relations.
Knowledge Graph Enhancement: Understand community structures (hierarchical Leiden) and enhance the graph with embeddings (Node2Vec).
Community Summarization: Generate high-level summaries for each community using LLMs.
Document Processing: Create a "documents" table for the knowledge model.
Network Visualization: Perform UMAP dimensionality reduction for 2D visualization of the knowledge graph.
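
The extraction and community stages can be illustrated on a toy scale. This sketch is not how GraphRAG itself is implemented: entity merging is reduced to case and whitespace normalization, and plain connected components are swapped in for hierarchical Leiden clustering to avoid third-party dependencies. The triples are invented.

```python
from collections import defaultdict

def build_graph(triples):
    """Build an undirected adjacency map, merging entities that differ only in case or spacing."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        s, t = source.strip().lower(), target.strip().lower()
        graph[s].add(t)
        graph[t].add(s)
    return graph

def connected_components(graph):
    """Group entities into components (a crude stand-in for community detection)."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

# Hypothetical (entity, relation, entity) triples as an LLM might extract them
triples = [
    ("Alice", "works_at", "Acme"),
    ("alice ", "knows", "Bob"),
    ("Carol", "works_at", "Globex"),
]
communities = connected_components(build_graph(triples))
print(communities)
```

Each resulting group would then be summarized by an LLM to produce the community reports used at query time.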

GraphRAG Query Modes:

Global Query:

Purpose: Answers overall questions about the corpus (e.g., "What are the main themes?").
Mechanism: Uses a Map-Reduce approach on community reports from specified hierarchical levels. It aggregates and synthesizes information from various summaries.
Characteristics: Resource-intensive, but effective for questions requiring holistic understanding.

Local Query:

Purpose: Answers specific questions (e.g., "What are the therapeutic properties of chamomile?").
Mechanism: Identifies semantically related entities from the user query, then extracts directly related content from the KG (TextUnits, community reports, entities, relationships, covariates). This filtered and re-ranked data forms the context for the LLM.
Characteristics: Combines structured KG data with raw document text for granular insights.
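
The Map-Reduce shape of a global query can be sketched as follows. This is a simplified illustration rather than GraphRAG's actual code: `global_query`, `toy_map`, and `toy_reduce` are hypothetical names, and the toy functions use keyword counting and string joining where the real system would make LLM calls.

```python
def global_query(question, community_reports, map_fn, reduce_fn, top_n=2):
    """Map: score a partial answer per report. Reduce: combine the best partial answers."""
    partials = []
    for report in community_reports:
        score, answer = map_fn(question, report)
        if score > 0:  # discard reports that contribute nothing
            partials.append((score, answer))
    partials.sort(key=lambda p: p[0], reverse=True)
    return reduce_fn(question, [answer for _, answer in partials[:top_n]])

def toy_map(question, report):
    """Stand-in for an LLM call that rates how useful a report is for the question."""
    terms = set(question.lower().split())
    score = sum(1 for word in report.lower().split() if word in terms)
    return score, report

def toy_reduce(question, answers):
    """Stand-in for an LLM call that synthesizes the final answer."""
    return " | ".join(answers)

reports = [
    "themes include trade routes",
    "a report about weather",
    "main themes are war and trade",
]
result = global_query("main themes", reports, toy_map, toy_reduce)
print(result)
```

Every community report at the chosen level passes through the map step, which is why global queries cost more than local ones.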

GraphRAG Setup and Execution (Example Commands):

Shell

git clone https://github.com/microsoft/graphrag.git
cd graphrag
pip install -e .
graphrag init --root .
# Configure settings.yaml and .env (API key)
# Place documents in ./input
graphrag index --root . # Long-running process
graphrag query --root . --method global --query "What are the main themes of the documents?"
# Older releases use the module form with a positional query string
python -m graphrag.query --root ./cases --method local "Who did Guan Yu defeat in battles?"

Key Takeaways

Knowledge Base Quality is Paramount: Proactive question generation, conversational knowledge precipitation, and regular health checks are crucial for a robust RAG system.
Hybrid Retrieval and Reranking: Combining BM25 and vector search with intelligent rerankers (like BGE-Rerank or Cohere Rerank) significantly improves recall and precision by balancing keyword matching and semantic understanding.
GraphRAG for Complex Queries: Knowledge graphs provide a structured, hierarchical understanding of information, enabling RAG systems to answer complex, multi-hop, or high-level summarization queries that traditional vector search struggles with.
Version Control and Performance Evaluation: Implement version management with performance evaluation and regression testing to ensure continuous improvement and stability of the RAG system.
