02 RAG Technology and Applications

RAG (Retrieval-Augmented Generation) Technology and Applications

RAG (Retrieval-Augmented Generation) is a technology that combines information retrieval and text generation. It enhances the timeliness and accuracy of generated results by retrieving external knowledge in real time and feeding it into large language models (LLMs) as contextual input.

Large Model Application Development Models

There are three primary development models for large language model applications:

Prompt Engineering: Focuses on guiding LLMs to generate desired outputs by optimizing user prompts.
RAG (Retrieval-Augmented Generation): Addresses the issue of LLMs lacking real-time or domain-specific knowledge.
Fine-tuning: Involves training an LLM on a specific dataset to better adapt it to a particular task or domain.

When LLMs produce erroneous responses, RAG is primarily used to address the issue of "lack of background knowledge."

Advantages of RAG

Addressing the issue of knowledge timeliness: Since LLM training data is typically static, RAG can retrieve information from external knowledge bases to update information in real time.
Reducing model hallucinations: Incorporating external knowledge reduces the likelihood of the model generating false or inaccurate content.
Improving the quality of domain-specific answers: RAG can integrate domain-specific knowledge bases to generate answers with greater professional depth.

Core Principles and Process of RAG

The RAG process is generally divided into three main stages:

Step 1: Data Preprocessing (Indexing)

Knowledge Base Construction: Collect and organize data from multiple sources, such as documents, web pages, and databases.
Document Chunking: Divide documents into appropriately sized chunks to balance semantic integrity and retrieval efficiency.
Vectorization: Use an embedding model to convert text chunks into vectors and store them in a vector database.
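
The chunking step can be illustrated with a plain sliding window over characters. This is a minimal sketch rather than a production splitter: the function name chunk_text and the sizes used here are assumptions made for the example, and real systems usually split on paragraph or sentence boundaries before falling back to fixed windows.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into character windows of chunk_size that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "RAG retrieves external knowledge and feeds it to the model. " * 10
pieces = chunk_text(sample, chunk_size=200, overlap=50)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut by a window boundary still appears whole in at least one chunk.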

Step 2: Retrieval

Query Processing: Convert the user’s query into a vector.
Similarity Retrieval: Perform similarity retrieval in the vector database to identify the most relevant text chunks.
Re-ranking: Rank the retrieved results by relevance and select the most relevant chunks as input for the generation phase.
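
The similarity retrieval step can be sketched with toy vectors. The 3-dimensional "embeddings" below are made up for illustration, and cosine_similarity and top_k are hypothetical helper names; a real system would obtain vectors from an embedding model and delegate the search to a vector database. Re-ranking with a separate model is not shown.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: List[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for embedded chunks and an embedded query.
chunk_vectors = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vector = [0.9, 0.1, 0.0]
results = top_k(query_vector, chunk_vectors, k=2)
print(results)
```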

Step 3: Generation

Context Assembly: Combine the retrieved text fragments with the user’s query to form an enhanced contextual input.
Answer Generation: The large language model generates the final answer based on the enhanced context.
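
Context assembly amounts to string construction. The sketch below is one possible layout, not a fixed standard: the function name build_prompt, the wording of the instruction, and the character budget max_chars are all assumptions for illustration.

```python
from typing import List


def build_prompt(question: str, chunks: List[str], max_chars: int = 2000) -> str:
    """Number the retrieved chunks and place them before the user's question."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # stay within a rough context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How does RAG reduce hallucinations?",
    ["RAG feeds retrieved passages to the model as context.",
     "Grounding answers in retrieved text lowers the chance of fabricated content."],
)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which passage supports its answer.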

NativeRAG

NativeRAG refers to the three core steps of RAG: Indexing, Retrieval, and Generation. Although the concept is simple, building it in practice involves substantial work: deciding how to store knowledge effectively, how to find useful information within a large volume of data, and how to combine the user's query with the retrieved knowledge to generate a useful answer.

Embedding Model Selection

Selecting the appropriate embedding model is critical to the performance of a RAG system. Hugging Face’s MTEB leaderboard provides comparisons of over 100 text embedding models across more than 1,000 languages.

Common Embedding Model Categories and Examples:

General-Purpose Text Embedding Models

BGE-M3 (BAAI): Supports over 100 languages, with an input length of up to 8,192 tokens. It integrates dense, sparse, and multi-vector retrieval, making it suitable for cross-language long-document retrieval and high-precision RAG applications.
text-embedding-3-large (OpenAI): With a vector dimension of 3072, it excels at capturing the semantics of long texts and performs exceptionally well in English.
Jina-embeddings-v2-small (Jina AI): Only 35M parameters; supports real-time inference (response time < 50 ms) and is suitable for lightweight deployment.

Chinese Embedding Models

xiaobu-embedding-v2: Optimized for Chinese semantics, with strong semantic understanding; suitable for Chinese text classification and semantic retrieval.
M3E-Base: A lightweight model optimized for Chinese; suitable for on-premises private deployment and for search tasks in the Chinese legal and medical fields.
stella-mrl-large-zh-v3.5-1792: Handles large-scale Chinese data well and captures subtle semantic relationships; suitable for advanced Chinese semantic analysis and natural language processing tasks.

Instruction-Driven and Complex Task Models

gte-Qwen2-7B-instruct (Alibaba): Fine-tuned from the Qwen large model; supports retrieval across code and text, and is suitable for complex instruction-driven tasks and intelligent question-answering systems.
E5-mistral-7B (Microsoft): Based on the Mistral architecture; excels at zero-shot tasks and is suitable for complex systems requiring dynamic adjustment of semantic density.

Code example: Using BGE-M3

Python

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

embeddings_1 = model.encode(sentences_1, 
                            batch_size=12, 
                            max_length=8192 # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                           )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']

similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# Example output:
# [[0.626  0.3477]
#  [0.3499 0.678 ]]

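The line similarity = embeddings_1 @ embeddings_2.T computes the similarity matrix between the two sets of sentence embeddings with a single matrix multiplication. Because BGE-M3 returns L2-normalized dense vectors by default, each dot product is also the cosine similarity of the corresponding sentence pair.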
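
To read the resulting matrix: entry (i, j) is the score between sentences_1[i] and sentences_2[j], so the best match for each query is the column with the largest value in its row. The short sketch below uses the values copied from the example output above as a plain nested list; the variable name best_matches is introduced only for this illustration.

```python
# Scores copied from the example output; rows are queries, columns are passages.
similarity = [
    [0.626, 0.3477],
    [0.3499, 0.678],
]

best_matches = []
for i, row in enumerate(similarity):
    j = max(range(len(row)), key=lambda col: row[col])  # column with the highest score
    best_matches.append((i, j, row[j]))

for query_idx, passage_idx, score in best_matches:
    print(f"query {query_idx} -> passage {passage_idx} (score {score})")
```

Each query scores highest against the passage that actually answers it, which is what the diagonal-dominant matrix shows.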

Code example: Using gte-Qwen2 (SentenceTransformer wrapper)

Python

from sentence_transformers import SentenceTransformer

model_dir = "/root/autodl-tmp/models/iic/gte_Qwen2-1___5B-instruct"  # local model path
model = SentenceTransformer(model_dir, trust_remote_code=True)
model.max_seq_length = 8192

queries = [
    "how much protein should a female eat",
    "summit define",
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
]

query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())
# Example output:
# [[78.49691772460938, 17.04286003112793],
#  [14.924489974975586, 75.37960815429688]]

CASE: DeepSeek + Faiss for Building a Local Knowledge Base Retrieval System

This project builds a RAG-based local knowledge base retrieval system that answers questions about the "Account Manager Evaluation Guidelines" PDF document.

1. RAG Architecture:

Retrieval: Use vector similarity search to retrieve relevant content from the PDF document.
Augmentation: Use the retrieved document fragments as context.
Generation: Generate answers based on the context and the user’s question.

2. Technology Stack Selection:

Vector Database: Faiss (High-Performance Vector Retrieval).
Embedding Model: text-embedding-v1 from Alibaba Cloud DashScope.
Large Language Model: deepseek-v3.
Document Processing: PyPDF2 (PDF text extraction).
Framework: LangChain (Q&A chain).

3. Program Logic Structure:

Step 1: Document Preprocessing

PDF Text Extraction:
- Extract text content page by page.
- Record the page number corresponding to each line of text.
- Handle blank pages and exceptional cases.
Text Segmentation Strategy:
- Use RecursiveCharacterTextSplitter.
- Splitting parameters: chunk_size=1000, chunk_overlap=200.
- Delimiter priority: paragraph → sentence → space → character.
Page number mapping:
- Calculate the page number for each text block based on character position.
- Determine the primary source page for each text block by taking the mode, that is, the page number that occurs most often among the block's characters.
- Establish a mapping relationship between text blocks and page numbers.
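
The page-number mapping described above can be sketched with the standard library alone. This is an illustrative assumption about how the mapping might be done, not the project's actual code: build_char_pages and chunk_page are hypothetical names, and the sketch tracks one page number per character, then takes the mode over the characters covered by a chunk.

```python
from collections import Counter
from typing import List


def build_char_pages(page_texts: List[str]) -> List[int]:
    """Return one page number per character of the concatenated page texts."""
    char_pages: List[int] = []
    for page_number, page_text in enumerate(page_texts, start=1):
        char_pages.extend([page_number] * len(page_text))
    return char_pages


def chunk_page(full_text: str, chunk: str, char_pages: List[int], search_from: int = 0) -> int:
    """Page holding most of the chunk's characters (the mode); 0 if the chunk cannot be located."""
    start = full_text.find(chunk, search_from)
    pages = char_pages[start:start + len(chunk)] if start != -1 else []
    if not pages:
        return 0
    return Counter(pages).most_common(1)[0][0]


page_texts = ["Page one text. ", "Page two has more text in it. "]
full_text = "".join(page_texts)
char_pages = build_char_pages(page_texts)

# This chunk starts on page 1 but most of its characters lie on page 2.
print(chunk_page(full_text, "text. Page two has more", char_pages))
```

Using the mode rather than the first character's page gives a more representative answer for chunks that straddle a page break.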

Python

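import logging
from typing import List, Tuple

from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_community.vectorstores import FAISS

logger = logging.getLogger(__name__)


def extract_text_with_page_numbers(pdf: PdfReader) -> Tuple[str, List[int]]:
    """
    Extract text from a PDF and record the page number of every line.

    Args:
        pdf: an opened PdfReader object.
    Returns:
        text: the extracted text; page_numbers: one page number per line of text.
    """
    text = ""
    page_numbers = []
    for page_number, page in enumerate(pdf.pages, start=1):
        extracted_text = page.extract_text()
        if extracted_text:
            # End each page with a newline so lines from adjacent pages never merge
            # and the line count of `text` stays aligned with `page_numbers`.
            text += extracted_text + "\n"
            # Simplification: every line on this page is tagged with the current page number.
            page_numbers.extend([page_number] * len(extracted_text.split("\n")))
        else:
            logger.warning("No text found on page %d.", page_number)
    return text, page_numbers


def process_text_with_splitter(text: str, page_numbers: List[int], DASHSCOPE_API_KEY: str) -> FAISS:
    """
    Split the text into chunks and build a vector store.

    Args:
        text: the extracted text; page_numbers: one page number per line of text.
    Returns:
        knowledgeBase: a FAISS-backed vector store.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ".", " ", ""],
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    chunks = text_splitter.split_text(text)
    print(f"Text split into {len(chunks)} chunks.")

    embeddings = DashScopeEmbeddings(
        model="text-embedding-v1",
        dashscope_api_key=DASHSCOPE_API_KEY,
    )

    # Pairing chunk i with page_numbers[i] would be wrong: page_numbers is indexed
    # by line, not by chunk. Simplified mapping used here: locate where each chunk
    # starts in the original text, count the newlines before it to get a line index,
    # and look up that line's page. This is approximate; a more precise mapping
    # would track both the start and end position of every chunk.
    metadatas = []
    chunk_start_idx = 0
    for chunk in chunks:
        # Chunks overlap, so resume the search at the previous chunk's start, not its end.
        found = text.find(chunk, chunk_start_idx)
        if found != -1:
            chunk_start_idx = found
        line_idx = text.count("\n", 0, chunk_start_idx)
        page = page_numbers[min(line_idx, len(page_numbers) - 1)] if page_numbers else None
        metadatas.append({"source_page": page})

    # NOTE: the source listing is cut off inside the loop above. The loop body and
    # the two lines below are an assumed completion, not the original code.
    knowledgeBase = FAISS.from_texts(chunks, embeddings, metadatas=metadatas)
    return knowledgeBase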

Step 2: Knowledge Base Construction

Use the DashScope embedding model to generate vectors.
Store the vectors in a Faiss index structure.
Data Persistence: Save the Faiss index file (.faiss), metadata (.pkl), and page number mappings.
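
The page-number mapping is plain Python data, so it can be persisted next to the index with the standard pickle module. The sketch below covers only that part: the file name page_info.pkl, the directory name vector_db, and the helper names are assumptions for illustration, and saving the Faiss index itself is left to the vector store library.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict


def save_page_info(page_info: Dict[str, int], directory: Path) -> Path:
    """Write the chunk-to-page mapping to <directory>/page_info.pkl."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "page_info.pkl"
    with path.open("wb") as f:
        pickle.dump(page_info, f)
    return path


def load_page_info(directory: Path) -> Dict[str, int]:
    """Read the mapping back. Only unpickle files that you created yourself."""
    with (directory / "page_info.pkl").open("rb") as f:
        return pickle.load(f)


# Round trip in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    store_dir = Path(tmp) / "vector_db"
    save_page_info({"chunk text A": 1, "chunk text B": 3}, store_dir)
    restored = load_page_info(store_dir)

print(restored)
```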

Step 3: Q&A Query

Similarity Retrieval: Convert the user’s question into a vector, search Faiss for the most similar document blocks, and return the top-K relevant documents.
QA Chain Processing: Use LangChain’s load_qa_chain with the stuff strategy to combine documents, then send the combined context and the question to the LLM.
Answer Generation and Display: After the LLM generates the answer, display the result and record the source page number.

Python

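from langchain.chains.question_answering import load_qa_chain
from langchain_community.llms import Tongyi  # or another LLM wrapper
# from langchain_community.callbacks.manager import get_openai_callback  # only useful with OpenAI models


def answer_question(knowledgeBase, query: str, DASHSCOPE_API_KEY: str) -> str:
    """
    Retrieve the most similar chunks, run the "stuff" QA chain, and print the answer
    together with its source pages. Assumes knowledgeBase is the FAISS store built earlier.
    """
    llm = Tongyi(model_name="deepseek-v3", dashscope_api_key=DASHSCOPE_API_KEY)

    docs = knowledgeBase.similarity_search(query)  # run the similarity search
    chain = load_qa_chain(llm, chain_type="stuff")  # load the QA chain
    input_data = {"input_documents": docs, "question": query}

    # With an OpenAI model, this call can be wrapped in
    # `with get_openai_callback() as cost:` to report token usage and cost.
    response = chain.invoke(input=input_data)

    print(response["output_text"])
    print("Sources:")
    unique_pages = set()
    for doc in docs:
        # Page information is expected in doc.metadata
        source_page = doc.metadata.get("source_page", "unknown")
        if source_page not in unique_pages:
            unique_pages.add(source_page)
            print(f"Chunk page: {source_page}")
    return response["output_text"]


# Example questions (translated from the original test queries):
# answer_question(knowledgeBase, "An account manager received a complaint. How many points are deducted per complaint?", DASHSCOPE_API_KEY)
# answer_question(knowledgeBase, "When is the annual application period for account manager evaluation and appointment?", DASHSCOPE_API_KEY)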

Page Number Mapping Issue Explanation: The page number mapping in the original implementation, knowledgeBase.page_info = {chunk: page_numbers[i] for i, chunk in enumerate(chunks)}, is unreliable. The chunks come from text segmentation, while page_numbers has one entry per line of extracted text, so index i in one list does not correspond to index i in the other. A more accurate mapping determines the page from the starting character position of each chunk within the original text. In LangChain, source information for each Document (here, each chunk) is typically stored in its metadata.

Question-Answer Chains (chain_type) in LangChain

LangChain's Q&A chains provide four chain_type strategies for processing retrieved documents:

stuff (Stuffing):

- Principle: Feed all documents to the LLM directly in a single prompt.
- Applicable Scenarios: Documents are split into small chunks, only a limited number of documents are retrieved at a time, and the total token count stays within the LLM’s context window.
- Advantages: Requires the fewest LLM calls, is highly efficient, and preserves full contextual continuity between documents.
- Limitations: Prone to exceeding the LLM’s context window limit.

map_reduce:

- Principle: Send a separate prompt for each document block to obtain a partial response or summary, then merge (reduce) all partial results to produce the final answer.
- Applicable Scenarios: A large number of documents, where each block can be processed without relying on the others, or where independent summaries are required.
- Advantages: Each document can be processed concurrently, and the context window limit is not exceeded.
- Limitations: No direct context is shared between documents; subtle details may be lost during the merging phase; multiple LLM calls incur high computational costs.

refine:

- Principle: Generate an initial result by prompting with the first document block, then combine that result with the next document to form a new prompt, iteratively “refining” the final answer.
- Applicable Scenarios: Answers need to be accumulated and refined gradually, or there is a strong sequential dependency between documents.
- Advantages: Partially preserves context; token usage remains within a reasonable range.
- Limitations: Sequential processing is relatively slow, and the result may be influenced by biases in early documents.

map_rerank:

- Principle: Send a prompt for each document block and ask the LLM to score its own result, then return the result from the best-scoring document.
- Applicable Scenarios: Selecting the "best" answer from multiple documents when answer quality requirements are high.
- Advantages: Effectively identifies the most relevant documents and provides confidence scores.
- Limitations: Involves a large number of LLM calls; each document is processed independently, resulting in the highest cost.
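
The control flow of the four strategies can be compared in a few lines of plain Python. This is a conceptual sketch, not LangChain's implementation: the llm argument is a hypothetical callable that maps a prompt string to a reply string, the prompt wording is invented, and the fake model in the demo only counts calls.

```python
import re
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, reply text out


def stuff(llm: LLM, docs: List[str], question: str) -> str:
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")  # a single call


def map_reduce(llm: LLM, docs: List[str], question: str) -> str:
    partials = [llm(f"Context:\n{d}\n\nQuestion: {question}") for d in docs]  # map
    merged = "\n".join(partials)
    return llm(f"Combine these partial answers:\n{merged}\n\nQuestion: {question}")  # reduce


def refine(llm: LLM, docs: List[str], question: str) -> str:
    if not docs:
        return ""
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for d in docs[1:]:  # each step sees the previous answer plus one new document
        answer = llm(f"Existing answer:\n{answer}\n\nNew context:\n{d}\n\nRefine the answer to: {question}")
    return answer


def map_rerank(llm: LLM, docs: List[str], question: str) -> str:
    best_answer, best_score = "", -1.0
    for d in docs:
        reply = llm(f"Context:\n{d}\n\nQuestion: {question}\nEnd with a line 'Score: <0-100>'.")
        match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
        score = float(match.group(1)) if match else 0.0
        answer = reply[:match.start()].strip() if match else reply.strip()
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer


def count_calls(strategy, docs: List[str], question: str):
    """Run a strategy against a fake LLM and report how many calls it made."""
    calls = []

    def fake_llm(prompt: str) -> str:
        calls.append(prompt)
        return f"draft {len(calls)}\nScore: {10 * len(calls)}"

    result = strategy(fake_llm, docs, question)
    return len(calls), result


docs = ["chunk A", "chunk B", "chunk C"]
for strategy in (stuff, map_reduce, refine, map_rerank):
    n_calls, _ = count_calls(strategy, docs, "What is covered?")
    print(strategy.__name__, n_calls)
```

With N documents, stuff makes one call, refine and map_rerank make N calls, and map_reduce makes N + 1, which is the cost difference the lists above describe.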

The Significance of RAG Amid LLM’s Ability to Handle Infinite Context

Even if LLMs can process “unlimited context” in the future, RAG remains significant:

Efficiency and Cost: Processing extremely long contexts consumes significant computational resources and increases response times for LLMs. RAG reduces input length by retrieving relevant snippets, thereby lowering costs.
Knowledge Updates: LLM knowledge is limited to training data and cannot be updated in real time. RAG can connect to external knowledge bases, enhancing timeliness.
Explainability: RAG’s retrieval process is transparent, allowing users to view sources and build trust. In contrast, the generation process of LLMs is difficult to trace.
Customization: RAG allows for the customization of retrieval systems for specific domains, providing more precise results, whereas the general-purpose nature of LLMs may not meet specific needs.
Data Privacy: RAG allows for retrieval from local or private data sources, avoiding the upload of sensitive data to the cloud, making it suitable for scenarios with high privacy requirements.

Query Rewriting

The core of RAG lies in the “retrieval-generation” process; if the “retrieval” step goes awry, the quality of the subsequent “generation” will also decline. User queries are often colloquial, context-dependent, and vague, whereas knowledge base texts are typically declarative and objective. Therefore, query rewriting acts as a “translator,” converting users’ colloquial queries into formal, precise retrieval statements.

Carefully designed prompts guide the LLM to complete the query rewriting task.

1. Context-Dependent Query Rewriting

Used to rewrite vague queries that depend on preceding dialogue into independent, complete queries.

Python

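instruction = """
You are an intelligent query optimization assistant. Analyze the user's current question together with the preceding conversation history, and decide whether the current question depends on the context.
If it does, rewrite it as a standalone, complete question that contains all the necessary context.
If it does not, return the original question unchanged.
"""

# Example inputs
conversation_history = """User: I'd like to know about the newest attractions at Shanghai Disneyland.
AI: Shanghai Disneyland recently opened the Zootopia themed land..."""
current_query = "Are there any other facilities?"

prompt = f"""
### Instruction ###
{instruction}
### Conversation history ###
{conversation_history}
### Current question ###
{current_query}
### Rewritten question ###
"""
# Example rewrite: Besides the Zootopia Police Department, Officer Judy's training camp, and Nick Wilde's ice cream shop, are there any other facilities in the Zootopia land?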
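
In an application the conversation history usually arrives as a list of turns rather than a ready-made string. The helper below is a hypothetical sketch of how such a prompt might be assembled: the names format_history and build_rewrite_prompt, the (role, text) turn format, and the max_turns cutoff are all assumptions for illustration.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (role, text), e.g. ("User", "...")


def format_history(turns: List[Turn], max_turns: int = 6) -> str:
    """Render the most recent turns as 'Role: text' lines."""
    recent = turns[-max_turns:]
    return "\n".join(f"{role}: {text}" for role, text in recent)


def build_rewrite_prompt(instruction: str, turns: List[Turn], current_query: str) -> str:
    """Assemble the instruction, history, and current question into one prompt."""
    return (
        f"### Instruction ###\n{instruction.strip()}\n"
        f"### Conversation history ###\n{format_history(turns)}\n"
        f"### Current question ###\n{current_query}\n"
        f"### Rewritten question ###\n"
    )


turns = [
    ("User", "I'd like to know about the newest attractions at Shanghai Disneyland."),
    ("AI", "Shanghai Disneyland recently opened the Zootopia themed land..."),
]
rewrite_prompt = build_rewrite_prompt(
    "Rewrite the current question as a standalone question.",
    turns,
    "Are there any other facilities?",
)
print(rewrite_prompt)
```

Capping the number of turns keeps the rewriting prompt short, since only recent context is normally needed to resolve a follow-up question.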

2. Comparative Query Rewriting

Used to identify objects requiring comparison within a question and rewrite it into a clearer, comparative query.

Python

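instruction = """
You are a query analysis expert. Analyze the user's input and the related conversation context, and identify the objects in the question that need to be compared.
Then rewrite the original question as a clearer comparative query that is better suited to knowledge base retrieval.
"""

# Example inputs
context_info = """User: I'd like to know about the newest attractions at Shanghai Disneyland.
AI: Shanghai Disneyland recently opened the Zootopia themed land, and there is also a Spider-Man themed land."""
query = "Which one takes longer to visit and is more fun?"

prompt = f"""
### Instruction ###
{instruction}
### Conversation history / context ###
{context_info}
### Original question ###
{query}
### Rewritten query ###
"""
# Example rewrite: Which takes longer to visit and is more fun: the Zootopia themed land or the Spider-Man themed land at Shanghai Disneyland?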

3. Ambiguous Reference Query Rewriting

Used to eliminate ambiguous pronouns such as “all,” “it,” and “this,” replacing them with specific object names.

Python

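instruction = """
You are an expert at removing linguistic ambiguity. Analyze the user's current question and the conversation history, and work out which objects vague referring words such as "both", "it", and "this" point to.
Then replace those words with the explicit object names to produce a clear, unambiguous question.
"""

# Example inputs
conversation_history = """User: I'd like to know about the fireworks shows at Shanghai Disneyland and Hong Kong Disneyland.
AI: Sure, both Shanghai Disneyland and Hong Kong Disneyland have spectacular fireworks shows."""
current_query = "When do they both start?"

prompt = f"""
### Instruction ###
{instruction}
### Conversation history ###
{conversation_history}
### Current question ###
{current_query}
### Rewritten question ###
"""
# Example rewrite: When do the fireworks shows at Shanghai Disneyland and Hong Kong Disneyland start?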

4. Multi-Intent Query Rewriting - Query Decomposition

Used to break down complex user questions into multiple independent, simple questions that can be answered individually.

Python

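instruction = """
You are a task decomposition bot. Break the user's complex question into several independent, simple questions that can each be answered on their own. Output a JSON array.
"""

# Example input
query = "How much are tickets? Do I need to book in advance? How is parking charged?"

prompt = f"""
### Instruction ###
{instruction}
### Original question ###
{query}
### Decomposed question list ###
Output a JSON array, for example: ["Question 1", "Question 2", "Question 3"]
"""
# Example result (after parsing): ['How much are tickets?', 'Do I need to book in advance?', 'How is parking charged?']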
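
Model output does not always arrive as clean JSON, so the decomposition result is worth parsing defensively. The function below is a minimal sketch under that assumption: parse_sub_questions is a hypothetical name, and falling back to the original query when parsing fails is one possible policy, not the only one.

```python
import json
import re
from typing import List


def parse_sub_questions(llm_output: str, original_query: str) -> List[str]:
    """Extract a JSON array of questions from the reply; fall back to the original query."""
    match = re.search(r"\[.*\]", llm_output, flags=re.DOTALL)  # first '[' to last ']'
    if match:
        try:
            items = json.loads(match.group(0))
        except json.JSONDecodeError:
            items = None
        if isinstance(items, list):
            questions = [str(q).strip() for q in items if str(q).strip()]
            if questions:
                return questions
    return [original_query]


reply = 'Here is the list:\n["How much are tickets?", "Do I need to book in advance?"]'
sub_questions = parse_sub_questions(reply, "How much are tickets? Do I need to book in advance?")
print(sub_questions)
```

Each returned sub-question can then be sent through retrieval separately and the answers combined afterwards.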

5. Rhetorical Query Rewriting

Used to identify user rhetorical questions or emotionally charged statements and rewrite them into neutral, objective questions that can be directly used for retrieval.

Python

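instruction = """
You are a master of understanding communication. Analyze the user's rhetorical question or emotionally charged statement, and identify the real intent and question behind it.
Then rewrite it as a neutral, objective question that can be used directly for knowledge base retrieval.
"""

# Example inputs
conversation_history = """User: Hello, I'd like to book tickets to Shanghai Disneyland for next Saturday.
AI: Tickets for next Saturday are already sold out."""
current_query = "Don't tell me these also have to be booked a month in advance?"

prompt = f"""
### Instruction ###
{instruction}
### Conversation history ###
{conversation_history}
### Current question ###
{current_query}
### Rewritten question ###
"""
# Example rewrite: Do Disneyland tickets need to be booked a month in advance?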

6. Automatic Query Type Identification and Rewriting

Enables an LLM to automatically identify the query type and perform the appropriate rewriting based on a single prompt.

Python

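instruction = """
You are an intelligent query analysis expert. Analyze the user's query and identify which of the following types it belongs to:
1. Context-dependent: contains words such as "also" or "other" that require context to understand
2. Comparative: contains comparison words such as "which", "compare", "more", or "which is better"
3. Ambiguous reference: contains referring words such as "it", "they", "both", or "this"
4. Multi-intent: contains several independent questions separated by commas or question marks
5. Rhetorical: has a rhetorical tone, with phrases such as "surely not" or "don't tell me"
Note: if a query is both multi-intent and ambiguous reference, multi-intent takes priority over ambiguous reference.
Return JSON in this format:
{
"query_type": "the query type",
"rewritten_query": "the rewritten query",
"confidence": "confidence (0-1)"
}
"""

# Example inputs
conversation_history = """User: Hello, I'd like to book tickets to Shanghai Disneyland for next Saturday.
AI: Tickets for next Saturday are already sold out."""
context_info = ""
query = "Don't tell me these also have to be booked a month in advance?"

prompt = f"""
### Instruction ###
{instruction}
### Conversation history ###
{conversation_history}
### Context information ###
{context_info}
### Original query ###
{query}
### Analysis result ###
"""
# Example outputs are in the original material; they show each query type with its rewritten query and confidence.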
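
Because the prompt asks for a confidence value, the caller can decide whether to trust the rewrite. The sketch below shows one way to do that: choose_query is a hypothetical name, and the 0.6 threshold is an arbitrary value chosen for illustration. It accepts the confidence as either a number or a numeric string, since the prompt's format example quotes it.

```python
import json


def choose_query(llm_output: str, original_query: str, min_confidence: float = 0.6) -> str:
    """Use the rewritten query only if the reply parses and its confidence is high enough."""
    try:
        result = json.loads(llm_output)
    except json.JSONDecodeError:
        return original_query
    if not isinstance(result, dict):
        return original_query

    rewritten = str(result.get("rewritten_query", "")).strip()
    try:
        confidence = float(result.get("confidence", 0))
    except (TypeError, ValueError):
        confidence = 0.0  # treat a missing or non-numeric confidence as no confidence

    if rewritten and confidence >= min_confidence:
        return rewritten
    return original_query


reply = (
    '{"query_type": "rhetorical", '
    '"rewritten_query": "Do Disneyland tickets need to be booked a month in advance?", '
    '"confidence": "0.9"}'
)
final_query = choose_query(reply, "Don't tell me these also have to be booked a month in advance?")
print(final_query)
```

Falling back to the original query keeps retrieval working even when the rewriting step fails or is unsure.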

Query + Online Search

When the RAG system needs to process time-sensitive or real-time information, it must be combined with online search.

1. Determine Whether a Query Requires Web Search

Type | Keyword Features | Example Query | Reason

Timeliness | Latest, today, now, real-time, current | Is Shanghai Disneyland open today? | Requires the latest information for the current time

Price Information | How much, price, cost, ticket price | How much are tickets for next Saturday? | Prices frequently change

Operating Information | Operating Hours, Opening Time, Closing Time, Open or Closed | Is Disneyland open right now? | Operating status may change

Event Information | Events, Shows, Performances, Festivals, Celebrations | What special events are happening soon? | Event information is time-sensitive

Weather Information | Weather, Rain, Temperature | What will the weather be like at Disney tomorrow? | Weather information requires real-time updates

Transportation Information | How to get there, transportation, subway, bus | How to get there from Pudong Airport