05 Hands-on Project: Enterprise Knowledge Base
Summary
This material explores a winning RAG system for an enterprise knowledge base challenge, focusing on processing complex annual reports for Q&A. It details the complete RAG pipeline from custom PDF pars
RAG Challenge & Winning Architecture
The RAG Challenge focused on building a Q&A system from 100 random company annual reports (up to 1000 pages each) within 2.5 hours, answering 100 template-based questions with precise answers and page citations. The winning solution, RAG-Challenge-2, excelled in retrieval, generation, and scoring, utilizing o3-mini for inference. Its key features included:
- Custom PDF parsing with Docling (later MinerU).
- Vector search with parent document retrieval.
- LLM reranking for context relevance.
- Chain-of-Thought (CoT) reasoning for structured output.
- Query routing for multi-company comparisons and dynamic knowledge bases.
Basic RAG System Components
A foundational RAG system follows these steps:
- Parsing: Preparing data by converting documents to text and cleaning noise.
- Ingestion: Creating and loading the knowledge base.
- Retrieval: Finding relevant data based on user queries, often via semantic search in a vector database.
- Answering: Sending retrieved data + user prompt to an LLM to generate the final answer.
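The four stages can be sketched end to end in a few lines. This is a toy illustration, not the challenge code: `parse`, `ingest`, `retrieve`, and `answer` are hypothetical names, retrieval uses simple word overlap in place of vector search, and the LLM call is replaced by a function that only assembles the prompt it would send.

```python
import re

def parse(raw_pages):
    """Parsing: normalize whitespace on each page (stand-in for PDF-to-text cleanup)."""
    return [re.sub(r"\s+", " ", page).strip() for page in raw_pages]

def ingest(pages):
    """Ingestion: build a tiny in-memory knowledge base keyed by page number."""
    return [
        {"page": i + 1, "text": text, "words": set(re.findall(r"\w+", text.lower()))}
        for i, text in enumerate(pages)
    ]

def retrieve(kb, query, top_n=1):
    """Retrieval: rank pages by word overlap with the query (stand-in for semantic search)."""
    query_words = set(re.findall(r"\w+", query.lower()))
    ranked = sorted(kb, key=lambda entry: len(entry["words"] & query_words), reverse=True)
    return ranked[:top_n]

def answer(query, retrieved):
    """Answering: assemble the prompt an LLM would receive (the LLM call itself is omitted)."""
    context = "\n".join(f"[page {e['page']}] {e['text']}" for e in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Made-up sample pages, for illustration only.
raw = ["Revenue   in 2022 was\n120 million USD.", "The CEO is   Jane Doe."]
kb = ingest(parse(raw))
hits = retrieve(kb, "Who is the CEO?")
prompt = answer("Who is the CEO?", hits)
```

Each function maps to one stage, so any of them can be swapped for a stronger implementation without touching the others.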
Parsing and Data Preprocessing
PDF parsing is crucial and challenging due to complex structures (tables, multi-column text, figures, headers/footers).
- Parser Choice: After testing over 20 parsers, Docling was chosen for its quality, though it required custom modifications to retain metadata (JSON output), convert tables to Markdown/HTML, and clean specific syntax errors with regex.
- Table Serialization: Initially considered to improve semantic coherence and LLM understanding of large tables. However, extensive testing revealed that table serialization slightly decreased system effectiveness, so it was ultimately not adopted.
- Table Format for LLMs: Markdown was initially used for tables, but switching to HTML format significantly improved LLM comprehension, especially for complex structures with merged cells and subtables.
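To illustrate why HTML copes better with merged cells, the sketch below renders a small table whose header spans two columns. Markdown tables have no way to express a column span, while HTML states it explicitly. The function name `table_to_html` and the `(text, colspan)` cell representation are assumptions made for this example, not the parser's actual output format, and the figures are made up.

```python
from html import escape

def table_to_html(rows):
    """Render rows of (text, colspan) cells as an HTML table.

    The first row is treated as the header row.
    """
    lines = ["<table>"]
    for row_index, row in enumerate(rows):
        tag = "th" if row_index == 0 else "td"
        cells = []
        for text, colspan in row:
            span = f' colspan="{colspan}"' if colspan > 1 else ""
            cells.append(f"<{tag}{span}>{escape(str(text))}</{tag}>")
        lines.append("  <tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)

rows = [
    [("Metric", 1), ("Revenue (USD m)", 2)],   # merged header over two year columns
    [("", 1), ("2021", 1), ("2022", 1)],
    [("Segment A & B", 1), (100, 1), (120, 1)],
]
html_table = table_to_html(rows)
```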
Content Ingestion Strategies
- Chunking: Instead of using an entire page as a chunk, which dilutes relevance, pages were split into 300-token chunks (approx. 15 sentences). Each chunk stores its ID and parent page number in its metadata, crucial for later retrieval.
- Vectorization: Rather than a single large database, 100 separate Faiss databases were created, one per company. This approach ensures clearer structure and allows direct retrieval from a company-specific database, improving efficiency.
  IndexFlatIP was used for high-precision brute-force search, and text-embedding-3-large (or a suitable alternative) was used for embeddings.
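A minimal sketch of the chunking and per-company indexing described above. To stay dependency-free it approximates tokens by whitespace-separated words and replaces Faiss with a brute-force inner-product search in plain Python, which is the computation a flat inner-product index performs. `chunk_page`, `CompanyIndex`, the company name, and the toy 2-dimensional vectors are hypothetical; real vectors would come from an embedding model.

```python
def chunk_page(page_text, page_number, max_tokens=300):
    """Split one page into chunks of at most max_tokens words, keeping the parent page."""
    words = page_text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        chunks.append({
            "id": f"p{page_number}-c{len(chunks)}",
            "page": page_number,          # parent page, used later for page retrieval
            "text": " ".join(words[start:start + max_tokens]),
        })
    return chunks

class CompanyIndex:
    """Brute-force inner-product search over one company's chunk vectors."""

    def __init__(self):
        self.vectors = []
        self.chunks = []

    def add(self, vector, chunk):
        self.vectors.append(vector)
        self.chunks.append(chunk)

    def search(self, query_vector, top_k=3):
        scores = [sum(q * v for q, v in zip(query_vector, vec)) for vec in self.vectors]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [(scores[i], self.chunks[i]) for i in order]

# One index per company, so a query only ever scans that company's chunks.
indexes = {"Acme Corp": CompanyIndex()}
page_chunks = chunk_page("word " * 650, page_number=7)
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]   # placeholders for real embeddings
for vector, chunk in zip(toy_vectors, page_chunks):
    indexes["Acme Corp"].add(vector, chunk)
best_score, best_chunk = indexes["Acme Corp"].search([0.0, 1.0], top_k=1)[0]
```

Because every chunk carries its parent page number, a hit on a chunk can later be expanded to the full page.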
Advanced Retrieval Techniques
Retrieval is key; "Garbage in, garbage out" applies here.
- Hybrid Search: Combining vector database semantic search with keyword-based BM25 was explored. While theoretically beneficial, in its basic implementation it often reduced retrieval quality in this challenge.
- LLM Reranking: An LLM re-scores the relevance of retrieved chunks/pages. This fine-grained semantic analysis improves context relevance.
  - The LLM formats its output with reasoning and relevance_score (0-1).
  - A weighted average (e.g., vector_weight = 0.3, llm_weight = 0.7) combines the initial vector score with the LLM's reranking score.
  - This is more expensive than pure vector search but more accurate; initial embedding-based filtering is still required for efficiency.

```python
system_prompt_rerank_single_block = """
You are a RAG retrieval reranking expert.
You will receive a query and a retrieved text block. Score the block by its relevance to the query.
Scoring instructions:
1. Reasoning: analyze how the text block relates to the query and briefly explain why.
2. Relevance score (0-1, in steps of 0.1):
0 = completely irrelevant
... (details for scores)
1 = exact match
3. Evaluate objectively based on the content only; make no assumptions.
"""
```
- Parent Page Retrieval: After retrieving the Top N relevant chunks (which act as pointers), the entire corresponding pages are included in the context. This captures secondary but important details that might surround the core answer in a smaller chunk.
- Assembled Retriever Workflow:
  - Vectorize the query.
  - Find the Top 30 relevant chunks.
  - Extract unique parent pages from chunk metadata.
  - Process these pages using an LLM reranker.
  - Adjust page relevance scores.
  - Return the Top 10 highest-scoring pages, formatted with page numbers.
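The retriever workflow, including the weighted score combination, can be sketched as follows. The vector search and the LLM reranker are passed in as plain callables because their real implementations depend on the embedding model and LLM provider. `retrieve_pages`, `vector_search`, and `llm_rerank` are hypothetical names, and taking each page's best chunk score as its vector score is an assumption of this sketch.

```python
def retrieve_pages(query, vector_search, llm_rerank, pages,
                   top_chunks=30, top_pages=10,
                   vector_weight=0.3, llm_weight=0.7):
    """Return the top pages for a query after LLM reranking.

    vector_search(query, k) -> list of (score, chunk) with chunk["page"] set.
    llm_rerank(query, page_text) -> relevance score between 0 and 1.
    pages maps page number -> full page text.
    """
    # Steps 1-2: vectorize the query and fetch the top chunks (inside vector_search).
    hits = vector_search(query, top_chunks)

    # Step 3: collapse chunks to unique parent pages, keeping the best chunk score.
    page_scores = {}
    for score, chunk in hits:
        page = chunk["page"]
        page_scores[page] = max(score, page_scores.get(page, float("-inf")))

    # Steps 4-5: rerank each page with the LLM and blend the two scores.
    results = []
    for page, vector_score in page_scores.items():
        llm_score = llm_rerank(query, pages[page])
        combined = vector_weight * vector_score + llm_weight * llm_score
        results.append({"page": page, "text": pages[page], "score": combined})

    # Step 6: return the highest-scoring pages.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:top_pages]


# Toy stand-ins for the vector database and the LLM reranker.
pages = {1: "Cover page", 2: "Revenue was 120 million", 3: "Board of directors"}

def fake_vector_search(query, k):
    hits = [(0.9, {"page": 1}), (0.8, {"page": 2}), (0.5, {"page": 2}), (0.4, {"page": 3})]
    return hits[:k]

def fake_llm_rerank(query, page_text):
    return 1.0 if "Revenue" in page_text else 0.1

ranked = retrieve_pages("What was the revenue?", fake_vector_search, fake_llm_rerank, pages)
```

In this toy run the vector search ranks page 1 first, but the blended score promotes page 2, which is the kind of correction the reranker is meant to provide.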
Prompt Engineering and Augmentation
Effective prompt management is crucial. Prompts were stored in prompts.py and logically split:
- Core System Instructions: Defines the LLM's role and rules (e.g., AnswerWithRAGContextSharedPrompt.instruction).
- Pydantic Schema for Response Format: Enforces structured output (e.g., JSON) for easier parsing and validation.

```python
from typing import List, Literal, Union

from pydantic import BaseModel, Field


class ComparativeAnswerPrompt:
    class AnswerSchema(BaseModel):
        step_by_step_analysis: str = Field(description="Detailed step-by-step reasoning process...")
        reasoning_summary: str = Field(description="Brief summary of the reasoning process...")
        relevant_pages: List[int] = Field(description="Leave as an empty list.")
        final_answer: Union[str, Literal["N/A"]] = Field(description="The company name must exactly match the one in the question...")
```

- Few-shot Examples: Provides "question-answer" examples to guide the LLM's output style and format.
- Context and Query Templates: Inserts retrieved context and user queries dynamically.

```python
user_prompt = """
Here is the context:
\"\"\"
{context}
\"\"\"

Here is the question:
"{question}"
"""
```
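To show how these pieces fit together, here is a dependency-free sketch that fills the context/query template and then extracts the answer from a structured response. It checks fields with the standard `json` module instead of Pydantic, and `build_user_prompt`, `parse_answer`, and the sample response are hypothetical.

```python
import json

USER_PROMPT = 'Here is the context:\n"""\n{context}\n"""\n\nHere is the question:\n"{question}"\n'

REQUIRED_FIELDS = ("step_by_step_analysis", "reasoning_summary", "relevant_pages", "final_answer")

def build_user_prompt(pages, question):
    """Format retrieved (page number, text) pairs and the question into the user prompt."""
    context = "\n\n".join(f"Page {number}:\n{text}" for number, text in pages)
    return USER_PROMPT.format(context=context, question=question)

def parse_answer(raw_response):
    """Parse the model's JSON output and check that the expected fields are present."""
    data = json.loads(raw_response)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    return data

prompt = build_user_prompt([(12, "Revenue was 120 million USD.")], "What was the revenue?")

# A made-up response in the shape the schema asks for.
sample_response = json.dumps({
    "step_by_step_analysis": "Page 12 states the revenue directly.",
    "reasoning_summary": "Stated on page 12.",
    "relevant_pages": [12],
    "final_answer": "120 million USD",
})
final_answer = parse_answer(sample_response)["final_answer"]
```

Keeping the reasoning and the answer in separate fields means the caller reads `final_answer` directly and never has to parse the explanation.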
Generation with LLMs
High-quality generation relies on several techniques:
- Query Routing to Database: In competitive settings, questions explicitly mentioning company names allow direct routing to the specific company's vector database, reducing search scope and improving efficiency.
- Query Routing to Prompts: For questions with specific answer types (e.g., int, float, bool, string list), the system uses if-else logic to select pre-designed prompt templates tailored to each data type. This simplifies tasks for the LLM, reducing error rates by minimizing rules per request.
- Compound Query Routing: For complex comparative questions (e.g., "Which company has higher revenue?"), the LLM first decomposes the question into multiple independent sub-questions, processes them in parallel, and then synthesizes the final answer from the collected data.
- Chain of Thought (CoT): Encourages the model to "think step by step" before answering. For weaker models, explicit guidance on reasoning steps, goals, and examples is critical to prevent "fake reasoning."
- Structured Outputs (SO): Uses Pydantic schemas to force the model to return standardized JSON formats. This ensures consistency and simplifies post-processing.
- CoT + SO: Combining these provides a dedicated field for detailed reasoning (step_by_step_analysis) and a separate field for the concise final answer (final_answer), allowing direct extraction of the answer without parsing lengthy explanations.
- Instruction Refinement: This involves extensive iteration and debugging to define how the AI should interpret ambiguous requests and handle edge cases (e.g., what alternative job titles count as "CEO", or how to respond if information is not found).
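The routing steps above can be sketched with plain string matching and a lookup table (used here in place of an if-else chain). Everything in this example is illustrative: the company list, the instruction texts, and the function names `route_companies`, `select_instruction`, and `plan_question` are assumptions, and the fixed sub-question template only mimics the decomposition that the source delegates to an LLM.

```python
KNOWN_COMPANIES = ["Acme Corp", "Globex Inc"]          # hypothetical company names

ANSWER_INSTRUCTIONS = {                                 # one short rule set per answer kind
    "number": "Return a single number without units, or N/A.",
    "boolean": "Return true or false.",
    "name": "Return one name exactly as written in the context, or N/A.",
    "names": "Return a list of names, or N/A.",
}

def route_companies(question):
    """Return the known companies mentioned verbatim in the question."""
    return [name for name in KNOWN_COMPANIES if name in question]

def select_instruction(kind):
    """Pick the prompt instruction for an answer kind."""
    if kind not in ANSWER_INSTRUCTIONS:
        raise ValueError(f"unsupported answer kind: {kind}")
    return ANSWER_INSTRUCTIONS[kind]

def plan_question(question, kind, metric=None):
    """Build one retrieval/answering task per company database."""
    companies = route_companies(question)
    if len(companies) <= 1:
        return [
            {"company": c, "question": question, "instruction": select_instruction(kind)}
            for c in companies
        ]
    # Comparative question: one numeric sub-question per company, answered independently.
    return [
        {
            "company": c,
            "question": f"What was the {metric} of {c}?",
            "instruction": select_instruction("number"),
        }
        for c in companies
    ]

single = plan_question("Who is the CEO of Acme Corp?", "name")
compound = plan_question(
    "Which has higher revenue, Acme Corp or Globex Inc?", "name", metric="revenue"
)
```

Each returned task targets a single company database with a single rule set, which keeps every individual LLM request small; the comparison itself would happen in a final step over the collected sub-answers.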
RAG System Tuning and Optimization
- Configuration: All key functionalities are configurable (e.g., use_serialized_tables, parent_document_retrieval, llm_reranking, top_n_retrieval, api_provider, answering_model).
- Validation Set: A manually answered validation set is crucial for objectively measuring performance, identifying error patterns, and refining prompts and hyperparameters. This helps in quantifying improvements and uncovering implicit rules. For example, it confirmed that table serialization did not improve performance.
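```python
from dataclasses import dataclass


@dataclass  # decorator assumed here so that fields can be overridden per run
class RunConfig:
    use_serialized_tables: bool = False
    parent_document_retrieval: bool = False
    llm_reranking: bool = False
    top_n_retrieval: int = 10
    api_provider: str = "dashscope"  # or "openai"
    answering_model: str = "qwen-turbo-latest"
```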
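A validation set makes configuration changes measurable. The sketch below scores candidate configurations against manually answered questions. `Config`, `evaluate`, and the stub `stub_pipeline` are hypothetical, the questions and answers are placeholders, and exact-match accuracy is an assumed metric, not necessarily the challenge's scoring rule.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Config:
    parent_document_retrieval: bool = False
    llm_reranking: bool = False
    top_n_retrieval: int = 10

def evaluate(run_pipeline, config, validation_set):
    """Return exact-match accuracy of run_pipeline(config, question) on the validation set."""
    correct = sum(
        1 for item in validation_set
        if run_pipeline(config, item["question"]) == item["answer"]
    )
    return correct / len(validation_set)

# Stub pipeline: pretends that reranking is needed to answer the second question.
def stub_pipeline(config, question):
    if question == "Q2" and not config.llm_reranking:
        return "N/A"
    return {"Q1": "42", "Q2": "True"}[question]

validation_set = [
    {"question": "Q1", "answer": "42"},
    {"question": "Q2", "answer": "True"},
]
baseline = Config()
candidates = {"baseline": baseline, "rerank": replace(baseline, llm_reranking=True)}
scores = {
    name: evaluate(stub_pipeline, cfg, validation_set)
    for name, cfg in candidates.items()
}
```

Changing one flag at a time against a fixed validation set is what allows a conclusion such as "table serialization did not help" to rest on a measurement.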
Building a Custom RAG System (Case Study)
The material outlines steps to adapt the RAG-Challenge-2 codebase for custom use:
- Run RAG-Challenge-2 Pipeline: python -m src.pipeline
- Simplify Markdown Generation: Modify parsing and merging steps to output simplified per-page Markdown.
- Replace Docling with MinerU:
  - Use the MinerU API for PDF parsing (get_task_id, get_result).
  - An API key is needed, and the PDFs must be uploaded (or a local MinerU setup used instead).

```python
# Example MinerU integration (simplified)
import time

import requests

api_key = 'YOUR_API_KEY'


def get_task_id(file_name):
    """Submit a PDF (referenced by URL) for extraction and return the task ID."""
    url = 'https://mineru.net/api/v4/extract/task'
    headers = {"Authorization": f"Bearer {api_key}", 'Content-Type': 'application/json'}
    data = {'url': f'https://your_storage_url/{file_name}', 'is_ocr': True}
    res = requests.post(url, headers=headers, json=data)
    return res.json()["data"]['task_id']


def get_result(task_id):
    """Poll the task until it finishes; return the result ZIP URL, or None on error."""
    url = f'https://mineru.net/api/v4/extract/task/{task_id}'
    headers = {"Authorization": f"Bearer {api_key}", 'Content-Type': 'application/json'}
    while True:
        res = requests.get(url, headers=headers)
        result = res.json()["data"]
        state = result.get('state')
        if state == 'done':
            print(f"Task done. Download from: {result.get('full_zip_url')}")
            # Logic to download and unzip
            return result.get('full_zip_url')
        elif state in ['pending', 'running']:
            print("Task not complete, waiting...")
            time.sleep(5)  # wait before polling again
        else:
            print(f"Error: {result.get('err_msg')}")
            return None
```
- Modify text_splitter.py: Add functions like split_markdown_reports and split_markdown_file to chunk Markdown files instead of JSON. These functions split by lines, recording start/end line numbers.
- Add AnswerWithRAGContextStringPrompt: Implement a new prompt type for open-ended questions where the answer is a block of text (kind=string), similar to the existing number, boolean, name, and names prompts. This involves defining the AnswerSchema with a final_answer of type str and updating api_requests.py to handle this new kind.
- Build a Streamlit UI: Create a simple frontend where users can input questions, and the backend calls the RAG pipeline to display the answers. This involves streamlit components and integrating the process_questions() logic.
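The line-based Markdown chunking mentioned in the text_splitter.py step might look like the following sketch. It is not the code from `text_splitter.py`: the function name `split_markdown_lines`, the `max_lines` and `overlap` parameters, and the returned field names are assumptions chosen for illustration.

```python
def split_markdown_lines(markdown_text, max_lines=30, overlap=5):
    """Split Markdown into line-based chunks, recording 1-based start/end line numbers."""
    if overlap >= max_lines:
        raise ValueError("overlap must be smaller than max_lines")
    lines = markdown_text.splitlines()
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + max_lines, len(lines))
        chunks.append({
            "lines": (start + 1, end),              # inclusive, 1-based
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break
        start = end - overlap                       # step back to overlap with the previous chunk
    return chunks

sample = "\n".join(f"line {i}" for i in range(1, 71))   # 70 lines
chunks = split_markdown_lines(sample, max_lines=30, overlap=5)
```

Recording line ranges plays the role that page numbers play for PDF-derived chunks: a retrieved chunk can be traced back to its position in the source file.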
Key Takeaways
- Systematic Optimization: Success in RAG relies on systematically optimizing each pipeline component rather than relying on a single "magic solution."
- Data Quality is Paramount: High-quality parsing and data preparation are fundamental, ensuring accurate and structured input for the RAG system.
- Intelligent Retrieval & Reranking: Efficient retrieval, combined with advanced techniques like LLM reranking and parent page retrieval, significantly enhances the relevance of retrieved context.
- Sophisticated Prompt Engineering: Meticulous prompt design, including Chain of Thought, structured outputs, query routing, and instruction refinement, allows even smaller LLMs to achieve high performance and precise answers.
- Iteration and Tuning: Continuous iteration with a validation set and hyperparameter tuning is essential for understanding system behavior and driving incremental improvements across all RAG stages.