LLM EngineeringMay 21, 2026

05 Hands-on Project: Enterprise Knowledge Base

LLMRAGHandsOnChunkingRerankingWeights

RAG Challenge & Winning Architecture

The RAG Challenge focused on building a Q&A system from 100 random company annual reports (up to 1000 pages each) within 2.5 hours, answering 100 template-based questions with precise answers and page citations. The winning solution, RAG-Challenge-2, excelled in retrieval, generation, and scoring, utilizing o3-mini for inference. Its key features included:

  • Custom PDF parsing with Docling (later MinerU).
  • Vector search with parent document retrieval.
  • LLM reranking for context relevance.
  • Chain-of-Thought (CoT) reasoning for structured output.
  • Query routing for multi-company comparisons and dynamic knowledge bases.

Basic RAG System Components

A foundational RAG system follows these steps:

  • Parsing: Preparing data by converting documents to text and cleaning noise.
  • Ingestion: Creating and loading the knowledge base.
  • Retrieval: Finding relevant data based on user queries, often via semantic search in a vector database.
  • Answering: Sending retrieved data + user prompt to an LLM to generate the final answer.

Parsing and Data Preprocessing

PDF parsing is crucial and challenging due to complex structures (tables, multi-column text, figures, headers/footers).

  • Parser Choice: After testing over 20 parsers, Docling was chosen for its quality, though it required custom modifications to retain metadata (JSON output), convert tables to Markdown/HTML, and clean specific syntax errors with regex.
  • Table Serialization: Initially considered to improve semantic coherence and LLM understanding of large tables. However, extensive testing revealed that table serialization slightly decreased system effectiveness, so it was ultimately not adopted.
  • Table Format for LLMs: Markdown was initially used for tables, but switching to HTML format significantly improved LLM comprehension, especially for complex structures with merged cells and subtables.

Content Ingestion Strategies

  • Chunking: Instead of using an entire page as a chunk, which dilutes relevance, pages were split into 300-token chunks (approx. 15 sentences). Each chunk stores its ID and parent page number in its metadata, crucial for later retrieval.
  • Vectorization: Rather than a single large database, 100 separate Faiss databases were created, one per company. This approach ensures clearer structure and allows direct retrieval from a company-specific database, improving efficiency. IndexFlatIP was used for high precision brute-force search. text-embedding-3-large (or a suitable alternative) was used for embeddings.

Advanced Retrieval Techniques

Retrieval is key; "Garbage in, garbage out" applies here.

  • Hybrid Search: Combining vector database semantic search with keyword-based BM25 was explored. While theoretically beneficial, in its basic implementation, it often reduced retrieval quality in this challenge.

  • LLM Reranking: Utilizes an LLM to re-score the relevance of retrieved chunks/pages. This fine-grained semantic analysis improves context relevance.

    • The LLM formats its output with reasoning and relevance_score (0-1).

    • A weighted average (e.g., vector_weight = 0.3, llm_weight = 0.7) combines the initial vector score with the LLM's reranking score.

    • This is more expensive than pure vector search but more accurate, still requiring initial embedding-based filtering for efficiency.

    Python
    system_prompt_rerank_single_block = """ 你是一个RAG检索重排专家。 你将收到一个查询和一个检索到的文本块,请根据其与查询的相关 性进行评分。 评分说明: 1. 推理:分析文本块与查询的关系,简要说明理由。 2. 相关性分数(0-1,步长0.1): 0 = 完全无关 ... (details for scores) 1 = 完全匹配 3. 只基于内容客观评价,不做假设。 """
  • Parent Page Retrieval: After retrieving Top N relevant chunks (which act as pointers), the entire corresponding pages are included in the context. This captures secondary but important details that might surround the core answer in a smaller chunk.

  • Assembled Retriever Workflow:

    1. Vectorize query.

    2. Find Top 30 relevant chunks.

    3. Extract unique parent pages from chunk metadata.

    4. Process these pages using an LLM reranker.

    5. Adjust page relevance scores.

    6. Return Top 10 highest-scoring pages, formatted with page numbers.

Prompt Engineering and Augmentation

Effective prompt management is crucial. Prompts were stored in prompts.py and logically split:

  1. Core System Instructions: Defines the LLM's role and rules (e.g., AnswerWithRAGContextSharedPrompt.instruction).

  2. Pydantic Schema for Response Format: Enforces structured output (e.g., JSON) for easier parsing and validation.

    Python
    class ComparativeAnswerPrompt: class AnswerSchema(BaseModel): step_by_step_analysis: str = Field(description="详细分步推理过程...") reasoning_summary: str = Field(description="简要总结推理过程...") relevant_pages: List[int] = Field(description="保持为空列表。") final_answer: Union[str, Literal["N/A"]] = Field(description="公司名称需与问题中完全一致...")
  3. Few-shot Examples: Provides "question-answer" examples to guide the LLM's output style and format.

  4. Context and Query Templates: Inserts retrieved context and user queries dynamically.

    Python
    user_prompt = """ 以下是上下文: \"\"\" {context} \"\"\"

Code
以下是问题: "{question}" """ ```

Generation with LLMs

High-quality generation relies on several techniques:

  • Query Routing to Database: In competitive settings, questions explicitly mentioning company names allow direct routing to the specific company's vector database, reducing search scope and improving efficiency.
  • Query Routing to Prompts: For questions with specific answer types (e.g., int, float, bool, string list), the system uses if-else logic to select pre-designed prompt templates tailored to each data type. This simplifies tasks for the LLM, reducing error rates by minimizing rules per request.
  • Compound Query Routing: For complex comparative questions (e.g., "Which company has higher revenue?"), the LLM first decomposes the question into multiple independent sub-questions, processes them in parallel, and then synthesizes the final answer from the collected data.
  • Chain of Thought (CoT): Encourages the model to "think step by step" before answering. For weaker models, explicit guidance on reasoning steps, goals, and examples is critical to prevent "fake reasoning."
  • Structured Outputs (SO): Uses Pydantic schemas to force the model to return standardized JSON formats. This ensures consistency and simplifies post-processing.
  • CoT + SO: Combining these provides a dedicated field for detailed reasoning (step_by_step_analysis) and a separate field for the concise final answer (final_answer), allowing direct extraction of the answer without parsing lengthy explanations.
  • Instruction Refinement: This involves extensive iteration and debugging to define how the AI should interpret ambiguous requests and handle edge cases (e.g., what alternative job titles count as "CEO", or how to respond if information is not found).

RAG System Tuning and Optimization

  • Configuration: All key functionalities are configurable (e.g., use_serialized_tables, parent_document_retrieval, llm_reranking, top_n_retrieval, api_provider, answering_model).
  • Validation Set: A manually answered validation set is crucial for objectively measuring performance, identifying error patterns, and refining prompts and hyperparameters. This helps in quantizing improvements and uncovering implicit rules. For example, it confirmed that table serialization did not improve performance.
Python
class RunConfig: use_serialized_tables: bool = False parent_document_retrieval: bool = False llm_reranking: bool = False top_n_retrieval: int = 10 api_provider: str = "dashscope" #openai answering_model: str = "qwen-turbo-latest"

Building a Custom RAG System (Case Study)

The material outlines steps to adapt the RAG-Challenge-2 codebase for custom use:

  1. Run RAG-Challenge-2 Pipeline: python -m src.pipeline.

  2. Simplify Markdown Generation: Modify parsing and merging steps to output simplified per-page Markdown.

  3. Replace Docling with MinerU:

    • Use MinerU API for PDF parsing (get_task_id, get_result).

    • Need an API key and to upload PDFs (or use a local MinerU setup).

    Python
    # Example MinerU integration (simplified) import requests import time api_key = 'YOUR_API_KEY' def get_task_id(file_name): url = 'https://mineru.net/api/v4/extract/task' headers = {"Authorization": f"Bearer {api_key}", 'Content-Type': 'application/json'} data = {'url': f'https://your_storage_url/{file_name}', 'is_ocr': True} res = requests.post(url, headers=headers, json=data) return res.json()["data"]['task_id'] def get_result(task_id): url = f'https://mineru.net/api/v4/extract/task/{task_id}' headers = {"Authorization": f"Bearer {api_key}", 'Content-Type': 'application/json'} while True: res = requests.get(url, headers=headers) result = res.json()["data"] state = result.get('state') if state == 'done': print(f"Task done. Download from: {result.get('full_zip_url')}") # Logic to download and unzip return elif state in ['pending', 'running']: print("Task not complete, waiting...") time.sleep(5) else: print(f"Error: {result.get('err_msg')}") return
  4. Modify text_splitter.py: Add functions like split_markdown_reports and split_markdown_file to chunk Markdown files instead of JSON. These functions split by lines, recording start/end line numbers.

  5. Add AnswerWithRAGContextStringPrompt: Implement a new prompt type for open-ended questions where the answer is a block of text (kind=string), similar to existing number, boolean, name, names prompts. This involves defining the AnswerSchema with a final_answer of type str and updating api_requests.py to handle this new kind.

  6. Build a Streamlit UI: Create a simple frontend where users can input questions, and the backend calls the RAG pipeline to display the answers. This involves streamlit components and integrating the process_questions() logic.

Key Takeaways

  • Systematic Optimization: Success in RAG relies on systematically optimizing each pipeline component rather than relying on a single "magic solution."
  • Data Quality is Paramount: High-quality parsing and data preparation are fundamental, ensuring accurate and structured input for the RAG system.
  • Intelligent Retrieval & Reranking: Efficient retrieval, combined with advanced techniques like LLM reranking and parent page retrieval, significantly enhances the relevance of retrieved context.
  • Sophisticated Prompt Engineering: Meticulous prompt design, including Chain of Thought, structured outputs, query routing, and instruction refinement, allows even smaller LLMs to achieve high performance and precise answers.
  • Iteration and Tuning: Continuous iteration with a validation set and hyperparameter tuning is essential for understanding system behavior and driving incremental improvements across all RAG stages.

embed