03 RAG Multimodal Data Processing

Gemini Multimodal Processing

As a powerful multimodal model, Gemini is capable of processing and generating various forms of data, including text, images, audio, video, PDFs, and code. Its core advantages lie in:

Native Unified Architecture: All modalities are trained in the same representational space from pre-training onward, rather than being attached afterward as add-on modules. This reduces information loss and better preserves temporal and fine-grained detail.
End-to-End Inference: A single set of Transformer parameters directly processes any combination of inputs, for example CT images together with medical records, or a video of a handwritten recipe converted directly into a digital recipe.
Context Scale: Gemini 3 Pro supports long windows of up to 1 million tokens, capable of processing a 1-hour video or a 700-page PDF in a single pass and generating structured reports.
Generative Capabilities: Beyond understanding and analysis, it supports multimodal generation, such as text-to-image generation, image editing, and audio streaming.

Gemini API Usage Examples

Step 1: Apply for a Gemini API Key

Visit https://aistudio.google.com/app/api-keys to obtain an API key.

Step 2: Set Environment Variables

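Set GEMINI_API_KEY or GOOGLE_API_KEY. When genai.Client() is called with no arguments, as in the examples that follow, it reads the key from the environment.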
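
A small pre-flight check can make a missing key fail early with a clear message. This is a minimal sketch: resolve_api_key is a hypothetical helper, not part of the SDK, and the precedence it applies (GOOGLE_API_KEY first when both variables are set) follows the google-genai README as I understand it, so treat it as an assumption.

```python
import os


def resolve_api_key(env=None):
    """Return the API key found in the environment, or raise if none is set.

    Assumption: when both variables are present, GOOGLE_API_KEY is preferred.
    """
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY or GOOGLE_API_KEY before creating the client")
    return key


# Typical use before genai.Client():
# resolve_api_key()
```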

Text Output

Python

from google import genai

client = genai.Client()

# Text output
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Explain how large AI models work",
)
print(response.text)

Image Understanding

Python

from PIL import Image
from google import genai

client = genai.Client()

image = Image.open("dog_and_girl.jpeg")

# Note: contents is now a list that holds both the image object and the text prompt
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[image, "Explain what is in this photo"]
)
print(response.text)

Video Understanding

Python

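import time
from google import genai

client = genai.Client()

# 1. Upload the video file
print("Uploading video...")
video_file = client.files.upload(file="car.mp4")  # video of a car being scratched
print(f"Upload succeeded: {video_file.name}")

# 2. Wait for the video to finish processing (key step!)
# After upload, Google needs a few seconds to transcode the video in the cloud.
while video_file.state.name == "PROCESSING":
    print("Video is processing, please wait...")
    time.sleep(2)
    video_file = client.files.get(name=video_file.name)
# Check after the loop so that an immediate FAILED state is caught as well.
if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")
print("Video is ready, starting inference...")

# 3. Multimodal inference
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[
        video_file,
        "Describe in detail what happens in the video. If there is dialogue, extract the key lines."
    ]
)
print(response.text)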
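
The polling loop above waits indefinitely if processing never finishes. One possible refinement is a wait with a timeout, sketched below. wait_until_processed, its parameters, and the default timeout are hypothetical; only the "PROCESSING" and "FAILED" state names come from the example above.

```python
import time


def wait_until_processed(fetch_state, timeout_s=120.0, interval_s=2.0, sleep=time.sleep):
    """Poll fetch_state() until the file leaves the PROCESSING state.

    fetch_state is any zero-argument callable returning the state name as a string.
    Returns the final state name; raises on FAILED or when the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state == "FAILED":
            raise ValueError("Video processing failed")
        if state != "PROCESSING":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Video still processing after {timeout_s} seconds")
        sleep(interval_s)


# With the client from the example above, the call might look like:
# wait_until_processed(lambda: client.files.get(name=video_file.name).state.name)
```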

CASE: Disney RAG Assistant

This case study aims to build a 24/7 online AI customer service assistant for Disney, capable of automatically answering frequently asked questions, providing accurate information, and handling multimodal queries.

Challenges

Diverse Knowledge Sources: The knowledge base includes PDFs (official regulations), Word documents (internal FAQs), web announcements, and event description documents containing images and tables.
Processing Unstructured Data: Effectively extracting and understanding information from tables and images in PDF and Word documents is key to the success of RAG.
Effective Organization of Knowledge: How to chunk and index vast amounts of scattered knowledge points to ensure retrieval accuracy.
Ensuring Answer Validity: How to ensure that the final generated answers are strictly based on retrieved content, avoiding hallucinations in the LLM.

Technology Selection (Solution 2 - Using Multimodal Embedding)

Embedding Model: Multimodal-Embedding (e.g., Alibaba Cloud Tongyi tongyi-embedding-vision-plus), which uniformly processes text, images, and video.
Vector Database: FAISS (for high-performance vector retrieval). In production environments, consider Milvus, ChromaDB, Elasticsearch, etc.
LLM: Qwen-flash (for generating final answers).
Workflow Orchestration: Directly use low-level APIs (without relying on frameworks like LangChain).

Using `Multimodal-Embedding`

The Multimodal-Embedding model converts data from different modalities—such as text, images, and video—into floating-point vectors within a unified vector space (for example, tongyi-embedding-vision-plus generates 1152-dimensional vectors). This enables cross-modal retrieval and similarity calculations within the same semantic space.
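
To make "similarity within the same space" concrete, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors and their labels are made up for illustration; real vectors from the model have 1152 dimensions. Cosine similarity is shown only as one common measure; the index described later in this document ranks by L2 distance instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of a text query and two images.
query_vec = [0.9, 0.1, 0.0]
poster_vec = [0.8, 0.2, 0.1]
unrelated_vec = [0.0, 0.1, 0.9]

# The poster vector points in nearly the same direction as the query vector.
assert cosine_similarity(query_vec, poster_vec) > cosine_similarity(query_vec, unrelated_vec)
```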

Multimodal-Embedding Model Comparison

Model | Vector dimensions | Text length limit | Image limit | Video size limit | Price
tongyi-embedding-vision-plus | 1,152 | 1,024 tokens | ≤3MB, ≤8 images | ≤10MB | 0.0005 CNY

Text Embedding Example

Python

import dashscope
import json
from http import HTTPStatus

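text = "Shanghai Disneyland tickets come in three types: one-day tickets, two-day tickets, and date-specific tickets. A one-day ticket is valid on the date selected at purchase; prices vary by season, with weekday adult tickets starting at 475 CNY."
inputs = [{'text': text}]  # named "inputs" to avoid shadowing the built-in input()

resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=inputs
)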

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Image Embedding Example

Python

import dashscope
import base64
import json
from http import HTTPStatus

image_path = "./disney_knowledge_base/images/1-聚在一起说奇妙.jpg"
with open(image_path, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
image_format = "jpg"
image_data = f"data:image/{image_format};base64,{base64_image}"

inputs = [{'image': image_data}]  # named "inputs" to avoid shadowing the built-in input()

resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=inputs
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Video Embedding Example

Python

import dashscope
import json
from http import HTTPStatus

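# The multimodal embedding model currently accepts video only as a URL; local video files cannot be passed in directly.
video = "https://dataset-1255932437.cos.ap-nanjing.myqcloud.com/mp4/car.mp4"
inputs = [{'video': video}]  # named "inputs" to avoid shadowing the built-in input()

resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=inputs
)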

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
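
The three examples above print the whole response. For indexing, only the vectors are needed. The helper below is a sketch under an assumption about the payload shape: that resp.output behaves like a dict with an "embeddings" list whose items carry "embedding" and "type" keys. The printed JSON from the examples should be checked against this before relying on it; extract_embeddings is a hypothetical name.

```python
def extract_embeddings(output):
    """Collect (type, vector) pairs from an embedding response payload.

    Assumed shape (verify against the JSON printed by the examples):
    {"embeddings": [{"index": 0, "type": "text", "embedding": [...]}, ...]}
    """
    if not output:
        return []
    return [(item.get("type", ""), item["embedding"]) for item in output.get("embeddings", [])]


# Stand-in payload with a shortened vector, for illustration only.
sample_output = {"embeddings": [{"index": 0, "type": "text", "embedding": [0.1, -0.2, 0.3]}]}
vectors = extract_embeddings(sample_output)
```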

Processing .docx Files

The parse_docx function uses the python-docx library to read .docx files, iterate through all elements (paragraphs and tables), and extract them into separate content blocks.

Paragraph Processing: Extracts plain text content, strips surrounding whitespace, and marks it as "type": "text".
Table Processing: Converts Word tables to Markdown format and marks them as "type": "table".

Python

# disney_bot.py
from docx import Document as DocxDocument
import os

def parse_docx(file_path):
    doc = DocxDocument(file_path)
    content_chunks = []
    for element in doc.element.body:
        if element.tag.endswith('}p'):  # paragraph: match the full local name, not just a trailing "p"
            paragraph_text = ""
            for text_node in element.findall('.//w:t', {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}):
                paragraph_text += text_node.text if text_node.text else ""
            if paragraph_text.strip():
                content_chunks.append({"type": "text", "content": paragraph_text.strip()})
        elif element.tag.endswith('}tbl'):  # table
            # Convert the table to Markdown format
            md_table = []
            table = [t for t in doc.tables if t._element is element][0]
            if table.rows:
                header = [cell.text.strip() for cell in table.rows[0].cells]
                md_table.append("| " + " | ".join(header) + " |")
                md_table.append("|" + "---|"*len(header))
                for row in table.rows[1:]:
                    row_data = [cell.text.strip() for cell in row.cells]
                    md_table.append("| " + " | ".join(row_data) + " |")
                table_content = "\n".join(md_table)
                if table_content.strip():
                    content_chunks.append({"type": "table", "content": table_content})
    return content_chunks
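
One thing the table conversion above does not handle is cell text that contains a pipe character or a line break, either of which would corrupt the Markdown row. The sketch below shows one way to guard against that on plain lists of strings. rows_to_markdown is a hypothetical helper, not part of disney_bot.py, and the ticket data is made up.

```python
def rows_to_markdown(rows):
    """Render a list of rows (lists of cell strings) as a Markdown table.

    The first row is treated as the header. Pipes are escaped and line breaks
    flattened so each cell stays inside its column.
    """
    def clean(cell):
        return str(cell).replace("|", "\\|").replace("\n", " ").strip()

    if not rows:
        return ""
    header = [clean(c) for c in rows[0]]
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    for row in rows[1:]:
        cells = [clean(c) for c in row]
        cells += [""] * (len(header) - len(cells))  # pad short rows
        lines.append("| " + " | ".join(cells[:len(header)]) + " |")
    return "\n".join(lines)


table_md = rows_to_markdown([["Ticket", "Price"], ["One-day", "475"]])
```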

Processing .pdf Files

The parse_pdf function uses the fitz (PyMuPDF) library to open and read PDF documents page by page, breaking them down into plain text and separate image files.

Text Extraction: Uses page.get_text("text") to extract the plain text content from each page, saves it as a separate block, and records the page number.
Image Extraction: Detects and extracts all embedded images on the page, saves them to the specified image_dir directory, and records the image paths.

Python

# disney_bot.py
import fitz # PyMuPDF
import os

def parse_pdf(file_path, image_dir):
    os.makedirs(image_dir, exist_ok=True)  # make sure the image output directory exists
    doc = fitz.open(file_path)
    content_chunks = []
    for page_num, page in enumerate(doc):
        # Extract text
        text = page.get_text("text")
        content_chunks.append({"type": "text", "content": text, "page": page_num + 1})
        # Extract images
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image_path = os.path.join(image_dir, f"{os.path.basename(file_path)}_p{page_num+1}_{img_index}.{image_ext}")
            with open(image_path, "wb") as f:
                f.write(image_bytes)
            content_chunks.append({"type": "image", "path": image_path, "page": page_num + 1})
    return content_chunks

FAISS Index Construction (4-disney_build_index.py)

Parse Documents: Use functions such as parse_docx() to process Word documents, extracting text paragraphs and tables.
Text Chunking: Use the split_text() function to split text into chunks of a fixed length (e.g., chunk_size=500 characters, with overlap=50 characters of overlap).
Multimodal Embedding:
- Use the tongyi-embedding-vision-plus model to uniformly process text, images, and videos.
- Text is encoded directly.
- Images are sent after Base64 encoding.
- Videos are sampled into multiple frames, and the frame vectors are averaged.
Build FAISS Index: Use IndexFlatL2, an exact (brute-force) index that ranks by Euclidean (L2) distance.
Persistent Storage: Save the FAISS index as a .faiss file (disney_index.faiss) and the metadata as a JSON file (disney_metadata.json).
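
The text chunking step can be sketched as follows. The source of split_text() is not shown in this document, so this is only one plausible character-based implementation using the chunk_size and overlap values named above.

```python
def split_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts (chunk_size - overlap) characters after the previous one,
    so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


example_chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
# -> ["abcd", "defg", "ghij"]
```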

Query Processing (5-disney_query.py)

Load Index: Load the FAISS index and metadata JSON from files.
Query Embedding: Convert the user query into a vector using the Multimodal-Embedding model, placing it in the same space as the vectors in the index.
Similarity Retrieval: Retrieve all records, sort them by L2 distance, and convert the results to similarity scores (sim = 1/(1+distance)).
Media Intent Detection: Determine whether the user is seeking media content by matching keywords (e.g., "image," "poster," "video," etc.).
Result Filtering:
- Text: Unconditionally select the Top-K (default k=3) most similar results.
- Images/Videos: Only adopt the closest matching media if media intent is detected and the distance is less than the threshold (e.g., 3.0).
LLM Generation: Construct a prompt (combining background knowledge and the user’s query), then call qwen-flash to generate an answer.
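
The similarity retrieval step can be illustrated without FAISS by computing the distances directly. This is a sketch rather than the code in 5-disney_query.py: exact_search and the toy vectors are hypothetical. It uses squared L2 distance because, per the FAISS documentation, IndexFlatL2 reports squared Euclidean distances; if that applies here, a threshold such as 3.0 is a threshold on the squared value, which is worth confirming against the actual script.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_search(query_vec, vectors, metadata):
    """Rank every record by distance to the query and attach a similarity score."""
    results = []
    for vec, meta in zip(vectors, metadata):
        distance = squared_l2(query_vec, vec)
        results.append({
            "meta": meta,
            "distance": distance,
            "similarity": 1.0 / (1.0 + distance),  # sim = 1/(1+distance)
        })
    results.sort(key=lambda r: r["distance"])
    return results


# Toy data: three records with 2-dimensional vectors.
toy_vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
toy_metadata = [{"id": 0, "type": "text"}, {"id": 1, "type": "image"}, {"id": 2, "type": "text"}]
ranked = exact_search([0.0, 0.0], toy_vectors, toy_metadata)
```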

Metadata Format Example

JSON

// Text type
{
    "id": 0,
    "source": "退票政策.docx",
    "type": "text",
    "content": "Refund policy content..."
}
// Image type
{
    "id": 10,
    "source": "Image: poster.jpg",
    "type": "image",
    "path": "images/poster.jpg",
    "content": "[Image] poster.jpg"
}
// Video type
{
    "id": 15,
    "source": "Video: car scratch",
    "type": "video",
    "url": "https://...",
    "description": "Car scratch video"
}

Key Parameter Configuration

Python

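# Chunking parameters
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

# Distance threshold for media matches
MEDIA_DISTANCE_THRESHOLD = 3.0

# Keywords for media intent detection (matched against Chinese user queries)
IMAGE_KEYWORDS = ["图片", "海报", "照片", "看看", "长什么样"]  # image, poster, photo, take a look, what it looks like
VIDEO_KEYWORDS = ["视频", "录像", "播放"]  # video, recording, play

# Model configuration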
EMBEDDING_MODEL = "tongyi-embedding-vision-plus"
LLM_MODEL = "qwen-flash"

Unified Multimodal Vector Space and Retrieval Strategy

Text, images, and videos are mapped to a unified vector space via the same embedding model, enabling cross-modal semantic retrieval.

Retrieval Strategy: Unified Indexing + Post-Filtering

Unified Retrieval: A single query returns results across all modalities.
Intent Detection: Determines whether the user requires media content based on keywords.
Category Filtering:
- Text: Unconditionally select the Top-K results.
- Images/Videos: Only included if media intent is detected and the distance is less than MEDIA_DISTANCE_THRESHOLD (e.g., 3.0).
LLM Generation: Text results are unconditionally fed into the prompt, while media is included as an attachment to ensure answer quality.
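
The intent detection and category filtering steps can be sketched as below. The keyword lists and threshold are the values given in the configuration section; detect_media_intent and filter_results are hypothetical names, and the input is assumed to be a list of records already sorted by ascending distance, each with "type" and "distance" keys.

```python
IMAGE_KEYWORDS = ["图片", "海报", "照片", "看看", "长什么样"]
VIDEO_KEYWORDS = ["视频", "录像", "播放"]
MEDIA_DISTANCE_THRESHOLD = 3.0


def detect_media_intent(query):
    """Return which media types the query asks for, based on keyword matching."""
    return {
        "image": any(keyword in query for keyword in IMAGE_KEYWORDS),
        "video": any(keyword in query for keyword in VIDEO_KEYWORDS),
    }


def filter_results(query, ranked, k=3):
    """Keep the top-k text results, plus the closest media item per requested type."""
    intent = detect_media_intent(query)
    texts = [r for r in ranked if r["type"] == "text"][:k]
    media = []
    for kind in ("image", "video"):
        if not intent[kind]:
            continue
        candidates = [r for r in ranked if r["type"] == kind]
        # ranked is sorted by distance, so the first candidate is the closest one.
        if candidates and candidates[0]["distance"] < MEDIA_DISTANCE_THRESHOLD:
            media.append(candidates[0])
    return texts, media


sample_ranked = [
    {"type": "text", "distance": 0.5},
    {"type": "image", "distance": 1.2},
    {"type": "text", "distance": 1.5},
    {"type": "video", "distance": 2.0},
]
texts, media = filter_results("最近的万圣节活动海报是什么", sample_ranked)
```

In this sample the query contains "海报" (poster), so the image is kept and the video is not, even though the video is also under the threshold.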

Case Demonstration (Sample User Questions)

User Question 1: "I would like to know the refund process for Disney tickets."
- Result: Multiple text chunks are matched, containing information such as the refund policy; the LLM generates a detailed explanation of the refund process based on these text chunks.
User Question 2: "What is the recent Halloween event poster?"
- Result: Matched a text chunk and an image chunk (2-Halloween.jpeg). Upon detecting the keyword "poster," the LLM combines the text information to describe the Halloween event and includes a link to the image.
User Question 3: "My car was scratched. Can you see the video?"
- Result: Matched a text chunk and a video chunk (URL of the car scrape video). Acting as a Disney customer service representative, the LLM explains that it cannot view the surveillance footage but directs the user to contact the customer service center, providing the relevant video link (if applicable).

Chunking Strategy

Knowledge chunking is a core component of RAG systems, directly impacting retrieval quality and answer accuracy.

Common Chunking Strategies

Fixed-Length Chunking (with Overlap)

Core Approach: Split text into fixed-length segments, preferring sentence boundaries as cut points so that sentences are not broken. Adjacent segments typically overlap by a fixed length.
Pros and Cons: Simple to implement, fast, and uniform in length. Suitable for scenarios that need consistent chunk lengths and batch processing, such as technical documentation and specifications. Not suitable for semantically sensitive Q&A.
Code Example: (Implemented via `1-fixed-length-slicing.py`)
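
Since `1-fixed-length-slicing.py` is not reproduced here, the following is only a sketch of the idea: take a fixed-size window, then back off to the last sentence terminator in the second half of the window if there is one. The function name, the terminator set, and the "second half" rule are assumptions made for illustration.

```python
def fixed_length_chunks(text, chunk_size=500, overlap=50, terminators="。！？.!?\n"):
    """Fixed-length chunking that prefers to cut right after a sentence terminator."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for the last terminator in the second half of the window.
            cut = max(text.rfind(t, start + chunk_size // 2, end) for t in terminators)
            if cut != -1:
                end = cut + 1
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # always move forward
    return chunks


boundary_chunks = fixed_length_chunks("aaaa。bbbb。cccc。", chunk_size=8, overlap=0)
# -> ["aaaa。", "bbbb。", "cccc。"]
```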

Sentence-Boundary Slicing (No Overlap)

Core Approach: Using natural language processing, split the text along semantic units such as sentences and paragraphs. This preserves semantic integrity: no sentence is cut in the middle, and each segment is a complete semantic unit.
Pros and Cons: Good semantic preservation and high retrieval accuracy, but slice lengths may be uneven. Suitable for natural language text and question-answering systems. Not suitable for documents with many long sentences.
Code Example: (Implemented via `2-sentence-boundary-slicing.py`)
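
As `2-sentence-boundary-slicing.py` is not shown, here is a minimal sketch of the approach using a regular expression in place of a full NLP sentence splitter. The function names, the max_chars parameter, and the terminator rules are assumptions. A sentence longer than max_chars is kept whole, which is why long-sentence documents produce uneven chunks.

```python
import re

# A sentence ends at a CJK terminator, at an ASCII terminator followed by
# whitespace or end of text, or at the end of the text.
_SENTENCE = re.compile(r".+?(?:[。！？]|[.!?](?=\s|$)|$)", re.S)


def split_sentences(text):
    """Return the sentences of text, keeping their original spacing."""
    return [s for s in _SENTENCE.findall(text) if s.strip()]


def sentence_chunks(text, max_chars=200):
    """Pack whole sentences into chunks of roughly max_chars, with no overlap."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = sentence
        else:
            current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks


packed = sentence_chunks("一。二二。三三三。", max_chars=5)
# -> ["一。二二。", "三三三。"]
```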

LLM Semantic Slicing

Core Approach: Leverages the LLM's semantic understanding to control chunk length precisely while maintaining semantic integrity. The LLM can intelligently select segmentation points.
Pros and Cons: Strong semantic understanding and intelligent segmentation point selection; however, it relies on GPUs and is relatively costly. Suitable for projects with high quality requirements, complex semantic structures, and sufficient budget. Not suitable for large-scale document processing or cost-sensitive scenarios.
Code Example:

```python
prompt = f"""
Please slice the following text while preserving semantic integrity, with each slice not exceeding {max_chunk_size} characters.

Requirements:
1. Maintain semantic integrity
2. Split at natural breakpoints
3. Return a list of chunks in JSON format, as follows:
{{
    "chunks": [
        "First chunk content",
        "Second chunk content",
        ...
    ]
}}

Text content:
{text}

Please return a list of slices in JSON format:
"""
# Call LLM to generate slices
```
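
The model's reply still has to be turned into a list of chunks, and replies often wrap the JSON in extra text or a code fence. The sketch below shows one defensive way to parse it. parse_llm_chunks is a hypothetical helper, and the reply string is a made-up example of the format requested by the prompt.

```python
import json
import re


def parse_llm_chunks(response_text):
    """Extract the "chunks" list from an LLM reply that contains a JSON object."""
    match = re.search(r"\{.*\}", response_text, re.S)  # first "{" through last "}"
    if not match:
        raise ValueError("no JSON object found in the LLM response")
    data = json.loads(match.group(0))
    chunks = data.get("chunks")
    if not isinstance(chunks, list):
        raise ValueError('expected a "chunks" list in the LLM response')
    return [c.strip() for c in chunks if isinstance(c, str) and c.strip()]


reply = 'Here are the slices:\n{"chunks": ["First chunk content", " Second chunk content "]}'
parsed_chunks = parse_llm_chunks(reply)
```

If parsing fails, falling back to one of the rule-based strategies is a reasonable option.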

Hierarchical Slicing

Core Approach: Split the document according to its hierarchical structure (headings, sections, paragraphs), treating each structural unit as an independent block.
Pros and Cons: Preserves document structure, makes logical relationships easier to follow, and supports hierarchical queries; however, it depends on document formatting. Suitable for structured documents (manuals, specifications, API documentation). Not suitable for plain text without headings.
Code Example: (Implemented via `4-Hierarchical-Slicing.py`)
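
`4-Hierarchical-Slicing.py` is not included here, so the sketch below assumes the document uses Markdown-style "#" headings, which is an assumption about the input format. Each chunk records the heading trail it sits under, which is what makes hierarchical queries possible. The function name and sample text are hypothetical.

```python
import re


def hierarchical_chunks(markdown_text):
    """Split text at Markdown headings; each chunk carries its heading path."""
    chunks = []
    path = []  # current heading trail, e.g. ["Tickets", "Refunds"]
    body = []  # lines collected under the current heading

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"path": " > ".join(path), "content": content})
        body.clear()

    for line in markdown_text.splitlines():
        heading = re.match(r"^(#{1,6})\s+(.*)$", line)
        if heading:
            flush()
            level = len(heading.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(heading.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


sample_doc = "# Tickets\nTicket overview.\n## Refunds\nRefund rules.\n# Parking\nParking info."
sections = hierarchical_chunks(sample_doc)
```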

Sliding Window Slicing

Core Approach: Slide a fixed-size window across the text to generate overlapping slices. The overlap keeps context continuous, reduces information loss, and improves retrieval recall.
Pros and Cons: Maintains contextual continuity and reduces information loss; however, it generates a large amount of overlapping content. Suitable for long documents and for scenarios where overlapping context must be maintained. Not suitable where stored data is sensitive or heavy deduplication is required.
Code Example: (Implemented via `5-Sliding Window Slicing.py`)
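
Because `5-Sliding Window Slicing.py` is not shown, this sketch picks one possible unit for the window: a list of sentences, with the window size and step counted in sentences. The names and defaults are assumptions. The overlap between neighbouring windows is window_size minus step.

```python
def sliding_windows(units, window_size=3, step=1):
    """Return overlapping windows over a list of units (for example, sentences)."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    windows = []
    for start in range(0, len(units), step):
        windows.append(units[start:start + window_size])
        if start + window_size >= len(units):
            break  # the last window already reaches the end
    return windows


sentences = ["s1", "s2", "s3", "s4", "s5"]
windows = sliding_windows(sentences, window_size=3, step=2)
# -> [["s1", "s2", "s3"], ["s3", "s4", "s5"]]
```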

Summary of Slicing Strategy Comparison

Scenario Selection Recommendations

General Scenarios: Fixed-Length Slicing (simple and reliable)
Technical Documentation: Hierarchical Slicing (preserves structure)
High-Quality Requirements: LLM Semantic Slicing (best semantic quality, highest cost)
Long Document Retrieval: Sliding Window Slicing (reduces information loss at slice boundaries)

Key Points

Gemini’s Native Multimodal Capabilities: With the ability to uniformly understand, reason about, and generate text, images, videos, and other multimodal data, Gemini serves as the foundation for building advanced RAG applications.
Core Role of Multimodal Embedding: Maps data from different modalities into a unified vector space, enabling cross-modal semantic retrieval and similarity calculations, thereby simplifying the implementation of multimodal RAG.
RAG Assistant Workflow: Covers the complete process, including data processing (parsing, segmentation), vectorization (embedding, FAISS indexing), retrieval (unified retrieval, intent detection, filtering), and generation (LLM answer construction).
The Importance and Diversity of Slicing Strategies: Different knowledge slicing strategies (fixed length, sentence boundaries, LLM semantics, hierarchical, sliding window) each have their own advantages and disadvantages. The most suitable strategy must be selected based on specific scenarios and requirements to improve retrieval quality and answer accuracy.
Unified Indexing and Post-Filtering Mechanism: By constructing a single multimodal vector index and combining it with intent detection and result filtering, we achieve efficient and precise multimodal information retrieval while avoiding the complexity of maintaining multiple indexing systems.
