LLM Engineering · May 20, 2026

01 Embeddings and Vector Databases

RAG · LLM · Python

What Is Embedding

An embedding maps features such as text or images to a vector of fixed dimension. It acts as a form of dimension reduction: discrete data is mapped into a low-dimensional, dense vector space that captures its semantic information.

Core Concepts

  • Dimension Reduction: Converting high-dimensional, sparse discrete variables (such as one-hot encoding) into dense vectors of a fixed size.
  • Semantic Similarity: In the vector space, semantically similar objects lie closer together, as measured by a metric such as cosine similarity.
  • Computability: Vectors can undergo mathematical operations to enable semantic reasoning.

Calculation of Cosine Similarity

Cosine similarity is used to measure the similarity in direction between two vectors, with a range of [-1, 1].

  • 1 indicates that the directions are exactly the same (high similarity)
  • 0 indicates orthogonal directions (no correlation)
  • -1 indicates completely opposite directions (highly dissimilar)
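For two vectors a and b, the measure is cos(a, b) = (a · b) / (‖a‖ ‖b‖). A minimal sketch in plain Python follows; the function name cosine_similarity and the sample vectors are illustrative, not taken from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])         # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])   # perpendicular
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction
```

Because the result depends only on direction, scaling either vector by a positive constant leaves it unchanged.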

N-Gram

N-gram is a simple text feature extraction method based on the assumption that the occurrence of the nth word is related only to the preceding n-1 words.

  • Unigram (N=1): A single word
  • Bigram (N=2): A combination of two consecutive words
  • Trigram (N=3): A combination of three consecutive words
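As a small illustration of the three cases, n-grams can be produced by sliding a window of size n over a token list. The helper name ngrams and the sample sentence are made up for this sketch.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the hotel is near the market".split()
unigrams = ngrams(tokens, 1)   # single words
bigrams = ngrams(tokens, 2)    # pairs of consecutive words
trigrams = ngrams(tokens, 3)   # triples of consecutive words
```

A sequence of m tokens yields m - n + 1 n-grams, and none at all when n exceeds m.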

Code Example: N-Gram Word Frequency Count

Python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")

def get_top_n_words(corpus, n=1, k=None):
    """Return the k most frequent n-grams in corpus as (ngram, count) pairs."""
    vec = CountVectorizer(ngram_range=(n, n), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)  # total count of each n-gram across all documents
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:k]

# Plot the 20 most frequent bigrams in the hotel descriptions
common_words_bigram = get_top_n_words(df['desc'], 2, 20)
df_bigram = pd.DataFrame(common_words_bigram, columns=['desc', 'count'])
df_bigram.groupby('desc').sum()['count'].sort_values(ascending=True).plot(
    kind='barh', title='Top20 Bigram'
)
plt.show()

TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) evaluates the importance of a term in a specific document within a document collection.

  • TF (Term Frequency): Number of times a word appears in a document / Total number of words in the document
  • IDF (Inverse Document Frequency): log(total number of documents / (number of documents containing the word + 1)); the +1 in the denominator avoids division by zero
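The two definitions above can be computed by hand on a tiny corpus. This is a sketch only: the four documents and the function names are invented for illustration, and scikit-learn's TfidfVectorizer applies a differently smoothed IDF by default, so its numbers will not match this sketch exactly.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (hypothetical data)
docs = [
    "quiet hotel near the market".split(),
    "hotel with a pool".split(),
    "market tour and food tasting".split(),
    "boat tour".split(),
]

def tf(term, doc):
    """Occurrences of term in doc divided by the document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(N / (df + 1)), where df counts the documents containing term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (df + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

pool_score = tf_idf("pool", docs[1], docs)    # "pool" occurs in 1 of 4 documents
hotel_score = tf_idf("hotel", docs[1], docs)  # "hotel" occurs in 2 of 4 documents
```

Both words appear once in the second document, but the rarer word "pool" receives the higher weight, which is the behavior IDF is meant to produce.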

Code Example: TF-IDF Feature Extraction and Cosine Similarity

Python
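import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Assumes df (loaded above) has a 'name' column and a preprocessed 'desc_clean' column.
# min_df=1 (the default) replaces the original min_df=0, which recent
# scikit-learn releases may reject; both keep every term.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df['desc_clean'])

# TF-IDF rows are L2-normalized by default, so the dot product equals cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# Assumed lookup (not defined in the original snippet): hotel name -> row position
indices = pd.Series(range(len(df)), index=df['name'])

def recommendations(name, cosine_similarities=cosine_similarities):
    """Return the names of the 10 hotels most similar to the given hotel."""
    idx = indices[name]
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)
    top_10_indexes = list(score_series.iloc[1:11].index)  # position 0 is the hotel itself
    return [df['name'].iloc[i] for i in top_10_indexes]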

Word Embedding

Word embedding maps words to a low-dimensional, dense, continuous vector space, where semantically similar words are closer together in the vector space.

Core Features

  • Dimension Reduction: Converts high-dimensional one-hot vectors into embedding vectors of a fixed dimension
  • Capturing semantic information: Vectors capture the semantic and syntactic information of words
  • Computability: For example, king - man + woman ≈ queen
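The analogy arithmetic can be illustrated with hand-made vectors. This is a toy sketch: the three dimensions (royalty, male, female) and all values are invented for illustration and are not learned embeddings, where the relationship only holds approximately.

```python
import math

# Hand-made 3-d vectors: (royalty, male, female). Not learned from data.
vectors = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.7, 1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed element by element
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity, excluding the three input words
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
```

Excluding the input words from the candidates mirrors common practice in analogy evaluation, since the nearest vector to the result is otherwise often one of the inputs.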

Word2Vec

Word2Vec has two main modes:

  1. Skip-Gram: Given a center word, predict its surrounding context words
  2. CBOW (Continuous Bag-of-Words): Given the surrounding context words, predict the center word
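The difference between the two modes is easiest to see in the training examples each one builds from a sentence. The sketch below only generates (input, target) pairs and does not train a model; the helper names and the sample sentence are illustrative.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is the input."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

def cbow_pairs(tokens, window=2):
    """(context_words, center) pairs: the context words are the input."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]

tokens = ["the", "cat", "sat", "down"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)
```

Skip-Gram produces one example per (center, context) pair, while CBOW produces one example per position with all context words grouped together.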

Training Word2Vec using Gensim

Python
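import multiprocessing
from gensim.models import word2vec

# Read every file under ./segment, one pre-tokenized sentence per line
sentences = word2vec.PathLineSentences('./segment')

# sg is left at its default (0), which trains CBOW; pass sg=1 for Skip-Gram
model = word2vec.Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context words considered on each side
    min_count=5,       # ignore words seen fewer than 5 times
    workers=multiprocessing.cpu_count()
)

# Similarity between two characters from Journey to the West (Sun Wukong, Zhu Bajie)
print(model.wv.similarity('孙悟空', '猪八戒'))
model.save('./models/word2Vec.model')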

Choosing an Embedding Model

MTEB Leaderboard

MTEB (Massive Text Embedding Benchmark) is a comprehensive evaluation benchmark for text embedding models, covering 8 major task categories and 58 datasets, including classification, clustering, retrieval, re-ranking, STS, and other task types.

The Impact of Vector Dimension on Model Performance

  • High Dimensions (1024, 4096): Can encode richer semantic information, at higher computational and storage cost
  • Low Dimensions (256, 512): Faster computation and lower memory requirements, suitable for resource-constrained scenarios
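The storage side of this trade-off is simple arithmetic: a flat index of float32 vectors needs roughly 4 bytes per value. The sketch below assumes float32 storage and ignores any index overhead; the function name is illustrative.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dim * bytes_per_value

# Compare the raw footprint of one million vectors at several dimensions
for dim in (256, 1024, 4096):
    gib = index_size_bytes(1_000_000, dim) / 2**30
    print(f"{dim:>5} dims: {gib:.2f} GiB per million vectors")
```

Storage grows linearly with dimension, and so does the cost of each brute-force distance computation.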

Matryoshka Representation Learning (MRL)

The jina-embeddings models use MRL to produce nested ("Matryoshka doll") vectors:

  • Generates the most complete high-dimensional vectors internally (e.g., 2048 dimensions)
  • The prefix sub-vectors (e.g., 128/256/512 dimensions) can be used independently
  • Specify dimensions on demand via the embedding_size parameter
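On the client side, using a prefix sub-vector amounts to slicing the full vector and rescaling it to unit length. The sketch below assumes the vector came from an MRL-trained model; truncating embeddings from a model trained without MRL generally loses much more quality. The function name and the sample vector are illustrative.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]

full = [0.5, 0.5, 0.5, 0.5]          # stand-in for a full-size embedding
short = truncate_embedding(full, 2)  # 2-d prefix, renormalized
```

Renormalizing matters when similarity is computed with a dot product, because the prefix of a unit vector is usually shorter than unit length.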

Model Selection Recommendations

| Scenario | Recommendation |
| --- | --- |
| Chinese only | Monolingual models such as BGE-large-zh |
| Multilingual or cross-language retrieval | m3e-base, multilingual-e5-large |
| Peak performance | Shortlist candidates from the MTEB leaderboard, then evaluate them on your own gold-standard test set |
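Evaluating on a gold-standard test set can be as simple as checking how often each model retrieves a known relevant document in its top k results. The sketch below uses hit rate at k as the metric; the query IDs, document IDs, and ranked results are hypothetical, and in practice the rankings would come from running each candidate model over your own corpus.

```python
def hit_rate_at_k(results, gold, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for query, ranked_ids in results.items()
        if set(ranked_ids[:k]) & gold[query]
    )
    return hits / len(results)

# Gold standard: query id -> set of relevant document ids (hypothetical)
gold = {"q1": {"doc3"}, "q2": {"doc7", "doc8"}}

# Ranked retrieval output of two candidate models (hypothetical)
results_model_a = {"q1": ["doc3", "doc1"], "q2": ["doc2", "doc7"]}
results_model_b = {"q1": ["doc1", "doc3"], "q2": ["doc2", "doc5"]}

score_a = hit_rate_at_k(results_model_a, gold, k=2)
score_b = hit_rate_at_k(results_model_b, gold, k=2)
```

A model that ranks well on a public leaderboard can still score lower than a competitor on domain-specific queries, which is why the comparison should be run on your own data.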

Vector Databases

Vector databases store and query high-dimensional vector embeddings derived from unstructured data, where the distance between vectors represents the semantic similarity of the original data.
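At its core, a query is a nearest-neighbor search over the stored vectors. The sketch below performs an exact search by scanning every vector, which is the behavior real vector databases approximate with specialized indexes at scale. The IDs and 2-d vectors are made up for illustration.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(index, query, k):
    """Exact k-nearest-neighbor search by scanning every stored vector."""
    scored = sorted((l2_distance(vec, query), vec_id)
                    for vec_id, vec in index.items())
    return [(vec_id, dist) for dist, vec_id in scored[:k]]

# Toy "database": id -> embedding (hypothetical values)
index = {
    "refund-policy": [0.9, 0.1],
    "member-discount": [0.1, 0.9],
    "opening-hours": [0.5, 0.5],
}

hits = search(index, [0.8, 0.2], k=2)
```

A full scan costs time proportional to the number of stored vectors per query, which is the motivation for approximate indexes in larger deployments.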

Core Value

  • Provides long-term memory for large models, addressing the limitations of context windows
  • Enables semantic search within private knowledge bases
  • Empowers applications such as recommendation systems and image search

Comparison of Common Vector Databases

| Database | Features | Use Cases |
| --- | --- | --- |
| FAISS | Developed by Meta AI, pure algorithm library, GPU-accelerated | Research scenarios, requires deep integration |
| Milvus | Open-source, cloud-native, highly scalable | Enterprise-grade, massive-scale data |
| Pinecone | Fully managed, serverless | Quick deployment, low maintenance |
| Weaviate | Built-in vectorization module | Simplified ETL process |
| Qdrant | Built with Rust, complex filtering | Extreme performance requirements |
| Elasticsearch | Hybrid search | Text + vector hybrid scenarios |

FAISS Use Case: Embedding and Metadata Import

Python
import os

import faiss
import numpy as np
from openai import OpenAI

# DashScope exposes an OpenAI-compatible endpoint, so the OpenAI client is reused here
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

documents = [
    {"id": "doc1", "text": "Disney ticket refund policy...", "metadata": {"category": "refund policy"}},
    {"id": "doc2", "text": "Annual pass holders receive discounts...", "metadata": {"category": "member benefits"}},
]

# Generate one embedding per document
metadata_store = []
vectors_list = []
for doc in documents:
    result = client.embeddings.create(
        model="text-embedding-v4",
        input=doc["text"],
        dimensions=1024,
        encoding_format="float"
    )
    vectors_list.append(result.data[0].embedding)
    metadata_store.append(doc)

vectors_np = np.array(vectors_list).astype('float32')
vector_ids_np = np.arange(len(vectors_list), dtype='int64')  # FAISS expects int64 IDs

# Build the FAISS index; IndexIDMap ties each vector to an ID that keys into metadata_store
index = faiss.IndexIDMap(faiss.IndexFlatL2(1024))
index.add_with_ids(vectors_np, vector_ids_np)

# Search
query = client.embeddings.create(
    model="text-embedding-v4", input="refund process", dimensions=1024, encoding_format="float"
)
query_vec = np.array([query.data[0].embedding]).astype('float32')
distances, ids = index.search(query_vec, k=3)
for i, doc_id in enumerate(ids[0]):
    if doc_id == -1:  # FAISS pads with -1 when fewer than k vectors are indexed
        continue
    print(f"Rank {i+1}: {metadata_store[doc_id]['text']} (distance: {distances[0][i]:.4f})")

FAISS does not store metadata; a "lookup table" (Redis/PostgreSQL/MongoDB) must be maintained externally to associate the original data with vector IDs.
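The lookup step can be sketched independently of any particular store. Here a plain dict stands in for Redis or a database table, and the ID and distance lists stand in for one row of FAISS search output; all names and values are hypothetical.

```python
def attach_metadata(ids, distances, metadata_by_id):
    """Join search results with metadata, skipping -1 ("no result") IDs."""
    results = []
    for vec_id, dist in zip(ids, distances):
        if vec_id == -1:  # FAISS returns -1 when fewer than k vectors match
            continue
        results.append({"id": vec_id, "distance": dist, **metadata_by_id[vec_id]})
    return results

# Stand-in for an external lookup table keyed by vector ID
metadata_by_id = {
    0: {"category": "refund policy"},
    1: {"category": "member benefits"},
}

# Stand-in for one row of search output; float("inf") marks the padded slot
rows = attach_metadata([0, 1, -1], [0.12, 0.87, float("inf")], metadata_by_id)
```

Keeping the vector ID as the only link between the index and the store means either side can be rebuilt or replaced without touching the other, as long as the IDs stay stable.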

Key Points

  • Embedding is a technique that transforms unstructured data (text, images) into dense, fixed-dimension vectors; semantically similar content is closer together in the vector space
  • TF-IDF + N-Gram are traditional feature extraction methods that produce sparse matrices; neural network methods such as Word2Vec generate dense vectors with stronger semantic representation
  • When selecting an embedding model, refer to the MTEB leaderboard and evaluate it on your own gold test set; do not rely solely on leaderboard rankings
  • Vector databases provide LLMs with long-term memory and the ability to retrieve information from private knowledge bases; FAISS is suitable for research, while Milvus and Qdrant are suitable for production
  • FAISS does not store metadata itself; IndexIDMap must be used to associate vector IDs with external metadata storage (Redis/DB).
