01 Embeddings and Vector Databases
Summary
This note delves into the core concepts and applications of embeddings and vector databases. Embedding involves transforming unstructured data such as text and images into high-dimensional vectors to capture their semantic information. Vector databases efficiently store and retrieve these vectors, providing LLMs with long-term memory and semantic search capabilities. The note also details the practical applications of embedding techniques such as Word2Vec and TF-IDF, as well as vector databases like FAISS.
What Is Embedding
Embedding is a dimension reduction technique designed to transform various features (such as text, images, etc.) into vectors of a fixed dimension, thereby mapping discrete data into a low-dimensional, dense mathematical space and capturing its semantic information.
Core Concepts
- Dimension Reduction: Converting high-dimensional, sparse discrete variables (such as one-hot encoding) into dense vectors of a fixed size.
- Semantic Similarity: In the vector space, objects with similar semantics lie closer together, as measured by a metric such as cosine similarity.
- Computability: Vectors can undergo mathematical operations to enable semantic reasoning.
Calculation of Cosine Similarity
Cosine similarity is used to measure the similarity in direction between two vectors, with a range of [-1, 1].
- `1` indicates that the directions are exactly the same (high similarity)
- `0` indicates orthogonal directions (no correlation)
- `-1` indicates completely opposite directions (highly dissimilar)
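For two vectors a and b, cosine similarity is the dot product divided by the product of the two norms: cos(θ) = (a · b) / (‖a‖ ‖b‖). A minimal sketch in plain Python follows; the function name `cosine_similarity` and the sample vectors are illustrative, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Because the result depends only on direction, scaling either vector by a positive constant leaves the score unchanged.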
N-Gram
N-gram is a simple text feature extraction method based on the assumption that the occurrence of the nth word is related only to the preceding n-1 words.
- Unigram (N=1): A single word
- Bigram (N=2): A combination of two consecutive words
- Trigram (N=3): A combination of three consecutive words
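How the three cases are formed from a token list can be sketched in a few lines. The helper name `ngrams` and the sample sentence are made up for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 n-grams (none if L < n)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the hotel is near the lake".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```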
Code Example: N-Gram Word Frequency Count
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

# Hotel descriptions are read from the 'desc' column
df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")

def get_top_n_words(corpus, n=1, k=None):
    """Return the k most frequent n-grams in corpus as (ngram, count) pairs."""
    # ngram_range=(n, n) keeps only n-grams of exactly n words
    vec = CountVectorizer(ngram_range=(n, n), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each column to get the corpus-wide count of every n-gram
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:k]

common_words_bigram = get_top_n_words(df['desc'], 2, 20)
df_bigram = pd.DataFrame(common_words_bigram, columns=['desc', 'count'])
# Sort ascending so the most frequent bigram ends up at the top of the horizontal bar chart
df_bigram.groupby('desc').sum()['count'].sort_values(ascending=True).plot(
    kind='barh', title='Top20 Bigram'
)
plt.show()
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) evaluates the importance of a term in a specific document within a document collection.
- TF (Term Frequency): Number of times a word appears in a document / Total number of words in the document
- IDF (Inverse Document Frequency): log(total number of documents / (number of documents containing the word + 1))
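A small worked example of the two definitions, with TF = count / document length and IDF = log(N / (df + 1)), where N is the number of documents and df the number of documents containing the term. The toy corpus and function names are invented for illustration. Libraries such as scikit-learn apply a differently smoothed IDF and normalize each row, so their numbers will not match this sketch.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Occurrences of term in the document divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """log(N / (df + 1)); the +1 avoids division by zero for unseen terms."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus, already tokenized
corpus = [
    "quiet hotel near the lake".split(),
    "hotel with free parking".split(),
    "downtown hotel with pool".split(),
    "cozy cabin near the lake".split(),
]

for term in ("hotel", "lake", "quiet"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

A term that appears in most documents ("hotel") gets a weight near zero, while a term unique to one document ("quiet") gets the highest weight.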
Code Example: TF-IDF Feature Extraction and Cosine Similarity
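import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# Assumes df['desc_clean'] holds the preprocessed hotel descriptions.
# min_df=1 (the default) behaves like the original min_df=0, which newer scikit-learn releases may reject.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df['desc_clean'])
# TF-IDF rows are L2-normalized by default, so the dot product equals cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
# Map each hotel name to its row position (assumes names are unique)
indices = pd.Series(range(len(df)), index=df['name'])

def recommendations(name, cosine_similarities=cosine_similarities):
    """Return the 10 hotels whose descriptions are most similar to the given hotel."""
    idx = indices[name]
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)
    # Position 0 is the hotel itself, so take positions 1 through 10
    top_10_indexes = list(score_series.iloc[1:11].index)
    return [df['name'].iloc[i] for i in top_10_indexes]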
Word Embedding
Word embedding maps words to a low-dimensional, dense, continuous vector space, where semantically similar words are closer together in the vector space.
Core Features
- Dimension Reduction: Converts high-dimensional one-hot vectors into embedding vectors of a fixed dimension
- Capturing semantic information: Vectors capture the semantic and syntactic information of words
- Computability: For example, `king - man + woman ≈ queen`
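The analogy can be illustrated with a sketch that uses hand-made toy vectors. These are not learned embeddings: the three axes (royalty, gender, fruit) and every value below are assumptions chosen so the arithmetic is easy to follow.

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
# Hypothetical axes: royalty, gender (+1 male, -1 female), fruit.
vectors = {
    "king":   [1.0,  1.0, 0.0],
    "queen":  [1.0, -1.0, 0.0],
    "man":    [0.0,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.0],
    "prince": [0.8,  1.0, 0.0],
    "apple":  [0.0,  0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen, by construction of the toy vectors
```

With a trained Gensim model, the corresponding query is `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`; results on real embeddings are approximate and depend on the training corpus.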
Word2Vec
Word2Vec has two main modes:
- Skip-Gram: Given a center word, predict its surrounding context words
- CBOW (Continuous Bag-of-Words): Given the surrounding context words, predict the center word
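The difference between the two modes shows up in how training examples are laid out. The sketch below only builds the (input, target) pairs for a fixed window. It is a simplification: a real Word2Vec implementation adds steps such as subsampling and negative sampling, and the function name `training_pairs` is hypothetical.

```python
def training_pairs(tokens, window=2):
    """Build Skip-Gram and CBOW examples from one tokenized sentence."""
    skip_gram, cbow = [], []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        # Up to `window` words on each side of the center word
        context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
        # Skip-Gram: input is the center word, target is one context word
        skip_gram.extend((center, c) for c in context)
        # CBOW: input is the whole context, target is the center word
        cbow.append((context, center))
    return skip_gram, cbow

sg_pairs, cbow_pairs = training_pairs(["a", "b", "c"], window=1)
print(sg_pairs)
print(cbow_pairs)
```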
Training Word2Vec using Gensim
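from gensim.models import word2vec
import multiprocessing

# Read every file under ./segment; each line is one whitespace-tokenized sentence
sentences = word2vec.PathLineSentences('./segment')
# sg is left at its default (0), which trains CBOW; pass sg=1 for Skip-Gram
model = word2vec.Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # maximum distance between the center word and a context word
    min_count=5,       # ignore words that occur fewer than 5 times
    workers=multiprocessing.cpu_count()
)
# Cosine similarity between two words in the corpus (Sun Wukong and Zhu Bajie)
print(model.wv.similarity('孙悟空', '猪八戒'))
# The ./models directory must already exist
model.save('./models/word2Vec.model')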
Choosing an Embedding Model
MTEB Leaderboard
MTEB (Massive Text Embedding Benchmark) is a comprehensive evaluation benchmark for text embedding models, covering 8 major task categories and 58 datasets, including classification, clustering, retrieval, re-ranking, STS, and other task types.
The Impact of Vector Dimension on Model Performance
- High dimensions (1024, 4096): Rich in semantic information, but with higher computational and storage costs
- Low dimensions (256, 512): Faster computation and lower memory requirements, suitable for resource-constrained scenarios
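The storage side of this trade-off is simple arithmetic: raw vector storage is number of vectors × dimensions × bytes per value. The sketch below assumes float32 values (4 bytes) and ignores any index overhead; the one-million-vector collection is a made-up example size.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming float32 by default."""
    return num_vectors * dim * bytes_per_value

for dim in (256, 512, 1024, 4096):
    gib = raw_vector_bytes(1_000_000, dim) / 2**30
    print(f"{dim:>5} dims: {gib:.2f} GiB for 1M float32 vectors")
```

Storage grows linearly with the dimension, so moving from 256 to 4096 dimensions multiplies the footprint by 16.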
Matryoshka Representation Learning (MRL)
The Jina-embeddings model employs MRL technology to generate vectors with "Matryoshka doll" characteristics:
- Generates the most complete high-dimensional vectors internally (e.g., 2048 dimensions)
- The prefix sub-vectors (e.g., 128/256/512 dimensions) can be used independently
- Specify dimensions on demand via the `embedding_size` parameter
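On the client side, using a prefix sub-vector amounts to slicing the full vector. The sketch below also rescales the prefix to unit length, a common step before cosine or dot-product comparison; this rescaling is an assumption here, not something the note specifies. The function name `truncate_embedding` is hypothetical, and truncation is only meaningful for models trained with MRL.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0:
        return list(prefix)  # nothing to rescale
    return [x / norm for x in prefix]

full = [3.0, 4.0, 12.0]          # stand-in for a full-size embedding
print(truncate_embedding(full, 2))
```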
Model Selection Recommendations
| Scenarios | Recommendation |
|---|---|
| Chinese-only scenarios | Monolingual models such as BGE-large-zh |
| Multilingual and cross-lingual retrieval | m3e-base, multilingual-e5-large |
| Pursuing peak performance | Consult the MTEB leaderboard, then evaluate the candidates on your own gold-standard test set |
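Evaluating candidates on a gold-standard test set can be as simple as checking whether each query's known relevant documents appear in the top-k results. The sketch below computes a hit rate at k. The metric choice, the function name `hit_rate_at_k`, and all query and document IDs are illustrative assumptions.

```python
def hit_rate_at_k(retrieved, gold, k):
    """Fraction of gold queries with at least one relevant doc in the top k."""
    if not gold:
        raise ValueError("gold set is empty")
    hits = 0
    for query_id, relevant_ids in gold.items():
        top_k = retrieved.get(query_id, [])[:k]
        if any(doc_id in relevant_ids for doc_id in top_k):
            hits += 1
    return hits / len(gold)

# Gold set: query ID -> set of relevant document IDs (made-up data)
gold = {"q1": {"doc3"}, "q2": {"doc7", "doc9"}, "q3": {"doc1"}}
# Ranked results from two hypothetical embedding models
model_a = {"q1": ["doc3", "doc5"], "q2": ["doc2", "doc9"], "q3": ["doc4", "doc6"]}
model_b = {"q1": ["doc5", "doc3"], "q2": ["doc7", "doc2"], "q3": ["doc1", "doc8"]}

for name, results in (("model_a", model_a), ("model_b", model_b)):
    print(name, hit_rate_at_k(results, gold, 1), hit_rate_at_k(results, gold, 2))
```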
Vector Databases
Vector databases store and query high-dimensional vector embeddings derived from unstructured data, where the distance between vectors represents the semantic similarity of the original data.
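At its core, a vector query is a nearest-neighbor search. The sketch below does an exact linear scan with Euclidean distance over a handful of made-up 2-D vectors. Real vector databases typically rely on approximate indexes (for example HNSW or IVF) to avoid scanning every vector; the function names here are illustrative.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query, vectors, k=3):
    """Return the k nearest vectors as (distance, index) pairs, closest first."""
    scored = sorted((l2_distance(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]

stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
for distance, idx in search([0.9, 0.1], stored, k=2):
    print(idx, round(distance, 3))
```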
Core Value
- Provides long-term memory for large models, addressing the limitations of context windows
- Enables semantic search within private knowledge bases
- Empowers applications such as recommendation systems and image search
Comparison of Common Vector Databases
| Database | Features | Use Cases |
|---|---|---|
| FAISS | Developed by Meta AI, pure algorithm library, GPU-accelerated | Research scenarios, requires deep integration |
| Milvus | Open-source, cloud-native, highly scalable | Enterprise-grade, massive-scale data |
| Pinecone | Fully managed, serverless | Quick deployment, low maintenance |
| Weaviate | Built-in vectorization module | Simplified ETL Process |
| Qdrant | Built with Rust, complex filtering | Extreme performance requirements |
| Elasticsearch | Hybrid search | Text + vector hybrid scenarios |
FAISS Use Case: Embedding and Metadata Import
import os

import numpy as np
import faiss
from openai import OpenAI

# OpenAI-compatible client pointed at the DashScope endpoint
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)
documents = [
    {"id": "doc1", "text": "Disney ticket refund policy...", "metadata": {"category": "Refund policy"}},
    {"id": "doc2", "text": "Annual pass holders receive discounts...", "metadata": {"category": "Member benefits"}},
]
# Generate embeddings
metadata_store = []  # position in this list doubles as the vector ID
vectors_list = []
for doc in documents:
    result = client.embeddings.create(
        model="text-embedding-v4",
        input=doc["text"],
        dimensions=1024,
        encoding_format="float"
    )
    vectors_list.append(result.data[0].embedding)
    metadata_store.append(doc)
vectors_np = np.array(vectors_list).astype('float32')
vector_ids_np = np.arange(len(vectors_list)).astype('int64')  # FAISS expects 64-bit IDs
# Build the FAISS index (IndexIDMap ties each vector to its ID, and through the ID to the metadata)
index = faiss.IndexIDMap(faiss.IndexFlatL2(1024))
index.add_with_ids(vectors_np, vector_ids_np)
# Search
query = client.embeddings.create(model="text-embedding-v4", input="Refund process", dimensions=1024, encoding_format="float")
query_vec = np.array([query.data[0].embedding]).astype('float32')
distances, ids = index.search(query_vec, k=3)  # IndexFlatL2 returns squared L2 distances
for i, doc_id in enumerate(ids[0]):
    if doc_id == -1:  # FAISS pads with -1 when the index holds fewer than k vectors
        continue
    print(f"Rank {i+1}: {metadata_store[doc_id]['text']} (distance: {distances[0][i]:.4f})")
FAISS does not store metadata; a "lookup table" (Redis/PostgreSQL/MongoDB) must be maintained externally to associate the original data with vector IDs.
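A minimal sketch of such a lookup table, using the standard-library `sqlite3` module as a stand-in for Redis, PostgreSQL, or MongoDB. The table layout, the function names `save_doc` and `load_doc`, and the sample record are all assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute(
    "CREATE TABLE docs (vector_id INTEGER PRIMARY KEY, doc_id TEXT, text TEXT, metadata TEXT)"
)

def save_doc(vector_id, doc):
    """Store the original record under the ID that was given to the vector."""
    conn.execute(
        "INSERT INTO docs VALUES (?, ?, ?, ?)",
        (int(vector_id), doc["id"], doc["text"], json.dumps(doc["metadata"])),
    )

def load_doc(vector_id):
    """Return the record for a vector ID, or None if the ID is unknown."""
    # int() also converts NumPy integers such as the IDs returned by a search
    row = conn.execute(
        "SELECT doc_id, text, metadata FROM docs WHERE vector_id = ?",
        (int(vector_id),),
    ).fetchone()
    if row is None:
        return None
    return {"id": row[0], "text": row[1], "metadata": json.loads(row[2])}

save_doc(0, {"id": "doc1", "text": "ticket refund policy", "metadata": {"category": "refund"}})
print(load_doc(0))
print(load_doc(99))
```

After a search, each returned vector ID is passed to `load_doc` to recover the original text and metadata.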
Key Points
- Embedding is a technique that transforms unstructured data (text, images) into high-dimensional vectors; semantically similar content is closer together in the vector space
- TF-IDF + N-Gram are traditional feature extraction methods that produce sparse matrices; neural network methods such as Word2Vec generate dense vectors with stronger semantic representation
- When selecting an embedding model, refer to the MTEB leaderboard and evaluate it on your own gold test set; do not rely solely on leaderboard rankings
- Vector databases provide LLMs with long-term memory and the ability to retrieve information from private knowledge bases; FAISS is suitable for research, while Milvus and Qdrant are suitable for production
- FAISS does not store metadata itself; `IndexIDMap` must be used to associate vector IDs with external metadata storage (Redis/DB).