01 Embeddings and Vector Databases

What Is Embedding

Embedding is a dimensionality reduction technique that transforms features (text, images, and so on) into fixed-dimension vectors. It maps discrete data into a dense vector space that is low-dimensional relative to sparse encodings and that captures semantic information.

Core Concepts

Dimensionality Reduction: Converting high-dimensional, sparse discrete variables (such as one-hot encodings) into dense vectors of a fixed size.
Semantic Similarity: In the vector space, semantically similar objects lie closer together, as measured by a metric such as cosine similarity.
Computability: Vectors support arithmetic operations, which enables simple forms of semantic reasoning.

Calculation of Cosine Similarity

Cosine similarity is used to measure the similarity in direction between two vectors, with a range of [-1, 1].

1 indicates that the directions are exactly the same (high similarity)
0 indicates orthogonal directions (no correlation)
-1 indicates completely opposite directions (highly dissimilar)
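
The three cases above follow from the definition cos(θ) = (a · b) / (‖a‖ ‖b‖). The snippet below is a minimal sketch in plain Python; the function name cosine_similarity and the sample vectors are illustrative assumptions, not part of the original notes.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction, different length
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(same, orthogonal, opposite)
```

Because the result depends only on direction, scaling a vector (as in the first call) does not change its similarity to another vector.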

N-Gram

N-gram is a simple text feature extraction method based on the assumption that the occurrence of the nth word is related only to the preceding n-1 words.

Unigram (N=1): A single word
Bigram (N=2): A combination of two consecutive words
Trigram (N=3): A combination of three consecutive words
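
To make the three cases concrete, here is a minimal sketch that slides a window of size n over a token list. The helper name ngrams and the sample sentence are assumptions made for illustration; whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 windows of size n
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the hotel is near the lake".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
```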

Code Example: N-Gram Word Frequency Count

Python

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")

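def get_top_n_words(corpus, n=1, k=None):
    """Return the k most frequent n-grams in corpus as (ngram, count) pairs."""
    # Count n-grams of exactly length n, ignoring English stop words
    vec = CountVectorizer(ngram_range=(n, n), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each vocabulary column over all documents
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:k]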

common_words_bigram = get_top_n_words(df['desc'], 2, 20)
df_bigram = pd.DataFrame(common_words_bigram, columns=['desc', 'count'])
df_bigram.groupby('desc').sum()['count'].sort_values(ascending=True).plot(
    kind='barh', title='Top20 Bigram'
)
plt.show()

TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) evaluates the importance of a term in a specific document within a document collection.

TF (Term Frequency): (number of times a word appears in a document) / (total number of words in the document)
IDF (Inverse Document Frequency): log(total number of documents / (number of documents containing the word + 1))
The TF-IDF score of a word in a document is the product TF × IDF.
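
The two formulas can be worked through by hand. The sketch below applies them literally to a tiny corpus invented for illustration; the function names tf, idf, and tf_idf are assumptions. Library implementations such as scikit-learn's TfidfVectorizer use a differently smoothed IDF plus normalization, so their numbers will not match these.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency with +1 in the denominator, as in the formula above."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (df + 1))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical four-document corpus, already tokenized
corpus = [
    "quiet hotel near the lake".split(),
    "hotel with free parking".split(),
    "budget hostel in the city center".split(),
    "airport shuttle and free breakfast".split(),
]
score_lake = tf_idf("lake", corpus[0], corpus)    # appears in 1 of 4 documents
score_hotel = tf_idf("hotel", corpus[0], corpus)  # appears in 2 of 4 documents
print(score_lake, score_hotel)
```

Both words occur once in the first document, so their TF is equal; "lake" scores higher only because it is rarer across the corpus.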

Code Example: TF-IDF Feature Extraction and Cosine Similarity

Python

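import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Assumes df (loaded above) has a 'name' column and a cleaned-text column 'desc_clean'
# derived from 'desc'; the cleaning step is not shown here.
# min_df=1 keeps every term (recent scikit-learn versions reject an integer min_df of 0).
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df['desc_clean'])
# TfidfVectorizer L2-normalizes rows by default, so the dot product equals cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# Map hotel name -> row position (assumes hotel names are unique)
indices = pd.Series(range(len(df)), index=df['name'])

def recommendations(name, cosine_similarities=cosine_similarities):
    """Return the names of the 10 hotels most similar to the given hotel."""
    idx = indices[name]
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)
    top_10_indexes = list(score_series.iloc[1:11].index)  # position 0 is the hotel itself
    return [df['name'].iloc[i] for i in top_10_indexes]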

Word Embedding

Word embedding maps words to a low-dimensional, dense, continuous vector space, where semantically similar words are closer together in the vector space.

Core Features

Dimension Reduction: Converts high-dimensional one-hot vectors into embedding vectors of a fixed dimension
Capturing semantic information: Vectors capture the semantic and syntactic information of words
Computability: For example, king - man + woman ≈ queen
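
The analogy arithmetic can be illustrated with hand-made vectors. Everything in the sketch below is a toy assumption: the 2-D vectors (axes loosely meaning "royalty" and "gender") are invented, whereas real word embeddings are learned from data and have far more dimensions.

```python
import math

# Invented 2-D toy vectors; not the output of any trained model
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Excluding the input words from the candidates is the usual convention, since the result vector often stays closest to one of them.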

Word2Vec

Word2Vec has two main modes:

Skip-Gram: Given a center word, predict its surrounding context words
CBOW (Continuous Bag-of-Words): Given the surrounding context words, predict the center word
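
The difference between the two modes shows up in how training examples are framed. The sketch below only builds those examples for a fixed window; the function names are assumptions, and it omits the rest of Word2Vec training (the network itself, negative sampling, subsampling).

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: the center word is the input, each context word a target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_pairs(tokens, window):
    """(context_words, center) pairs: the context words are the input, the center the target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

tokens = ["the", "cat", "sat", "down"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)
print(sg)
print(cb)
```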

Training Word2Vec using Gensim

Python

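import multiprocessing
import os

from gensim.models import word2vec

# Read every file under ./segment; each line is one pre-tokenized, space-separated sentence
sentences = word2vec.PathLineSentences('./segment')
model = word2vec.Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # maximum distance between the center word and a context word
    min_count=5,      # ignore words that occur fewer than 5 times
    workers=multiprocessing.cpu_count()
)

# Cosine similarity between two words (Sun Wukong and Zhu Bajie); both must be in the vocabulary
print(model.wv.similarity('孙悟空', '猪八戒'))
os.makedirs('./models', exist_ok=True)  # save() does not create missing directories
model.save('./models/word2Vec.model')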

Choosing an Embedding Model

MTEB Leaderboard

MTEB (Massive Text Embedding Benchmark) is a comprehensive evaluation benchmark for text embedding models, covering 8 major task categories and 58 datasets, including classification, clustering, retrieval, re-ranking, STS, and other task types.

The Impact of Vector Dimension on Model Performance

High Dimensions (1024, 4096): Can encode richer semantic information, but with higher computational and storage costs
Low Dimensions (256, 512): Faster computation and lower memory requirements, suitable for resource-constrained scenarios
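
The storage side of this trade-off is simple arithmetic: a flat store of float32 vectors needs number of vectors × dimension × 4 bytes. The sketch below assumes one million vectors and ignores index overhead and metadata; the function name index_size_bytes is an illustrative assumption.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of a flat vector store (float32 by default), ignoring index overhead."""
    return num_vectors * dim * bytes_per_value

for dim in (256, 512, 1024, 4096):
    gib = index_size_bytes(1_000_000, dim) / 2**30
    print(f"{dim:>4} dims: {gib:.2f} GiB")
```

Cost grows linearly with dimension, so moving from 256 to 4096 dimensions multiplies raw storage by 16.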

Matryoshka Representation Learning (MRL)

Jina embedding models use MRL to produce nested vectors with "Matryoshka doll" characteristics:

The model generates the full high-dimensional vector internally (e.g., 2048 dimensions)
Prefix sub-vectors (e.g., the first 128, 256, or 512 dimensions) can be used on their own
The output dimension can be specified on demand via the embedding_size parameter
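
On the client side, using a prefix sub-vector amounts to slicing and renormalizing. The sketch below is a generic illustration, not the Jina API: the function name truncate_embedding and the 4-dimensional stand-in vector are assumptions, and truncation like this is only meaningful for models that were trained with MRL.

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]

full = [0.6, 0.0, 0.8, 0.0]  # stand-in for a full-size model output
short = truncate_embedding(full, 2)
print(short)
```

Renormalizing matters because a prefix of a unit vector is generally shorter than unit length, which would distort dot-product scores.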

Model Selection Recommendations

Scenario	Recommendation
Chinese-only scenario	Monolingual models such as BGE-large-zh
Multilingual or cross-language retrieval	m3e-base, multilingual-e5-large
Pursuing peak performance	Consult the MTEB leaderboard, then build your own gold-standard test set for evaluation
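
Evaluating on a gold-standard test set can be as simple as computing recall@k per candidate model. The sketch below is one possible setup under stated assumptions: the gold labels, document ids, and the two models' ranked outputs are all invented, and recall@k is just one reasonable metric choice.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear among the top-k retrieved ids."""
    if not relevant:
        raise ValueError("each gold query needs at least one relevant id")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall_at_k(results, gold, k):
    """Average recall@k over all gold queries."""
    return sum(recall_at_k(results[q], gold[q], k) for q in gold) / len(gold)

# Hypothetical gold set: query id -> ids of documents judged relevant by a human
gold = {"q1": ["d1", "d4"], "q2": ["d2"]}
# Hypothetical ranked retrieval results from two candidate embedding models
model_a = {"q1": ["d1", "d4", "d3"], "q2": ["d5", "d2", "d1"]}
model_b = {"q1": ["d3", "d1", "d5"], "q2": ["d5", "d1", "d2"]}

score_a = mean_recall_at_k(model_a, gold, k=2)
score_b = mean_recall_at_k(model_b, gold, k=2)
print(score_a, score_b)
```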

Vector Databases

Vector databases store and query high-dimensional vector embeddings derived from unstructured data, where the distance between vectors represents the semantic similarity of the original data.
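
At its core, the query operation is a nearest-neighbor search. The sketch below shows the exact, brute-force version in plain Python, scanning every stored vector and ranking by squared L2 distance; the function names and the 2-D sample vectors are assumptions. Real vector databases typically add approximate indexes so that they do not have to scan everything.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(query, vectors, k):
    """Exact k-nearest-neighbor search: returns (distance, id) pairs, closest first."""
    scored = ((squared_l2(query, vec), vec_id) for vec_id, vec in vectors.items())
    return heapq.nsmallest(k, scored)

# Hypothetical stored vectors keyed by id
vectors = {
    0: [0.0, 0.0],
    1: [1.0, 1.0],
    2: [5.0, 5.0],
}
hits = search([0.9, 1.2], vectors, k=2)
print(hits)
```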

Core Value

Provides long-term memory for large models, addressing the limitations of context windows
Enables semantic search within private knowledge bases
Powers applications such as recommendation systems and image search

Comparison of Common Vector Databases

Database	Features	Use Cases
FAISS	Developed by Meta AI, pure algorithm library, GPU-accelerated	Research scenarios, requires deep integration
Milvus	Open-source, cloud-native, highly scalable	Enterprise-grade, massive-scale data
Pinecone	Fully managed, serverless	Quick deployment, low maintenance
Weaviate	Built-in vectorization module	Simplified ETL Process
Qdrant	Built with Rust, complex filtering	Extreme performance requirements
Elasticsearch	Hybrid search	Text + vector hybrid scenarios

FAISS Use Case: Embedding and Metadata Import

Python

import os

import numpy as np
import faiss
from openai import OpenAI

# DashScope's OpenAI-compatible endpoint; the API key is read from the environment
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

documents = [
    {"id": "doc1", "text": "Disney ticket refund policy...", "metadata": {"category": "Refund policy"}},
    {"id": "doc2", "text": "Annual pass holders receive discounts...", "metadata": {"category": "Member benefits"}},
]

# Generate one embedding per document; the list position doubles as the vector ID
metadata_store = []
vectors_list = []
for doc in documents:
    result = client.embeddings.create(
        model="text-embedding-v4",
        input=doc["text"],
        dimensions=1024,
        encoding_format="float"
    )
    vectors_list.append(result.data[0].embedding)
    metadata_store.append(doc)

vectors_np = np.array(vectors_list).astype('float32')
vector_ids_np = np.arange(len(vectors_list), dtype='int64')  # FAISS expects int64 IDs

# Build the FAISS index (IndexIDMap attaches our own IDs, which key into metadata_store)
index = faiss.IndexIDMap(faiss.IndexFlatL2(1024))
index.add_with_ids(vectors_np, vector_ids_np)

# Search
query = client.embeddings.create(model="text-embedding-v4", input="Refund process", dimensions=1024, encoding_format="float")
query_vec = np.array([query.data[0].embedding]).astype('float32')
k = min(3, index.ntotal)  # requesting more neighbors than stored vectors yields -1 IDs
distances, ids = index.search(query_vec, k)

for rank, doc_id in enumerate(ids[0], start=1):
    print(f"Rank {rank}: {metadata_store[doc_id]['text']} (distance: {distances[0][rank - 1]:.4f})")

FAISS does not store metadata; a "lookup table" (Redis/PostgreSQL/MongoDB) must be maintained externally to associate the original data with vector IDs.
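
One possible shape for such a lookup table is sketched below. It uses Python's built-in sqlite3 as a stand-in for Redis, PostgreSQL, or MongoDB; the table name doc_meta, its columns, and the helper functions are assumptions for illustration. The ids passed to load_metadata play the role of the ID array returned by index.search.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for an external metadata store
conn.execute(
    "CREATE TABLE doc_meta (vector_id INTEGER PRIMARY KEY, doc_id TEXT, text TEXT, metadata TEXT)"
)

def save_metadata(conn, vector_id, doc):
    """Store one document under the same integer ID that was given to the vector index."""
    conn.execute(
        "INSERT INTO doc_meta VALUES (?, ?, ?, ?)",
        (vector_id, doc["id"], doc["text"], json.dumps(doc["metadata"])),
    )

def load_metadata(conn, vector_ids):
    """Resolve search-result IDs to documents, skipping the -1 placeholders FAISS may return."""
    docs = []
    for vid in vector_ids:
        if vid < 0:
            continue
        row = conn.execute(
            "SELECT doc_id, text, metadata FROM doc_meta WHERE vector_id = ?", (int(vid),)
        ).fetchone()
        if row is not None:
            docs.append({"id": row[0], "text": row[1], "metadata": json.loads(row[2])})
    return docs

# Hypothetical documents, mirroring the structure used in the example above
sample_docs = [
    {"id": "doc1", "text": "refund policy text", "metadata": {"category": "refunds"}},
    {"id": "doc2", "text": "member benefits text", "metadata": {"category": "membership"}},
]
for vector_id, doc in enumerate(sample_docs):
    save_metadata(conn, vector_id, doc)

found = load_metadata(conn, [1, 0, -1])
print(found)
```

Keeping the vector ID as the primary key means the index and the metadata store stay in sync as long as both are written with the same ID.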

Key Points

Embedding transforms unstructured data (text, images) into dense, fixed-dimension vectors; semantically similar content lies closer together in the vector space
TF-IDF and N-Gram are traditional feature extraction methods that produce sparse matrices; neural network methods such as Word2Vec generate dense vectors with stronger semantic representation
When selecting an embedding model, refer to the MTEB leaderboard and evaluate it on your own gold test set; do not rely solely on leaderboard rankings
Vector databases provide LLMs with long-term memory and the ability to retrieve information from private knowledge bases; FAISS is suitable for research, while Milvus and Qdrant are suitable for production
FAISS does not store metadata itself; use IndexIDMap to assign vector IDs and keep the metadata in external storage (Redis or a database) keyed by those IDs
