Quickstart with Milvus Lite
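Vectors, the output data format of neural network models, can effectively encode information and play a pivotal role in AI applications such as knowledge bases, semantic search, and Retrieval Augmented Generation (RAG).
Milvus is an open-source vector database that suits AI applications of every size, from a demo chatbot in a Jupyter notebook to web-scale search that serves billions of users. In this guide, we will walk you through setting up Milvus locally within minutes and using the Python client library to generate, store, and search vectors.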
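Install Milvus
This guide uses Milvus Lite, a Python library included in pymilvus that can be embedded into the client application. Milvus also supports deployment on Docker and Kubernetes for production use cases.
Before starting, make sure Python 3.8+ is available in your local environment. Install pymilvus, which contains both the Python client library and Milvus Lite: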
$ pip install -U pymilvus
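If you are using Google Colab, you may need to restart the runtime for the newly installed dependencies to take effect (click the "Runtime" menu at the top of the screen and select "Restart session").
Set Up Vector Database
To create a local Milvus vector database, simply instantiate a MilvusClient and specify the name of a file to store all data, such as "milvus_demo.db".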
from pymilvus import MilvusClient
client = MilvusClient("milvus_demo.db")
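Create a Collection
In Milvus, we need a collection to store vectors and their associated metadata. You can think of it as a table in a traditional SQL database. When creating a collection, you can define schema and index parameters to configure vector specifications such as dimensionality, index type, and distance metric. For now, we focus on the basics and use defaults wherever possible. At minimum, you only need to set the collection name and the dimension of the vector field.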
if client.has_collection(collection_name="demo_collection"):
    client.drop_collection(collection_name="demo_collection")
client.create_collection(
    collection_name="demo_collection",
    dimension=768,  # The vectors used in this demo have 768 dimensions
)
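In the above setup:
- The primary key and vector fields use their default names ("id" and "vector").
- The metric type (vector distance definition) defaults to COSINE.
- The primary key field accepts integers and does not auto-increment (the auto-id feature is not enabled).
To define a custom schema instead, refer to the official documentation.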
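To give a feel for the COSINE metric type mentioned above, here is a minimal pure-Python sketch of cosine similarity. It is an illustration only, not Milvus's internal implementation, and the helper name cosine_similarity is hypothetical. With this metric, a larger value means the two vectors point in more similar directions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Three toy pairs: same direction, orthogonal, and opposite direction.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Because only the direction matters, scaling a vector (as in the first pair) does not change the score.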
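Prepare Data
In this guide, we use vectors to perform semantic search on text. We need to generate vectors for the text by downloading an embedding model, which is easily done with the utility functions from the pymilvus[model] library.
Represent Text with Vectors
First, install the model library. This package includes essential ML tools such as PyTorch. The download may take some time if PyTorch is not already installed in your local environment.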
$ pip install "pymilvus[model]"
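Generate vector embeddings with the default model. Milvus expects data to be inserted as a list of dictionaries, where each dictionary represents one data record, termed an entity.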
from pymilvus import model
# If the connection to https://huggingface.co/ fails, uncomment the following lines to use a mirror
# import os
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
# This will download a small embedding model, "paraphrase-albert-small-v2" (~50MB).
embedding_fn = model.DefaultEmbeddingFunction()
# Text strings to search from
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
vectors = embedding_fn.encode_documents(docs)
# The output vectors have 768 dimensions, matching the collection we just created
print("Dim:", embedding_fn.dim, vectors[0].shape)  # Dim: 768 (768,)
# Each entity has an id, a vector, the original text, and a subject label
# (the label is used later to demo metadata filtering)
data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
    for i in range(len(vectors))
]
print("Data has", len(data), "entities, each with fields: ", data[0].keys())
print("Vector dim:", len(data[0]["vector"]))
Dim: 768 (768,)
Data has 3 entities, each with fields: dict_keys(['id', 'vector', 'text', 'subject'])
Vector dim: 768
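Since the data is a plain list of dictionaries, it can be worth checking its shape before inserting. The sketch below is a hypothetical helper (validate_entities is not part of pymilvus) that assumes the default field names "id" and "vector" and uses tiny 3-dimensional vectors in place of the real 768-dimensional ones.

```python
def validate_entities(entities, dim):
    """Check that each entity has 'id' and 'vector' and that the vector has `dim` values.

    Returns the number of entities checked; raises ValueError on the first problem.
    """
    for position, entity in enumerate(entities):
        for field in ("id", "vector"):
            if field not in entity:
                raise ValueError(f"entity {position} is missing field {field!r}")
        actual_dim = len(entity["vector"])
        if actual_dim != dim:
            raise ValueError(
                f"entity {position} has a {actual_dim}-dimensional vector, expected {dim}"
            )
    return len(entities)


# Toy entities shaped like the tutorial's data, with short vectors for readability.
sample = [
    {"id": 0, "vector": [0.1, 0.2, 0.3], "text": "first", "subject": "history"},
    {"id": 1, "vector": [0.4, 0.5, 0.6], "text": "second", "subject": "history"},
]
checked = validate_entities(sample, dim=3)
print("validated", checked, "entities")
```

With the tutorial's data, the equivalent call would be validate_entities(data, dim=768).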
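[Optional] Use Fake Representation with Random Vectors
If you cannot download the model due to network issues, you can use random vectors to stand in for the text embeddings as a workaround. The example still runs, but the search results carry no semantic meaning.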
import random
# Text strings to search from
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
# Use fake representation with random vectors (768 dimensions)
vectors = [[random.uniform(-1, 1) for _ in range(768)] for _ in docs]
data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
    for i in range(len(vectors))
]
print("Data has", len(data), "entities, each with fields: ", data[0].keys())
print("Vector dim:", len(data[0]["vector"]))
Data has 3 entities, each with fields: dict_keys(['id', 'vector', 'text', 'subject'])
Vector dim: 768
Insert Data
Insert the data into the collection:
res = client.insert(collection_name="demo_collection", data=data)
print(res)
{'insert_count': 3, 'ids': [0, 1, 2], 'cost': 0}
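The response printed above reports an insert_count and the list of ids. A minimal sketch of a sanity check on such a response follows; it works on a plain dict copied from that output rather than on a live client, and check_insert_result is a hypothetical helper, not a pymilvus function.

```python
def check_insert_result(res, entities):
    """Return True when the response count and ids match the entities that were sent."""
    expected_ids = [entity["id"] for entity in entities]
    return res["insert_count"] == len(entities) and list(res["ids"]) == expected_ids


# Stand-ins shaped like the tutorial's data and the response printed above.
entities = [{"id": 0}, {"id": 1}, {"id": 2}]
res = {"insert_count": 3, "ids": [0, 1, 2], "cost": 0}
ok = check_insert_result(res, entities)
print("insert looks complete:", ok)
```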
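Semantic Search
Now we can run a vector similarity search on Milvus by converting the query text into a vector.
Vector Search
Milvus accepts one or multiple vector search requests at the same time. The query_vectors variable is a list of vectors, where each vector is an array of floats.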
query_vectors = embedding_fn.encode_queries(["Who is Alan Turing?"])
# If you don't have the embedding function, use a random vector to finish the demo:
# query_vectors = [[random.uniform(-1, 1) for _ in range(768)]]
res = client.search(
    collection_name="demo_collection",  # target collection
    data=query_vectors,  # query vectors
    limit=2,  # number of returned entities
    output_fields=["text", "subject"],  # fields to return
)
print(res)
data: ["[{'id': 2, 'distance': 0.5859944820404053, 'entity': {'text': 'Born in Maida Vale, London, Turing was raised in southern England.', 'subject': 'history'}}, {'id': 1, 'distance': 0.5118255615234375, 'entity': {'text': 'Alan Turing was the first person to conduct substantial research in AI.', 'subject': 'history'}}]"] , extra_info: {'cost': 0}
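The output is a list of results, one per query vector. Each result is itself a list of hits, and each hit contains the entity's primary key, its distance to the query vector, and the fields specified in output_fields.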
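As a sketch of how this nested structure can be consumed, the snippet below walks a hand-written stand-in that copies the values from the printed output above. It assumes the real result can be iterated like a list of lists of dictionaries; summarize_hits is a hypothetical helper.

```python
# Stand-in shaped like the printed result: one inner list of hits per query vector.
sample_res = [
    [
        {
            "id": 2,
            "distance": 0.5859944820404053,
            "entity": {
                "text": "Born in Maida Vale, London, Turing was raised in southern England.",
                "subject": "history",
            },
        },
        {
            "id": 1,
            "distance": 0.5118255615234375,
            "entity": {
                "text": "Alan Turing was the first person to conduct substantial research in AI.",
                "subject": "history",
            },
        },
    ]
]


def summarize_hits(res):
    """Flatten results into (query_index, id, distance, text) tuples."""
    rows = []
    for query_index, hits in enumerate(res):
        for hit in hits:
            rows.append(
                (query_index, hit["id"], hit["distance"], hit["entity"]["text"])
            )
    return rows


rows = summarize_hits(sample_res)
for query_index, entity_id, distance, text in rows:
    print(f"query {query_index}: id={entity_id} distance={distance:.4f} text={text}")
```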
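Vector Search with Metadata Filtering
You can also run a vector search while taking metadata values into account (called "scalar fields" in Milvus). This is done with a filter expression that specifies the criteria. The following example filters on the subject field: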
# Insert more docs on the biology subject
docs = [
    "Machine learning has been used for drug design.",
    "Computational synthesis with AI algorithms predicts molecular properties.",
    "DDR1 is involved in cancers and fibrosis.",
]
vectors = embedding_fn.encode_documents(docs)
data = [
    {"id": 3 + i, "vector": vectors[i], "text": docs[i], "subject": "biology"}
    for i in range(len(vectors))
]
client.insert(collection_name="demo_collection", data=data)
# Return only the text whose subject is biology
res = client.search(
    collection_name="demo_collection",
    data=embedding_fn.encode_queries(["tell me AI related information"]),
    filter="subject == 'biology'",
    limit=2,
    output_fields=["text", "subject"],
)
print(res)
data: ["[{'id': 4, 'distance': 0.27030569314956665, 'entity': {'text': 'Computational synthesis with AI algorithms predicts molecular properties.', 'subject': 'biology'}}, {'id': 3, 'distance': 0.16425910592079163, 'entity': {'text': 'Machine learning has been used for drug design.', 'subject': 'biology'}}]"] , extra_info: {'cost': 0}
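By default, scalar fields are not indexed. If you need to run metadata-filtered search on a large dataset, consider using a fixed schema and turning on indexes to improve search performance.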
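Conceptually, a filtered search keeps only the entities that satisfy the filter and then ranks the survivors by similarity. The brute-force sketch below illustrates that idea on toy 2-dimensional data. It is not how Milvus executes the search internally, and the names cosine and filtered_search are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def filtered_search(entities, query, subject, limit):
    """Keep entities with the given subject, then return the top `limit` by cosine similarity."""
    candidates = [e for e in entities if e["subject"] == subject]
    scored = [
        {"id": e["id"], "distance": cosine(e["vector"], query)} for e in candidates
    ]
    scored.sort(key=lambda hit: hit["distance"], reverse=True)
    return scored[:limit]


# Toy collection: id 0 is the closest vector overall but has the wrong subject.
toy = [
    {"id": 0, "vector": [1.0, 0.0], "subject": "history"},
    {"id": 3, "vector": [0.9, 0.1], "subject": "biology"},
    {"id": 4, "vector": [0.0, 1.0], "subject": "biology"},
    {"id": 5, "vector": [0.7, 0.7], "subject": "biology"},
]
hits = filtered_search(toy, query=[1.0, 0.0], subject="biology", limit=2)
print([hit["id"] for hit in hits])
```

A full scan like this costs time proportional to the collection size, which is one reason indexing matters on large datasets.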
Besides vector search, you can also perform other types of queries:
Query
query() retrieves all entities that match a criterion, such as a filter expression or a set of ids.
For example, retrieve all entities whose subject is history:
res = client.query(
    collection_name="demo_collection",
    filter="subject == 'history'",
    output_fields=["text", "subject"],
)
Retrieve entities directly by primary key:
res = client.query(
    collection_name="demo_collection",
    ids=[0, 2],
    output_fields=["vector", "text", "subject"],
)
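Filter expressions are plain strings, so they can be assembled programmatically. The sketch below builds an `in` expression for matching several values at once. The helper in_filter is hypothetical, and it assumes that JSON-style quoting of string values is acceptable in a Milvus filter expression; check the filter expression documentation before relying on it for values that contain quotes or backslashes.

```python
import json


def in_filter(field, values):
    """Build an expression such as: subject in ["history", "biology"]."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    # json.dumps renders strings with double quotes and numbers as bare literals.
    return f"{field} in {json.dumps(values, ensure_ascii=False)}"


expr = in_filter("subject", ["history", "biology"])
print(expr)
```

The resulting string could then be passed as the filter argument of the query or search calls shown above.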
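Delete Entities
To purge data, you can delete entities by primary key or delete all entities that match a filter expression.
# Delete entities by primary key
res = client.delete(collection_name="demo_collection", ids=[0, 2])
print(res)
# Delete entities by a filter expression
res = client.delete(
    collection_name="demo_collection",
    filter="subject == 'biology'",
)
print(res)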
[0, 2]
[3, 4, 5]
Load Existing Data
Since all data of Milvus Lite is stored in a local file, you can load all data into memory even after the program terminates by creating a MilvusClient with the existing file. For example, this will recover the collections from the "milvus_demo.db" file and continue to write data into it.
from pymilvus import MilvusClient
client = MilvusClient("milvus_demo.db")
Drop the collection
If you would like to delete all the data in a collection, you can drop the collection:
# Drop collection
client.drop_collection(collection_name="demo_collection")
Learn More
Milvus Lite is great for getting started with a local Python program. If you have large-scale data or would like to use Milvus in production, you can learn about deploying Milvus on Docker and Kubernetes. All deployment modes of Milvus share the same API, so your client-side code doesn't need to change much when moving to another deployment mode. Simply specify the URI and token of a Milvus server deployed anywhere:
client = MilvusClient(uri="http://localhost:19530", token="root:Milvus")
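One way to keep the same script working across deployment modes is to read the connection settings from the environment and fall back to the local file. This is a minimal sketch: the environment variable names MILVUS_URI and MILVUS_TOKEN and the helper milvus_client_kwargs are assumptions made for illustration, not conventions defined by Milvus.

```python
import os


def milvus_client_kwargs(env=None, default_path="milvus_demo.db"):
    """Return keyword arguments for MilvusClient based on environment settings.

    Falls back to a local Milvus Lite file when no server URI is configured.
    """
    env = os.environ if env is None else env
    uri = env.get("MILVUS_URI")  # hypothetical variable name
    if not uri:
        return {"uri": default_path}
    kwargs = {"uri": uri}
    token = env.get("MILVUS_TOKEN")  # hypothetical variable name
    if token:
        kwargs["token"] = token
    return kwargs


local_kwargs = milvus_client_kwargs(env={})
server_kwargs = milvus_client_kwargs(
    env={"MILVUS_URI": "http://localhost:19530", "MILVUS_TOKEN": "root:Milvus"}
)
print(local_kwargs)
print(server_kwargs)
# With pymilvus installed: client = MilvusClient(**milvus_client_kwargs())
```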
To migrate data from Milvus Lite to Milvus deployed on Docker or Kubernetes, refer to Migrating data from Milvus Lite.
Milvus provides REST and gRPC APIs, with client libraries in languages such as Python, Java, Go, C#, and Node.js.