Building RAG with Milvus, vLLM, and Llama 3.1

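In July 2024, the University of California, Berkeley donated vLLM, a fast and easy-to-use library for LLM inference and serving, to the LF AI & Data Foundation as an incubation-stage project. As a fellow member project, we welcome vLLM to the LF AI & Data family! 🎉

Large language models (LLMs) and vector databases are usually paired to build Retrieval Augmented Generation (RAG), a popular AI application architecture for addressing AI hallucinations. This blog will show you how to build and run a RAG with Milvus, vLLM, and Llama 3.1. More specifically, I will show you how to embed and store text information as vector embeddings in Milvus and use this vector store as a knowledge base to efficiently retrieve text chunks relevant to user questions. Finally, we'll leverage vLLM to serve Meta's Llama 3.1-8B model and generate answers augmented by the retrieved text. Let's dive in!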

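Introduction to Milvus, vLLM, and Meta's Llama 3.1

Milvus vector database

Milvus is an open-source, purpose-built, distributed vector database for storing, indexing, and searching vectors for Generative AI (GenAI) workloads. Its ability to perform hybrid search, metadata filtering, and reranking, and to efficiently handle trillions of vectors, makes Milvus a go-to choice for AI and machine learning workloads. Milvus can be run locally, on a cluster, or hosted in the fully managed Zilliz Cloud.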

vLLM

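vLLM is an open-source project started at UC Berkeley SkyLab that focuses on optimizing LLM serving performance. It combines efficient memory management through PagedAttention, continuous batching, and optimized CUDA kernels. Compared to traditional methods, vLLM improves serving performance by up to 24x while cutting GPU memory usage in half.

According to the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention", the KV cache uses around 30% of GPU memory, leading to potential memory issues. The KV cache is stored in contiguous memory, but its size changes as sequences grow, which can cause memory fragmentation and wastes memory that could otherwise serve more requests.

Figure 1. KV cache memory management in existing systems (2023 Paged Attention paper)

By using virtual memory for the KV cache, vLLM allocates physical GPU memory only as needed, eliminating memory fragmentation and avoiding pre-allocation. In tests, vLLM outperformed HuggingFace Transformers (HF) and Text Generation Inference (TGI), achieving up to 24x higher throughput than HF and up to 3.5x higher throughput than TGI on NVIDIA A10G and A100 GPUs.

Figure 2. Serving throughput when each request asks for three parallel output completions. vLLM achieves 8.5x to 15x higher throughput than HF and 3.3x to 3.5x higher throughput than TGI (2023 vLLM blog).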
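To make the idea of on-demand KV cache allocation more concrete, here is a toy sketch of a paged allocator. It is not vLLM's implementation: the class name `PagedKVAllocator`, the block size of 16 tokens, and the pool of 128 blocks are all assumptions I picked for illustration. It only shows why handing out small blocks as a sequence grows uses less memory than reserving space for the maximum length up front.

```python
import math


class PagedKVAllocator:
    """Toy block allocator: hands out fixed-size blocks only as sequences grow."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size             # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                   # seq_id -> list of physical block ids
        self.seq_lens = {}                       # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Record one more token; grab a new block only when the last one is full."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:        # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


def blocks_if_preallocated(max_len: int, block_size: int) -> int:
    """Blocks a contiguous scheme would reserve up front for max_len tokens."""
    return math.ceil(max_len / block_size)


allocator = PagedKVAllocator(num_blocks=128, block_size=16)
for _ in range(40):                              # a sequence that stops after 40 tokens
    allocator.append_token("request-1")

used = len(allocator.block_tables["request-1"])
reserved = blocks_if_preallocated(max_len=1000, block_size=16)
print(f"paged: {used} blocks, preallocated: {reserved} blocks")
```

In this toy run the 40-token sequence holds 3 blocks, while reserving room for 1,000 tokens in advance would tie up 63 blocks for the same request.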

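Meta's Llama 3.1

Meta's Llama 3.1 was announced on July 23, 2024. The 405B model delivers state-of-the-art performance on several public benchmarks, has a context window of 128,000 input tokens, and is licensed for a range of commercial uses. Alongside the 405-billion-parameter model, Meta released updated versions of the Llama3 70B (70 billion parameters) and 8B (8 billion parameters) models. Model weights are available for download on Meta's website.

A key insight was that fine-tuning on generated data can boost performance, but poor-quality examples can degrade it. The Llama team worked extensively to identify and remove these bad examples using the model itself, auxiliary models, and other tools.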

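Build and Perform the RAG-Retrieval with Milvus

Prepare your dataset

I used the official Milvus documentation as the dataset for this demo, which I downloaded and saved locally.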

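import pprint
from langchain.document_loaders import DirectoryLoader
# Note: newer LangChain releases expose this loader from langchain_community.document_loaders

# Load HTML files already saved in a local directory
path = "../../RAG/rtdocs_new/"
global_pattern = '*.html'
loader = DirectoryLoader(path=path, glob=global_pattern)
docs = loader.load()


# Print the number of documents and a preview
print(f"loaded {len(docs)} documents")
print(docs[0].page_content)
pprint.pprint(docs[0].metadata)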

loaded 22 documents
Why Milvus Docs Tutorials Tools Blog Community Stars0 Try Managed Milvus FREE Search Home v2.4.x About ...
{'source': 'https://milvus.io/docs/quickstart.md'}

Download an embedding model

Next, download a free, open-source embedding model from HuggingFace.

import torch
from sentence_transformers import SentenceTransformer


# Initialize torch settings for device-agnostic code
N_GPU = torch.cuda.device_count()
# Use the first GPU if one is available, otherwise fall back to the CPU
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


# Download the model from the HuggingFace model hub
model_name = "BAAI/bge-large-en-v1.5"
encoder = SentenceTransformer(model_name, device=DEVICE)


# Get the model parameters and save them for later
EMBEDDING_DIM = encoder.get_sentence_embedding_dimension()
MAX_SEQ_LENGTH_IN_TOKENS = encoder.get_max_seq_length()


# Inspect the model parameters
print(f"model_name: {model_name}")
print(f"EMBEDDING_DIM: {EMBEDDING_DIM}")
print(f"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH_IN_TOKENS}")

model_name: BAAI/bge-large-en-v1.5
EMBEDDING_DIM: 1024
MAX_SEQ_LENGTH: 512

Chunk and encode your custom data as vectors

I'll use a fixed chunk length of 512 characters with 10% overlap.

import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter


CHUNK_SIZE = 512
chunk_overlap = np.round(CHUNK_SIZE * 0.10, 0)
print(f"chunk_size: {CHUNK_SIZE}, chunk_overlap: {chunk_overlap}")


# Define the splitter
child_splitter = RecursiveCharacterTextSplitter(
   chunk_size=CHUNK_SIZE,
   chunk_overlap=chunk_overlap)


# Chunk the documents
chunks = child_splitter.split_documents(docs)
print(f"{len(docs)} docs split into {len(chunks)} child documents.")


# The encoder input is doc.page_content as strings
list_of_strings = [doc.page_content for doc in chunks if hasattr(doc, 'page_content')]


# Run embedding inference with the HuggingFace encoder (returns a NumPy array)
embeddings = np.asarray(encoder.encode(list_of_strings), dtype=np.float32)


# Normalize each embedding (row) to unit length
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)


# Milvus expects a list of `numpy.ndarray` of `numpy.float32` numbers
converted_values = list(map(np.float32, embeddings))


# Create the dict_list for Milvus insertion
dict_list = []
for chunk, vector in zip(chunks, converted_values):
   # Assemble the embedding vector, the original text chunk, and metadata
   chunk_dict = {
       'chunk': chunk.page_content,
       'source': chunk.metadata.get('source', ""),
       'vector': vector,
   }
   dict_list.append(chunk_dict)

chunk_size: 512, chunk_overlap: 51.0
22 docs split into 355 child documents.
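To make the chunk size and overlap numbers more tangible, here is a simplified sliding-window chunker in plain Python. It is only a sketch: `sliding_window_chunks` is a hypothetical helper of mine, and it cuts strictly by character count, whereas `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, so its chunks will differ.

```python
def sliding_window_chunks(text: str, chunk_size: int = 512, overlap_ratio: float = 0.10):
    """Split text into fixed-size character windows that overlap by overlap_ratio."""
    overlap = int(round(chunk_size * overlap_ratio))   # 512 * 0.10 -> 51 characters
    step = chunk_size - overlap                        # how far each window advances
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):            # last window reached the end
            break
    return chunks


# A made-up 1,200-character string standing in for a document
sample = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

Each window starts 461 characters after the previous one, so the last 51 characters of one chunk are repeated at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.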

Save the vectors in Milvus

Ingest the encoded vector embeddings into the Milvus vector database.

import time

# Connect a client to the Milvus Lite server
from pymilvus import MilvusClient
mc = MilvusClient("milvus_demo.db")


# Create a collection with a flexible schema and AUTOINDEX
COLLECTION_NAME = "MilvusDocs"
mc.create_collection(COLLECTION_NAME,
       EMBEDDING_DIM,
       consistency_level="Eventually",
       auto_id=True, 
       overwrite=True)


# Insert the data into the Milvus collection
print("Start inserting entities")
start_time = time.time()
mc.insert(
   COLLECTION_NAME,
   data=dict_list,
   progress_bar=True)


end_time = time.time()
print(f"Milvus insert time for {len(dict_list)} vectors: ", end="")
print(f"{round(end_time - start_time, 2)} seconds")

Start inserting entities
Milvus insert time for 355 vectors: 0.2 seconds

Perform a vector search

Ask a question and search for the nearest-neighbor chunks in your knowledge base in Milvus.

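import numpy as np
import torch
import torch.nn.functional as F

SAMPLE_QUESTION = "What do the parameters for HNSW mean?"


# Embed the question using the same encoder.
# Passing a one-element list yields a 2-D result of shape (1, EMBEDDING_DIM).
query_embeddings = torch.tensor(encoder.encode([SAMPLE_QUESTION]))
# Normalize the embeddings to unit length
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
# Convert the embeddings to a list of np.float32 arrays
query_embeddings = list(map(np.float32, query_embeddings.numpy()))


# Define the metadata fields you can filter on
OUTPUT_FIELDS = list(dict_list[0].keys())
OUTPUT_FIELDS.remove('vector')


# Define how many top-k results you want to retrieve
TOP_K = 2


# Run a semantic vector search using your query and the vector database
results = mc.search(
    COLLECTION_NAME,
    data=query_embeddings,
    output_fields=OUTPUT_FIELDS,
    limit=TOP_K,
    consistency_level="Eventually")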

The retrieved results are shown below.

Retrieved result #1
distance = 0.7001987099647522
('Chunk text: layer, finds the node closest to the target in this layer, and'
...
'outgoing')
source: https://milvus.io/docs/index.md

Retrieved result #2
distance = 0.6953287124633789
('Chunk text: this value can improve recall rate at the cost of increased'
...
'to the target')
source: https://milvus.io/docs/index.md
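If you are wondering what the `distance` values above represent, the following sketch ranks a few vectors by cosine similarity using brute force. It is only an illustration under assumptions: I am assuming the collection scores hits by cosine similarity (higher means closer), the helper names are mine, and a real Milvus index finds neighbors approximately rather than comparing against every vector.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_by_cosine(query, vectors, k=2):
    """Return (index, score) pairs for the k vectors most similar to query."""
    q = l2_normalize(query)
    scores = []
    for idx, vec in enumerate(vectors):
        v = l2_normalize(vec)
        scores.append((idx, sum(a * b for a, b in zip(q, v))))  # dot of unit vectors
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Tiny made-up 3-dimensional "embeddings"; the real ones here have 1024 dimensions.
toy_vectors = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
toy_query = [2.0, 0.0, 0.0]
hits = top_k_by_cosine(toy_query, toy_vectors, k=2)
print(hits)
```

Because both sides are scaled to unit length first, the length of the query vector does not affect the ranking; only its direction does.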

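Build and Perform the RAG-Generation with vLLM and Llama 3.1-8B

Install vLLM and models from HuggingFace

vLLM downloads large language models from HuggingFace by default. In general, anytime you want to use a brand-new model on HuggingFace, you should run pip install --upgrade or -U. Also, you'll need a GPU to run inference of Meta's Llama 3.1 models with vLLM.

For a full list of all vLLM-supported models, see this documentation page.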

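# (Recommended) Create a new conda environment (run in a terminal)
conda create -n myenv python=3.11 -y
conda activate myenv

# Install vLLM with CUDA 12.1
pip install -U vllm transformers torch

# The remaining lines are Python, run in a notebook cell
import vllm, torch
from vllm import LLM, SamplingParams

# Clear the GPU memory cache
torch.cuda.empty_cache()

# Check the GPU (the leading "!" is notebook shell syntax)
!nvidia-smi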

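To learn more about how to install vLLM, see its installation page.

Get a HuggingFace token

Some models on HuggingFace, such as Meta Llama 3.1, require users to accept their license before being able to download the weights. Therefore, you must create a HuggingFace account, accept the model's license, and generate a token.

When you visit this Llama3.1 page on HuggingFace, you'll get a message asking you to agree to the terms. Click "Accept License" to accept Meta's terms before downloading the model weights. Approval usually takes less than a day.

After you receive approval, you must generate a new HuggingFace token. Your old tokens will not work with the new permissions.

Before installing vLLM, log in to HuggingFace with your new token. Below, I use Colab secrets to store the token.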

# Log in to HuggingFace using your new token
from huggingface_hub import login
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')
login(token = hf_token, add_to_git_credential=True)
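If you are not running in Colab, `google.colab.userdata` will not be available. One common alternative, sketched here as an assumption rather than as part of the original demo, is to keep the token in an environment variable. The helper `resolve_hf_token` and the variable name `HF_TOKEN` are my own choices for illustration.

```python
import os


def resolve_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in var_name, or None if it is missing or blank."""
    env = os.environ if env is None else env
    token = (env.get(var_name) or "").strip()
    return token or None


token = resolve_hf_token()
if token is None:
    print("HF_TOKEN is not set; export it before downloading gated models.")
else:
    print("Found a HuggingFace token; pass it to huggingface_hub.login(token=...).")
```

Keeping the token out of the notebook itself also makes it less likely to end up in a shared file by accident.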

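Run the RAG-Generation

In the demo, we run the Llama-3.1-8B model, which requires a GPU and sizable memory to spin up. The following example was run on Google Colab Pro ($10/month) with an A100 GPU. To learn more about how to run vLLM, you can check out the Quickstart documentation.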

# 1. Choose a model
MODELTORUN = "meta-llama/Meta-Llama-3.1-8B-Instruct"


# 2. Clear the GPU memory cache, you're going to need it all!
torch.cuda.empty_cache()


# 3. Instantiate a vLLM model instance
llm = LLM(model=MODELTORUN,
         enforce_eager=True,
         dtype=torch.bfloat16,
         gpu_memory_utilization=0.5,
         max_model_len=1000,
         seed=415,
         max_num_batched_tokens=3000)
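Both `gpu_memory_utilization` and `max_model_len` above limit how much GPU memory the model and its KV cache can take. As a rough back-of-the-envelope sketch, the snippet below estimates the KV cache size for one sequence. The architecture numbers (32 layers, 8 key-value heads, head dimension 128) are my assumptions for an 8B Llama-style model and are not stated in this post, so treat the result as an order-of-magnitude estimate.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to cache one token: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed architecture values for an 8B Llama-style model (not stated in this post).
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2  # bfloat16
)
max_model_len = 1000                              # same value as in the LLM(...) call
per_sequence = per_token * max_model_len

print(f"KV cache per token:    {per_token / 1024:.0f} KiB")
print(f"KV cache per sequence: {per_sequence / 1024**2:.1f} MiB at {max_model_len} tokens")
```

Under these assumptions a single 1,000-token sequence needs about 125 MiB of cache, so a short `max_model_len` keeps the cache small next to the model weights.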

Write a prompt using the contexts and sources retrieved from Milvus.

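# Collect the retrieved chunk texts and their sources from the search results.
# Assumption: each hit returned by mc.search() is a dict whose 'entity' entry
# holds the requested output fields ('chunk' and 'source').
contexts = [hit['entity']['chunk'] for hit in results[0]]
sources = [hit['entity']['source'] for hit in results[0]]


# Combine all the contexts, separated by a space.
# Lance Martin of LangChain suggests putting the best contexts last.
contexts_combined = ' '.join(reversed(contexts))


# Combine all the unique sources, separated by a space
source_combined = ' '.join(reversed(list(dict.fromkeys(sources))))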


SYSTEM_PROMPT = f"""First, check if the provided Context is relevant to
the user's question.  Second, only if the provided Context is strongly relevant, answer the question using the Context.  Otherwise, if the Context is not strongly relevant, answer the question without using the Context. 
Be clear, concise, relevant.  Answer clearly, in fewer than 2 sentences.
Grounding sources: {source_combined}
Context: {contexts_combined}
User's question: {SAMPLE_QUESTION}
"""


prompts = [SYSTEM_PROMPT]
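The prompt assembly above can also be wrapped in a small function, which makes it easy to try with made-up inputs before wiring it to real search results. This is a minimal sketch: `build_rag_prompt` is a hypothetical helper, and its instruction text is shortened compared with the `SYSTEM_PROMPT` used in this demo.

```python
def build_rag_prompt(question, contexts, sources):
    """Assemble a single prompt string from retrieved chunks and their sources."""
    # Reverse so the best-ranked chunk ends up last, closest to the question.
    contexts_combined = " ".join(reversed(contexts))
    # dict.fromkeys drops duplicate sources while keeping first-seen order.
    unique_sources = list(dict.fromkeys(sources))
    sources_combined = " ".join(reversed(unique_sources))
    return (
        "Answer the question using the Context only if it is relevant.\n"
        f"Grounding sources: {sources_combined}\n"
        f"Context: {contexts_combined}\n"
        f"User's question: {question}\n"
    )


# Made-up retrieval results, ordered best match first.
toy_contexts = ["best chunk", "second-best chunk"]
toy_sources = ["https://milvus.io/docs/index.md", "https://milvus.io/docs/index.md"]
prompt = build_rag_prompt("What do the parameters for HNSW mean?", toy_contexts, toy_sources)
print(prompt)
```

Both toy chunks come from the same page, so the source URL appears only once in the prompt even though it was retrieved twice.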

Now, generate an answer using the retrieved chunks and the original question stuffed into the prompt.

import pprint

# Sampling parameters
sampling_params = SamplingParams(temperature=0.2, top_p=0.95)


# Invoke the vLLM model
outputs = llm.generate(prompts, sampling_params)


# Print the outputs
for output in outputs:
   prompt = output.prompt
   generated_text = output.outputs[0].text
   # !r calls repr(), which prints a string inside quotes
   print()
   print(f"Question: {SAMPLE_QUESTION!r}")
   pprint.pprint(f"Generated text: {generated_text!r}")

Question: 'What do the parameters for HNSW mean?'
Generated text: 'Answer: The parameters for HNSW (Hierarchical Navigable Small World Graph) are: '
'* M: The maximum degree of nodes on each layer of the graph, which can improve '
'recall rate at the cost of increased search time. * efConstruction and ef: ' 
'These parameters specify a search range when building or searching an index.'
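The sampling parameters `temperature=0.2` and `top_p=0.95` used above control how adventurous the generated text is. The toy sketch below shows the general idea of temperature scaling and nucleus (top-p) filtering on four made-up token scores. It is a simplified picture of these two techniques, not vLLM's sampler, and the function names are mine.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                                  # subtract max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, mass = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        mass += p
        if mass >= top_p:
            break
    return {token_id: p / mass for token_id, p in kept}  # renormalize survivors


toy_logits = [4.0, 3.0, 1.0, 0.0]                       # made-up scores for 4 tokens
warm = apply_temperature(toy_logits, temperature=1.0)
cool = apply_temperature(toy_logits, temperature=0.2)
candidates = top_p_filter(cool, top_p=0.95)
print(f"P(top token) at T=1.0: {warm[0]:.3f}, at T=0.2: {cool[0]:.3f}")
print(f"tokens kept by top_p=0.95: {sorted(candidates)}")
```

With a low temperature almost all probability lands on the highest-scoring token, which is why a setting like 0.2 tends to give focused, repeatable answers for a documentation Q&A task like this one.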

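The answer above looks perfect to me!

If you're interested in this demo, feel free to try it yourself and let us know your thoughts. You're also welcome to join our Milvus community on Discord to talk directly with other GenAI developers.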

References

vLLM official documentation and model page
2023 vLLM Paged Attention paper
2023 vLLM presentation at Ray Summit
vLLM blog: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
Helpful blog about running the vLLM server: Deploying vLLM: a Step-by-Step Guide
The Llama 3 Herd of Models | Research - AI at Meta
