
Retrieval-Augmented Generation: Crawling Websites with Apify and Saving Data to Milvus for Question Answering


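This tutorial explains how to crawl websites using Apify's Website Content Crawler and save the data into a Milvus/Zilliz vector database, so that it can later be used for question answering.

Apify is a web scraping and data extraction platform that offers an app marketplace with over two thousand ready-made cloud tools, known as Actors. These tools are well suited to use cases such as extracting structured data from e-commerce websites, social media, search engines, and online maps.

For example, the Website Content Crawler Actor can deeply crawl websites, clean their HTML by removing cookie modals, footers, or navigation, and then convert the HTML into Markdown.

The Apify integration for Milvus/Zilliz makes it easy to upload data from the web to the vector database.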

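Before you begin

Before running this notebook, make sure you have an Apify API token, an OpenAI API key, and the URI and token of a Milvus/Zilliz database. Each of these is set up in the steps below.

Install dependencies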

$ pip install --upgrade --quiet  apify==1.7.2 langchain-core==0.3.5 langchain-milvus==0.1.5 langchain-openai==0.2.0

Set up Apify and OpenAI API keys

import os
from getpass import getpass

os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")

Enter YOUR APIFY_API_TOKEN·········· Enter YOUR OPENAI_API_KEY··········

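Set up the Milvus/Zilliz URI, token, and collection name

You need the URI and token of your Milvus/Zilliz instance to set up the client.

  • If you have self-deployed a Milvus server on Docker or Kubernetes, use the server address and port as your uri, e.g. http://localhost:19530. If you have enabled authentication on Milvus, use "<your_username>:<your_password>" as the token; otherwise leave the token empty.
  • If you use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.

Note that the collection does not need to exist beforehand. It will be created automatically when data is uploaded to the database.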
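The two deployment cases described above can be captured in a small helper. This is only a sketch: milvus_connection_args is a hypothetical function, not part of pymilvus or the Apify integration, and the username, password, and key values are placeholders.

```python
def milvus_connection_args(uri, username=None, password=None, api_key=None):
    """Build the uri/token pair for a Milvus or Zilliz Cloud connection.

    Hypothetical helper for illustration only.
    """
    if api_key:
        # Zilliz Cloud: the API key is used as the token.
        token = api_key
    elif username and password:
        # Self-hosted Milvus with authentication enabled.
        token = f"{username}:{password}"
    else:
        # Self-hosted Milvus without authentication: empty token.
        token = ""
    return {"uri": uri, "token": token}


# Placeholder values, not real credentials.
local_args = milvus_connection_args("http://localhost:19530")
auth_args = milvus_connection_args("http://localhost:19530", "user", "secret")
print(local_args)
print(auth_args)
```

The returned dictionary has the same shape as the connection_args passed to the Langchain Milvus vector store later in this tutorial.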

os.environ["MILVUS_URI"] = getpass("Enter YOUR MILVUS_URI")
os.environ["MILVUS_TOKEN"] = getpass("Enter YOUR MILVUS_TOKEN")

MILVUS_COLLECTION_NAME = "apify"

Enter YOUR MILVUS_URI·········· Enter YOUR MILVUS_TOKEN··········

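Use the Website Content Crawler to scrape text content from Milvus.io

Next, we'll use the Website Content Crawler with the Apify Python SDK. We'll start by defining the actor_id and run_input, then specify the information that will be saved to the vector database.

actor_id="apify/website-content-crawler" is the identifier for the Website Content Crawler. The crawler's behavior can be fully controlled through the run_input parameters (see the input page for more details). In this example, we'll crawl the Milvus documentation, which doesn't require JavaScript rendering. We therefore set crawlerType=cheerio, define startUrls, and limit the number of crawled pages by setting maxCrawlPages=10.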

from apify_client import ApifyClient

client = ApifyClient(os.getenv("APIFY_API_TOKEN"))

actor_id = "apify/website-content-crawler"
run_input = {
    "crawlerType": "cheerio",
    "maxCrawlPages": 10,
    "startUrls": [{"url": "https://milvus.io/"}, {"url": "https://zilliz.com/"}],
}

actor_call = client.actor(actor_id).call(run_input=run_input)

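The Website Content Crawler will thoroughly crawl the site until it reaches the predefined limit set by maxCrawlPages. The scraped data is stored in a Dataset on the Apify platform. To access and analyze this data, you can use the defaultDatasetId.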

dataset_id = actor_call["defaultDatasetId"]
dataset_id

'P9dLFfeJAljlePWnz'

The following code fetches the scraped data from the Apify Dataset and displays the first scraped page.

item = client.dataset(dataset_id).list_items(limit=1).items
item[0].get("text")

'The High-Performance Vector Database Built for Scale\nStart running Milvus in seconds\nfrom pymilvus import MilvusClient client = MilvusClient("milvus_demo.db") client.create_collection( collection_name="demo_collection", dimension="5" )\nDeployment Options to Match Your Unique Journey\nMilvus Lite\nLightweight, easy to start\nVectorDB-as-a-library runs in notebooks/ laptops with a pip install\nBest for learning and prototyping\nMilvus Standalone\nRobust, single-machine deployment\nComplete vector database for production or testing\nIdeal for datasets with up to millions of vectors\nMilvus Distributed\nScalable, enterprise-grade solution\nHighly reliable and distributed vector database with comprehensive toolkit\nScale horizontally to handle billions of vectors\nZilliz Cloud\nFully managed with minimal operations\nAvailable in both serverless and dedicated cluster\nSaaS and BYOC options for different security and compliance requirements\nTry Free\nLearn more about different Milvus deployment models\nLoved by GenAI developers\nBased on our research, Milvus was selected as the vector database of choice (over Chroma and Pinecone). Milvus is an open-source vector database designed specifically for similarity search on massive datasets of high-dimensional vectors.\nWith its focus on efficient vector similarity search, Milvus empowers you to build robust and scalable image retrieval systems. Whether you're managing a personal photo library or developing a commercial image search application, Milvus offers a powerful foundation for unlocking the hidden potential within your image collections.\nBhargav Mankad\nSenior Solution Architect\nMilvus is a powerful vector database tailored for processing and searching extensive vector data. 
It stands out for its high performance and scalability, rendering it perfect for machine learning, deep learning, similarity search tasks, and recommendation systems.\nIgor Gorbenko\nBig Data Architect\nStart building your GenAI app now\nGuided with notebooks developed by us and our community\nRAG\nTry Now\nImage Search\nTry Now\nMultimodal Search\nTry Now\nUnstructured Data Meetups\nJoin a Community of Passionate Developers and Engineers Dedicated to Gen AI.\nRSVP now\nWhy Developers Prefer Milvus for Vector Databases\nScale as needed\nElastic scaling to tens of billions of vectors with distributed architecture.\nBlazing fast\nRetrieve data quickly and accurately with Global Index, regardless of scale.\nReusable Code\nWrite once, and deploy with one line of code into the production environment.\nFeature-rich\nMetadata filtering, hybrid search, multi-vector and more.\nWant to learn more about Milvus? View our documentation\nJoin the community of developers building GenAI apps with Milvus, now with over 25 million downloads\nGet Milvus Updates\nSubscribe to get updates on the latest Milvus releases, tutorials and training from Zilliz, the creator and key maintainer of Milvus.'
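To get a quick overview of more than one page, the fetched items can be summarized in a loop. The sketch below is an illustration under assumptions: summarize_items is a hypothetical helper, the url key is assumed to be present on each item (only text and metadata.title are confirmed by this tutorial), and the sample item is hand-written rather than real crawler output.

```python
def summarize_items(items):
    """Return one short summary line per crawled page (hypothetical helper)."""
    lines = []
    for item in items:
        url = item.get("url", "(no url)")  # "url" key is an assumption
        title = (item.get("metadata") or {}).get("title", "(no title)")
        text = item.get("text") or ""
        lines.append(f"{url} | {title} | {len(text)} chars")
    return lines


# Hand-written sample shaped like a crawler item. In the notebook you could
# pass client.dataset(dataset_id).list_items().items instead.
sample_items = [
    {
        "url": "https://milvus.io/",
        "text": "The High-Performance Vector Database Built for Scale",
        "metadata": {"title": "Example title"},
    },
]
for line in summarize_items(sample_items):
    print(line)
```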

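To upload data to the Milvus database, we use the Apify Milvus integration. First, we need to set up the parameters for the Milvus database. Next, we select the fields (datasetFields) that we want to store in the database. In the example below, we save the text field and metadata.title.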

milvus_integration_inputs = {
    "milvusUri": os.getenv("MILVUS_URI"),
    "milvusToken": os.getenv("MILVUS_TOKEN"),
    "milvusCollectionName": MILVUS_COLLECTION_NAME,
    "datasetFields": ["text", "metadata.title"],
    "datasetId": actor_call["defaultDatasetId"],
    "performChunking": True,
    "embeddingsApiKey": os.getenv("OPENAI_API_KEY"),
    "embeddingsProvider": "OpenAI",
}
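The dotted name metadata.title in datasetFields refers to the title key nested inside each item's metadata object. The sketch below shows what such a path points to. It is not the integration's actual implementation: get_field is a hypothetical helper and the sample item is hand-written.

```python
def get_field(item, dotted_path):
    """Resolve a dotted path such as "metadata.title" in a nested dict.

    Hypothetical helper; returns None when any key along the path is missing.
    """
    value = item
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Hand-written sample item, not real crawler output.
sample_item = {
    "text": "Start running Milvus in seconds",
    "metadata": {"title": "Example title"},
}
selected = {path: get_field(sample_item, path) for path in ["text", "metadata.title"]}
print(selected)
```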

Now, we'll call the apify/milvus-integration Actor to store the data.

actor_call = client.actor("apify/milvus-integration").call(
    run_input=milvus_integration_inputs
)

All the scraped data is now stored in the Milvus database and is ready for retrieval and question answering.

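Retrieval and LLM generative pipeline

Next, we'll define the retrieval-augmented pipeline using Langchain. The pipeline works in two stages:

  • Vector store (Milvus): Langchain retrieves relevant documents from Milvus by matching the query embedding against the stored document embeddings.
  • LLM response: The retrieved documents provide context for the LLM (e.g., GPT-4) to generate an informed answer.

For more details about RAG chains, please refer to the Langchain documentation.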
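To make the retrieval stage concrete, here is a toy sketch in plain Python that ranks stored documents by similarity to a query vector. It rests on several assumptions: the three-dimensional vectors are hand-made stand-ins for real embeddings, cosine similarity is chosen for illustration (the metric used by your Milvus collection may differ), and retrieve is a hypothetical function rather than a Langchain or Milvus API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, k=2):
    """Return the texts of the k stored documents most similar to the query."""
    ranked = sorted(
        store,
        key=lambda doc: cosine_similarity(query_vec, doc["vector"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:k]]


# Hand-made toy vectors standing in for real embeddings.
toy_store = [
    {"text": "Milvus is a vector database.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Apify runs web crawlers.", "vector": [0.1, 0.9, 0.0]},
    {"text": "Zilliz Cloud hosts Milvus.", "vector": [0.8, 0.2, 0.1]},
]
toy_query = [1.0, 0.0, 0.0]

# The retrieved texts are joined into the context that the LLM would receive.
context = "\n\n".join(retrieve(toy_query, toy_store, k=2))
print(context)
```

In the real pipeline, the embedding model produces the vectors and Milvus performs the similarity search, so none of this code is needed there.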

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_milvus.vectorstores import Milvus
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Connect to the collection that the Apify integration populated
vectorstore = Milvus(
    connection_args={
        "uri": os.getenv("MILVUS_URI"),
        "token": os.getenv("MILVUS_TOKEN"),
    },
    embedding_function=embeddings,
    collection_name=MILVUS_COLLECTION_NAME,
)

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="Use the following pieces of retrieved context to answer the question. If you don't know the answer, "
    "just say that you don't know. \nQuestion: {question} \nContext: {context} \nAnswer:",
)


def format_docs(docs):
    # Join the retrieved documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)


# Retrieve context, fill the prompt, call the LLM, and parse the output to a string
rag_chain = (
    {
        "context": vectorstore.as_retriever() | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

Once we have the data in the database, we can start asking questions.


question = "What is Milvus database?"

rag_chain.invoke(question)

'Milvus is an open-source vector database specifically designed for billion-scale vector similarity search. It facilitates efficient management and querying of vector data, which is essential for applications involving unstructured data, such as AI and machine learning. Milvus allows users to perform operations like CRUD (Create, Read, Update, Delete) and vector searches, making it a powerful tool for handling large datasets.'

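Conclusion

In this tutorial, we demonstrated how to crawl website content using Apify, store the data in a Milvus vector database, and use a retrieval-augmented pipeline to perform question-answering tasks. By combining Apify's web scraping capabilities with Milvus/Zilliz for vector storage and Langchain for language models, you can build efficient information retrieval systems.

To improve data collection and database updates, the Apify integration offers incremental updates, which update only new or modified data based on checksums. It can also automatically remove outdated data that has not been crawled within a specified time. These features help keep your vector database optimized and ensure that your retrieval-augmented pipeline stays efficient and up to date without extensive manual work.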
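The idea behind checksum-based incremental updates and expiry of stale data can be sketched as follows. This is an illustration of the general technique, not the integration's actual code: plan_update, the record layout, the SHA-256 checksum, and the example URLs are all assumptions made for this sketch.

```python
import hashlib
from datetime import datetime, timedelta


def checksum(text):
    """SHA-256 hex digest of the page text (the choice of hash is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(stored, crawled, now, expire_after):
    """Decide which pages to write and which to delete.

    stored:  url -> {"checksum": str, "last_seen": datetime}
    crawled: url -> page text from the latest crawl
    """
    # New pages, or pages whose content changed since the last crawl.
    to_upsert = [
        url for url, text in crawled.items()
        if url not in stored or stored[url]["checksum"] != checksum(text)
    ]
    # Pages missing from this crawl and not seen within the expiry window.
    to_delete = [
        url for url, record in stored.items()
        if url not in crawled and now - record["last_seen"] > expire_after
    ]
    return sorted(to_upsert), sorted(to_delete)


# Made-up example data.
now = datetime(2024, 1, 31)
stored = {
    "https://example.com/a": {"checksum": checksum("same"), "last_seen": datetime(2024, 1, 30)},
    "https://example.com/b": {"checksum": checksum("old"), "last_seen": datetime(2024, 1, 30)},
    "https://example.com/c": {"checksum": checksum("gone"), "last_seen": datetime(2023, 12, 1)},
}
crawled = {
    "https://example.com/a": "same",
    "https://example.com/b": "new",
    "https://example.com/d": "brand new",
}
to_upsert, to_delete = plan_update(stored, crawled, now, timedelta(days=30))
print(to_upsert, to_delete)
```

Here page a is unchanged and skipped, b and d are written, and c is deleted because it was last seen more than 30 days ago.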

For more details on the Apify-Milvus integration, please refer to the Apify Milvus documentation and the integration README file.