Jieba
The jieba tokenizer processes Chinese text by breaking it down into its component words.
Configuration
Milvus supports two configuration approaches for the jieba tokenizer: simple configuration and custom configuration.
Simple configuration
With the simple configuration, you only need to set the tokenizer to "jieba". For example:
# Simple configuration: only specifying the tokenizer name
analyzer_params = {
"tokenizer": "jieba", # Use the default settings: dict=["_default_"], mode="search", hmm=true
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "jieba");
const analyzer_params = {
"tokenizer": "jieba",
};
analyzerParams = map[string]any{"tokenizer": "jieba"}
# restful
analyzerParams='{
"tokenizer": "jieba"
}'
This simple configuration is equivalent to the following custom configuration:
# Custom configuration equivalent to the simple configuration above
analyzer_params = {
    "tokenizer": {
        "type": "jieba", # Tokenizer type, fixed as "jieba"
        "dict": ["_default_"], # Use the default dictionary
        "mode": "search", # Use search mode for improved recall (see mode details below)
        "hmm": True # Enable HMM for probabilistic segmentation
    }
}
Map<String, Object> analyzerParams = new HashMap<>();
Map<String, Object> tokenizer = new HashMap<>();
tokenizer.put("type", "jieba");
tokenizer.put("dict", Collections.singletonList("_default_"));
tokenizer.put("mode", "search");
tokenizer.put("hmm", true);
analyzerParams.put("tokenizer", tokenizer);
const analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["_default_"],
        "mode": "search",
        "hmm": true,
    },
};
analyzerParams = map[string]any{"tokenizer": map[string]any{"type": "jieba", "dict": []any{"_default_"}, "mode": "search", "hmm": true}}
# restful
analyzerParams='{
    "tokenizer": {
        "type": "jieba",
        "dict": ["_default_"],
        "mode": "search",
        "hmm": true
    }
}'
For parameter details, refer to Custom configuration.
Custom configuration
For more control, you can provide a custom configuration that lets you specify a custom dictionary, choose the tokenization mode, and enable or disable the hidden Markov model (HMM). For example:
# Custom configuration with user-defined settings
analyzer_params = {
"tokenizer": {
"type": "jieba", # Fixed tokenizer type
"dict": ["customDictionary"], # Custom dictionary list; replace with your own terms
"mode": "exact", # Use exact mode (non-overlapping tokens)
        "hmm": False # Disable HMM; unmatched text will be split into individual characters
}
}
Map<String, Object> analyzerParams = new HashMap<>();
Map<String, Object> tokenizer = new HashMap<>();
tokenizer.put("type", "jieba");
tokenizer.put("dict", Collections.singletonList("customDictionary"));
tokenizer.put("mode", "exact");
tokenizer.put("hmm", false);
analyzerParams.put("tokenizer", tokenizer);
const analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["customDictionary"],
        "mode": "exact",
        "hmm": false,
    },
};
analyzerParams = map[string]any{"tokenizer": map[string]any{"type": "jieba", "dict": []any{"customDictionary"}, "mode": "exact", "hmm": false}}
# restful
analyzerParams='{
    "tokenizer": {
        "type": "jieba",
        "dict": ["customDictionary"],
        "mode": "exact",
        "hmm": false
    }
}'
| Parameter | Description | Default |
|---|---|---|
| type | The type of the tokenizer. This value is fixed to "jieba". | "jieba" |
| dict | The list of dictionaries the analyzer loads as its vocabulary source. Built-in option: "_default_", which loads the standard jieba Chinese dictionary. You can instead provide a list of your own terms. | ["_default_"] |
| mode | The tokenization mode. Possible values: "exact" (split the text into non-overlapping tokens) and "search" (additionally emit overlapping sub-words of long tokens to improve recall). | "search" |
| hmm | Boolean flag indicating whether to enable the hidden Markov model (HMM) for probabilistic segmentation of words not found in the dictionary. | true |
After defining analyzer_params, you can apply it to a VARCHAR field when defining the collection schema. This allows Milvus to process the text in that field with the specified analyzer for efficient tokenization and filtering. For details, refer to Example usage.
Examples
Before applying the analyzer configuration to your collection schema, verify its behavior with the run_analyzer method.
Analyzer configuration
analyzer_params = {
"tokenizer": {
"type": "jieba",
"dict": ["结巴分词器"],
"mode": "exact",
"hmm": False
}
}
Map<String, Object> analyzerParams = new HashMap<>();
Map<String, Object> tokenizer = new HashMap<>();
tokenizer.put("type", "jieba");
tokenizer.put("dict", Collections.singletonList("结巴分词器"));
tokenizer.put("mode", "exact");
tokenizer.put("hmm", false);
analyzerParams.put("tokenizer", tokenizer);
const analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": false,
    },
};
analyzerParams = map[string]any{"tokenizer": map[string]any{"type": "jieba", "dict": []any{"结巴分词器"}, "mode": "exact", "hmm": false}}
# restful
analyzerParams='{
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": false
    }
}'
Verification with run_analyzer
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Sample text to analyze
sample_text = "milvus结巴分词器中文测试"

# Run the analyzer with the configuration defined above
result = client.run_analyzer(sample_text, analyzer_params)
print(result)
Expected output
['milvus', '结巴分词器', '中', '文', '测', '试']
Because the custom dictionary contains only "结巴分词器" and HMM is disabled, the unmatched text "中文测试" falls back to individual characters, while the ASCII token "milvus" is kept whole.