Jieba

The jieba tokenizer processes Chinese text by breaking it down into its component words.

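Configuration

Milvus supports two configuration approaches for the jieba tokenizer: a simple configuration and a custom configuration.

Simple configuration

With the simple configuration, you only need to set the tokenizer to "jieba". For example: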

# Simple configuration: only specifying the tokenizer name
analyzer_params = {
    "tokenizer": "jieba",  # Use the default settings: dict=["_default_"], mode="search", hmm=true
}

Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "jieba");

const analyzer_params = {
    "tokenizer": "jieba",
};

analyzerParams = map[string]any{"tokenizer": "jieba"}

# restful
analyzerParams='{
  "tokenizer": "jieba"
}'

This simple configuration is equivalent to the following custom configuration:

# Custom configuration equivalent to the simple configuration above
analyzer_params = {
    "tokenizer": {
        "type": "jieba",          # Tokenizer type, fixed as "jieba"
        "dict": ["_default_"],    # Use the default dictionary
        "mode": "search",         # Use search mode for improved recall (see mode details below)
        "hmm": True               # Enable HMM for probabilistic segmentation
    }
}

Map<String, Object> tokenizer = new HashMap<>();
tokenizer.put("type", "jieba");
tokenizer.put("dict", Collections.singletonList("_default_"));
tokenizer.put("mode", "search");
tokenizer.put("hmm", true);
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", tokenizer);

analyzerParams = map[string]any{"tokenizer": map[string]any{"type": "jieba", "dict": []any{"_default_"}, "mode": "search", "hmm": true}}

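For parameter details, see Custom configuration.

Custom configuration

For more control, you can provide a custom configuration that specifies a custom dictionary, selects the segmentation mode, and enables or disables the Hidden Markov Model (HMM). For example: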

# Custom configuration with user-defined settings
analyzer_params = {
    "tokenizer": {
        "type": "jieba",               # Fixed tokenizer type
        "dict": ["customDictionary"],  # Custom dictionary list; replace with your own terms
        "mode": "exact",               # Use exact mode (non-overlapping tokens)
        "hmm": False                   # Disable HMM; unmatched text will be split into individual characters
    }
}

Map<String, Object> tokenizer = new HashMap<>();
tokenizer.put("type", "jieba");
tokenizer.put("dict", Collections.singletonList("customDictionary"));
tokenizer.put("mode", "exact");
tokenizer.put("hmm", false);
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", tokenizer);

analyzerParams = map[string]any{"tokenizer": map[string]any{"type": "jieba", "dict": []any{"customDictionary"}, "mode": "exact", "hmm": false}}

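Parameter	Description	Default value
`type`	The type of tokenizer. This value is fixed to `"jieba"`.	`"jieba"`
`dict`	A list of dictionaries that the analyzer loads as its vocabulary source. Built-in options: `"_default_"` loads the engine's built-in Simplified Chinese dictionary (see dict.txt); `"_extend_default_"` loads everything in `"_default_"` plus an additional Traditional Chinese supplement (see dict.txt.big). You can also combine a built-in dictionary with any number of custom dictionaries. Example: `["_default_", "结巴分词器"]`.	`["_default_"]`
`mode`	The segmentation mode. Possible values: `"exact"` tries to segment the sentence as precisely as possible, which makes it well suited to text analysis; `"search"` builds on exact mode by further breaking down long words to improve recall, which makes it suitable for search-engine tokenization. For more information, see the Jieba GitHub project.	`"search"`
`hmm`	A Boolean flag indicating whether to enable the Hidden Markov Model (HMM) for probabilistic segmentation of words not found in the dictionary.	`true`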
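
The difference between the `"exact"` and `"search"` modes can be illustrated with a small standalone sketch. It is not the Milvus or Jieba implementation: the function name `expand_for_search` and the toy vocabulary are hypothetical, and the code only mimics the general idea that search mode additionally emits shorter in-vocabulary words found inside a long word.

```python
def expand_for_search(words, vocabulary):
    """Illustrative only: emit in-vocabulary 2- and 3-character sub-words
    before each long word, roughly as a search-oriented mode would."""
    tokens = []
    for word in words:
        for size in (2, 3):
            # Only words longer than the window can contain shorter sub-words.
            if len(word) > size:
                for start in range(len(word) - size + 1):
                    gram = word[start:start + size]
                    if gram in vocabulary:
                        tokens.append(gram)
        # The full word is always kept.
        tokens.append(word)
    return tokens


# Toy vocabulary (an assumption for this example, not a built-in dictionary).
vocabulary = {"中国", "科学", "学院", "科学院", "中国科学院"}

# Suppose exact mode produced a single long token.
exact_tokens = ["中国科学院"]

search_tokens = expand_for_search(exact_tokens, vocabulary)
print(search_tokens)  # ['中国', '科学', '学院', '科学院', '中国科学院']
```

The extra overlapping tokens are what improve recall: a query for 科学院 can match a document that was segmented as 中国科学院.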

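After defining analyzer_params, you can apply them to a VARCHAR field when defining a collection schema. Milvus then processes the text in that field with the specified analyzer for efficient tokenization and filtering. For details, see Example use.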

Examples

Before applying the analyzer configuration to your collection schema, verify its behavior with the run_analyzer method.

Analyzer configuration

analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": False
    }
}

Map<String, Object> tokenizer = new HashMap<>();
tokenizer.put("type", "jieba");
tokenizer.put("dict", Collections.singletonList("结巴分词器"));
tokenizer.put("mode", "exact");
tokenizer.put("hmm", false);
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", tokenizer);

analyzerParams = map[string]any{"tokenizer": map[string]any{"type": "jieba", "dict": []any{"结巴分词器"}, "mode": "exact", "hmm": false}}

Verification using `run_analyzer`

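from pymilvus import MilvusClient

# Connect to a Milvus instance (the URI is an assumption; replace with your own)
client = MilvusClient(uri="http://localhost:19530")

# Sample text to analyze
sample_text = "milvus结巴分词器中文测试"

# Run the jieba analyzer with the configuration defined above
result = client.run_analyzer(sample_text, analyzer_params)
print(result)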

Expected output

['milvus', '结巴分词器', '中', '文', '测', '试']
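
To see why the output has this shape, the following standalone sketch reproduces it without Milvus. It is a simplification under stated assumptions: Jieba itself selects a segmentation using a prefix dictionary and word frequencies, whereas this sketch swaps in greedy longest-match lookup, and the `segment` function is a hypothetical helper. It only models the behavior described above: dictionary terms are kept whole, runs of ASCII letters and digits stay together, and with HMM disabled any unmatched text falls back to single characters.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation with single-character fallback."""
    max_len = max((len(term) for term in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Keep runs of ASCII letters and digits together, e.g. "milvus".
        if ch.isascii() and ch.isalnum():
            j = i
            while j < len(text) and text[j].isascii() and text[j].isalnum():
                j += 1
            tokens.append(text[i:j])
            i = j
            continue
        # Try the longest dictionary term that starts at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # No dictionary term matches here: emit a single character.
            tokens.append(ch)
            i += 1
    return tokens


result = segment("milvus结巴分词器中文测试", {"结巴分词器"})
print(result)  # ['milvus', '结巴分词器', '中', '文', '测', '试']
```

Because the dictionary in this example contains only 结巴分词器 and not the built-in vocabulary, 中文 and 测试 are not recognized as words and are split character by character.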
