mirror of https://github.com/openclaw/openclaw.git
The tokenize() function only matched [a-z0-9_]+ patterns, returning an empty set for CJK-only text. This made Jaccard similarity always 0 (or always 1 for two empty sets) for CJK content, effectively disabling MMR diversity detection. Add support for: - CJK Unified Ideographs (U+4E00–U+9FFF, U+3400–U+4DBF) - Hiragana (U+3040–U+309F) and Katakana (U+30A0–U+30FF) - Hangul Syllables (U+AC00–U+D7AF) and Jamo (U+1100–U+11FF) Characters are extracted as unigrams, and bigrams are generated only from characters that are adjacent in the original text (no spurious bigrams across ASCII boundaries). Fixes #28000 |
||
|---|---|---|
| .. | ||
| src | ||
| api.ts | ||
| index.test.ts | ||
| index.ts | ||
| openclaw.plugin.json | ||
| package.json | ||
| runtime-api.ts | ||