mirror of https://github.com/openclaw/openclaw.git
The tokenize() function only matched [a-z0-9_]+ patterns, returning an empty set for CJK-only text. This made Jaccard similarity always 0 (or always 1 for two empty sets) for CJK content, effectively disabling MMR diversity detection.

Add support for:
- CJK Unified Ideographs (U+4E00–U+9FFF, U+3400–U+4DBF)
- Hiragana (U+3040–U+309F) and Katakana (U+30A0–U+30FF)
- Hangul Syllables (U+AC00–U+D7AF) and Jamo (U+1100–U+11FF)

Characters are extracted as unigrams, and bigrams are generated only from characters that are adjacent in the original text (no spurious bigrams across ASCII boundaries).

Fixes #28000

| | | |
|---|---|---|
| .. | ||
| memory | ||
| cli.runtime.ts | ||
| cli.test.ts | ||
| cli.ts | ||
| cli.types.ts | ||
| flush-plan.ts | ||
| prompt-section.ts | ||
| runtime-provider.ts | ||
| tools.citations.test.ts | ||
| tools.citations.ts | ||
| tools.runtime.ts | ||
| tools.shared.ts | ||
| tools.test-helpers.ts | ||
| tools.test.ts | ||
| tools.ts | ||
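
The tokenization change described in the commit message can be illustrated with a minimal sketch. This is not the repository's actual code: the names `tokenizeSketch` and `jaccardSketch` are hypothetical, lowercasing the input is an assumption, and only the character ranges and the unigram/adjacent-bigram rule come from the commit message.

```typescript
// Hypothetical sketch; not the repository's implementation.
// Character ranges are the ones listed in the commit message.
const CJK_CHAR =
  /[\u4e00-\u9fff\u3400-\u4dbf\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af\u1100-\u11ff]/;

function tokenizeSketch(text: string): Set<string> {
  const tokens = new Set<string>();
  const lower = text.toLowerCase(); // assumption: input is lowercased first

  // ASCII word tokens, matching the original [a-z0-9_]+ pattern.
  (lower.match(/[a-z0-9_]+/g) || []).forEach((word) => tokens.add(word));

  // CJK unigrams, plus bigrams only for characters adjacent in the text.
  // All listed ranges are in the Basic Multilingual Plane, so indexing by
  // UTF-16 code unit is safe here.
  let prev: string | null = null;
  for (let i = 0; i < lower.length; i++) {
    const ch = lower[i];
    if (CJK_CHAR.test(ch)) {
      tokens.add(ch);
      if (prev !== null) tokens.add(prev + ch);
      prev = ch;
    } else {
      prev = null; // any non-CJK character breaks adjacency
    }
  }
  return tokens;
}

// Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets are treated as
// identical (1), which is the degenerate case the commit message mentions.
function jaccardSketch(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let intersection = 0;
  a.forEach((token) => {
    if (b.has(token)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}
```

With this sketch, `tokenizeSketch("日a本")` contains `日`, `本`, and `a` but not the bigram `日本`, because the ASCII character between them resets adjacency. Two CJK strings that share some characters now get a Jaccard score strictly between 0 and 1, so a diversity measure has something to work with.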