mirror of https://github.com/openclaw/openclaw.git
The tokenize() function only matched [a-z0-9_]+ patterns, returning an empty set for CJK-only text. This made Jaccard similarity always 0 (or always 1 for two empty sets) for CJK content, effectively disabling MMR diversity detection.

Add support for:
- CJK Unified Ideographs (U+4E00–U+9FFF, U+3400–U+4DBF)
- Hiragana (U+3040–U+309F) and Katakana (U+30A0–U+30FF)
- Hangul Syllables (U+AC00–U+D7AF) and Jamo (U+1100–U+11FF)

Characters are extracted as unigrams, and bigrams are generated only from characters that are adjacent in the original text (no spurious bigrams across ASCII boundaries).

Fixes #28000

| Name | | |
|---|---|---|
| test-helpers | ||
| embedding-manager.test-harness.ts | ||
| embedding.test-mocks.ts | ||
| embeddings.ts | ||
| hybrid.test.ts | ||
| hybrid.ts | ||
| index.test.ts | ||
| index.ts | ||
| manager-embedding-ops.ts | ||
| manager-runtime.ts | ||
| manager-search.ts | ||
| manager-sync-ops.ts | ||
| manager.async-search.test.ts | ||
| manager.atomic-reindex.test.ts | ||
| manager.batch.test.ts | ||
| manager.embedding-batches.test.ts | ||
| manager.get-concurrency.test.ts | ||
| manager.mistral-provider.test.ts | ||
| manager.read-file.test.ts | ||
| manager.readonly-recovery.test.ts | ||
| manager.sync-errors-do-not-crash.test.ts | ||
| manager.ts | ||
| manager.vector-dedupe.test.ts | ||
| manager.watcher-config.test.ts | ||
| mmr.test.ts | ||
| mmr.ts | ||
| provider-adapters.ts | ||
| qmd-manager.test.ts | ||
| qmd-manager.ts | ||
| search-manager.test.ts | ||
| search-manager.ts | ||
| temporal-decay.test.ts | ||
| temporal-decay.ts | ||
| test-embeddings-mock.ts | ||
| test-manager-helpers.ts | ||
| test-manager.ts | ||
| test-runtime-mocks.ts | ||
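
The tokenization described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual code in mmr.ts: the names `TOKEN_RE`, `tokenize`, and `jaccard` and their signatures are assumptions, and only the character ranges are taken from the commit message. ASCII words are kept whole, while each run of consecutive CJK characters yields unigrams plus bigrams of neighbouring characters, so no bigram can span an ASCII or whitespace boundary.

```typescript
// Hypothetical sketch; the real tokenize() in mmr.ts may differ.
// Matches either an ASCII word or a run of consecutive CJK characters.
// Ranges (from the commit message): Hangul Jamo, Hiragana, Katakana,
// CJK Unified Ideographs (two ranges), Hangul Syllables.
const TOKEN_RE =
  /[a-z0-9_]+|[\u1100-\u11ff\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]+/g;

function tokenize(text: string): Set<string> {
  const tokens = new Set<string>();
  const runs = text.toLowerCase().match(TOKEN_RE) ?? [];
  for (let r = 0; r < runs.length; r++) {
    const run = runs[r];
    if (/^[a-z0-9_]/.test(run)) {
      // ASCII word: keep it as a single token.
      tokens.add(run);
      continue;
    }
    // CJK run: every character is a unigram, and each pair of
    // neighbouring characters inside the same run is a bigram.
    for (let i = 0; i < run.length; i++) {
      tokens.add(run.charAt(i));
      if (i + 1 < run.length) {
        tokens.add(run.charAt(i) + run.charAt(i + 1));
      }
    }
  }
  return tokens;
}

// Jaccard similarity; two empty sets are treated as identical (1),
// matching the behaviour the commit message describes.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let intersection = 0;
  a.forEach((t) => {
    if (b.has(t)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}
```

With this sketch, `tokenize("日本語")` produces the unigrams 日, 本, 語 and the bigrams 日本 and 本語, whereas `tokenize("日a本")` produces only 日, a, and 本, because the two ideographs are not adjacent in the original text.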