openclaw

Commit Graph

Author	SHA1	Message	Date
Tak Hoffman	3ce48aff66	Memory: add configurable FTS5 tokenizer for CJK text support (openclaw#56707) Verified: - pnpm build - pnpm check - pnpm test -- extensions/memory-core/src/memory/manager-search.test.ts packages/memory-host-sdk/src/host/query-expansion.test.ts - pnpm test -- extensions/memory-core/src/memory/index.test.ts -t "reindexes when extraPaths change" - pnpm test -- src/config/schema.base.generated.test.ts - pnpm test -- src/media-understanding/image.test.ts - pnpm test Co-authored-by: Mitsuyuki Osabe <24588751+carrotRakko@users.noreply.github.com>	2026-03-28 20:53:29 -05:00
AaronLuo00	f8547fcae4	fix: guard fine-split against breaking UTF-16 surrogate pairs When re-splitting CJK-heavy segments at chunking.tokens, check whether the slice boundary falls on a high surrogate (0xD800–0xDBFF) and if so extend by one code unit to keep the pair intact. Prevents producing broken surrogate halves for CJK Extension B+ characters (U+20000+). Add test verifying no lone surrogates appear when splitting lines of surrogate-pair characters with an odd token budget. Addresses third-round Codex P2 review comment.	2026-03-29 10:22:43 +09:00
AaronLuo00	3b95aa8804	fix: address second-round review — Latin backward compat and emoji consistency - Two-pass line splitting: first slice at maxChars (unchanged for Latin), then re-split only CJK-heavy segments at chunking.tokens. This preserves the original ~800-char segments for ASCII lines while keeping CJK chunks within the token budget. - Narrow surrogate-pair adjustment to CJK Extension B+ range (D840–D87E) only, so emoji surrogate pairs are not affected. Mixed CJK+emoji text is now handled consistently regardless of composition. - Add tests: emoji handling (2), Latin backward-compat long-line (1). Addresses Codex P1 (oversized CJK segments) and P2s (Latin over-splitting, emoji surrogate inconsistency).	2026-03-29 10:22:43 +09:00
AaronLuo00	a5147d4d88	fix: address bot review — surrogate-pair counting and CJK line splitting - Use code-point length instead of UTF-16 length in estimateStringChars() so that CJK Extension B+ surrogate pairs (U+20000+) are counted as 1 character, not 2 (fixes ~25% overestimate for rare characters). - Change long-line split step from maxChars to chunking.tokens so that CJK lines are sliced into token-budget-sized segments instead of char-budget-sized segments that produce ~4x oversized chunks. - Add tests for both fixes: surrogate-pair handling and long CJK line splitting. Addresses review feedback from Greptile and Codex bots.	2026-03-29 10:22:43 +09:00
AaronLuo00	971ecabe80	fix(memory): account for CJK characters in QMD memory chunking The QMD memory system uses a fixed 4:1 chars-to-tokens ratio for chunk sizing, which severely underestimates CJK (Chinese/Japanese/Korean) text where each character is roughly 1 token. This causes oversized chunks for CJK users, degrading vector search quality and wasting context window space. Changes: - Add shared src/utils/cjk-chars.ts module with CJK-aware character counting (estimateStringChars) and token estimation helpers - Update chunkMarkdown() in src/memory/internal.ts to use weighted character lengths for chunk boundary decisions and overlap calculation - Replace hardcoded estimateTokensFromChars in the context report command with the shared utility - Add 13 unit tests for the CJK estimation module and 5 new tests for CJK-aware memory chunking behavior Backward compatible: pure ASCII/Latin text behavior is unchanged. Closes #39965 Related: #40216	2026-03-29 10:22:43 +09:00
Peter Steinberger	4c27c90fc2	refactor: finish moving provider runtime into extensions	2026-03-27 05:38:58 +00:00
Peter Steinberger	64bf80d4d5	refactor: move provider runtime into extensions	2026-03-27 05:38:58 +00:00
Peter Steinberger	eebce9e9c7	refactor: move memory host into sdk package	2026-03-27 04:12:04 +00:00
Peter Steinberger	bd6c7969ea	refactor: extract memory host sdk package	2026-03-27 02:49:33 +00:00
Peter Steinberger	7695b4842b	chore: bump version to 2026.2.12	2026-02-12 18:20:46 +01:00
Peter Steinberger	1872d0c592	chore: bump version to 2026.2.10	2026-02-11 11:27:23 +01:00
cpojer	6fb2d3d7d7	feat: remove slop.	2026-02-03 22:04:17 +09:00
cpojer	8cab78abbc	chore: Run `pnpm format:fix`.	2026-01-31 21:13:13 +09:00
Peter Steinberger	9a7160786a	refactor: rename to openclaw	2026-01-30 03:16:21 +01:00
Peter Steinberger	6d16a658e5	refactor: rename clawdbot to moltbot with legacy compat	2026-01-27 12:21:02 +00:00
Peter Steinberger	72fea5e305	chore: bump version to 2026.1.26	2026-01-27 09:10:47 +00:00

16 Commits
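
The CJK chunking commits above (971ecabe80, a5147d4d88, f8547fcae4) describe three ideas: counting by code point rather than UTF-16 code unit, weighting CJK characters at roughly one token each instead of the fixed 4:1 chars-to-tokens ratio, and extending a slice by one code unit when its boundary falls on a high surrogate. The sketch below illustrates those ideas only. It is not the repository's `src/utils/cjk-chars.ts`; the names `isCjkCodePoint`, `estimateTokens`, and `splitWithoutBreakingPairs` and the exact Unicode ranges are assumptions made for illustration, and it assumes an ES2015 or later target.

```typescript
// Hypothetical sketch, not the repository's implementation.
// The ranges below are an assumed, simplified notion of "CJK".
function isCjkCodePoint(cp: number): boolean {
  return (
    (cp >= 0x3040 && cp <= 0x30ff) || // Hiragana and Katakana
    (cp >= 0x3400 && cp <= 0x4dbf) || // CJK Unified Ideographs Extension A
    (cp >= 0x4e00 && cp <= 0x9fff) || // CJK Unified Ideographs
    (cp >= 0xac00 && cp <= 0xd7af) || // Hangul syllables
    (cp >= 0x20000 && cp <= 0x3ffff) // Extension B and later (surrogate pairs in UTF-16)
  );
}

// Estimate tokens with a 4:1 chars-to-tokens ratio for non-CJK text and
// roughly one token per CJK character. Iterating with for...of walks code
// points, so a surrogate pair is counted once rather than twice.
function estimateTokens(text: string): number {
  let weightedChars = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0) ?? 0;
    weightedChars += isCjkCodePoint(cp) ? 4 : 1;
  }
  return Math.ceil(weightedChars / 4);
}

// Split a line into pieces of about `step` UTF-16 code units. If a boundary
// would fall right after a high surrogate (0xD800-0xDBFF), extend the slice
// by one code unit so the pair stays intact.
function splitWithoutBreakingPairs(line: string, step: number): string[] {
  if (step < 1) {
    throw new RangeError("step must be at least 1");
  }
  const parts: string[] = [];
  let start = 0;
  while (start < line.length) {
    let end = Math.min(start + step, line.length);
    const last = line.charCodeAt(end - 1);
    if (end < line.length && last >= 0xd800 && last <= 0xdbff) {
      end += 1; // keep the low surrogate with its high surrogate
    }
    parts.push(line.slice(start, end));
    start = end;
  }
  return parts;
}
```

Under these assumptions, eight ASCII letters estimate to 2 tokens and four Han characters estimate to 4. Splitting three U+20000 characters with an odd step of 3 yields one piece of two characters and one piece of one character, with no lone surrogate in either.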