When re-splitting CJK-heavy segments at chunking.tokens, check whether the
slice boundary falls on a high surrogate (0xD800–0xDBFF) and if so extend
by one code unit to keep the pair intact. Prevents producing broken
surrogate halves for CJK Extension B+ characters (U+20000+).
Add test verifying no lone surrogates appear when splitting lines of
surrogate-pair characters with an odd token budget.
Addresses third-round Codex P2 review comment.
- Two-pass line splitting: first slice at maxChars (unchanged for Latin),
then re-split only CJK-heavy segments at chunking.tokens. This preserves
the original ~800-char segments for ASCII lines while keeping CJK chunks
within the token budget.
- Narrow surrogate-pair adjustment to CJK Extension B+ range (D840–D87E)
only, so emoji surrogate pairs are not affected. Mixed CJK+emoji text
is now handled consistently regardless of composition.
- Add tests: emoji handling (2), Latin backward-compat long-line (1).
Addresses Codex P1 (oversized CJK segments) and P2s (Latin over-splitting,
emoji surrogate inconsistency).
- Use code-point length instead of UTF-16 length in estimateStringChars()
so that CJK Extension B+ surrogate pairs (U+20000+) are counted as 1
character, not 2 (fixes ~25% overestimate for rare characters).
- Change long-line split step from maxChars to chunking.tokens so that
CJK lines are sliced into token-budget-sized segments instead of
char-budget-sized segments that produce ~4x oversized chunks.
- Add tests for both fixes: surrogate-pair handling and long CJK line
splitting.
Addresses review feedback from Greptile and Codex bots.
The QMD memory system uses a fixed 4:1 chars-to-tokens ratio for chunk
sizing, which severely underestimates CJK (Chinese/Japanese/Korean) text
where each character is roughly 1 token. This causes oversized chunks for
CJK users, degrading vector search quality and wasting context window space.
Changes:
- Add shared src/utils/cjk-chars.ts module with CJK-aware character
counting (estimateStringChars) and token estimation helpers
- Update chunkMarkdown() in src/memory/internal.ts to use weighted
character lengths for chunk boundary decisions and overlap calculation
- Replace hardcoded estimateTokensFromChars in the context report
command with the shared utility
- Add 13 unit tests for the CJK estimation module and 5 new tests for
CJK-aware memory chunking behavior
Backward compatible: pure ASCII/Latin text behavior is unchanged.
Closes#39965
Related: #40216