* fix: wire memorySearch.extraPaths to QMD indexing
The 'agents.defaults.memorySearch.extraPaths' config field was documented
to add extra directories to the memory index, but the paths were never
actually passed to the QMD backend. Only 'memory.qmd.paths' worked.
This fix reads extraPaths from the memorySearch config and maps them
to QMD custom path collections, so users can simply configure:
memorySearch:
extraPaths:
- odd-vault
- /Users/odd/workspace
- /Users/odd/docs
And have those directories indexed alongside the default memory files.
Closes#57302
* fix: handle per-agent memorySearch.extraPaths overrides + add tests
- Read per-agent overrides from agents.list[].memorySearch.extraPaths
- Agent-specific overrides take priority over defaults
- Falls back to defaults when agent has no overrides
- Added 3 test cases for the feature
* fix: merge defaults + agent overrides instead of replacing
* fix: remove any types from tests, fix merge behavior assertion
* fix(memory): merge qmd extra path collections
* fix(memory): normalize qmd extra path resolution
* fix(memory): type qmd extra path merge
---------
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
When re-splitting CJK-heavy segments at chunking.tokens, check whether the
slice boundary falls on a high surrogate (0xD800–0xDBFF) and if so extend
by one code unit to keep the pair intact. Prevents producing broken
surrogate halves for CJK Extension B+ characters (U+20000+).
Add test verifying no lone surrogates appear when splitting lines of
surrogate-pair characters with an odd token budget.
Addresses third-round Codex P2 review comment.
- Two-pass line splitting: first slice at maxChars (unchanged for Latin),
then re-split only CJK-heavy segments at chunking.tokens. This preserves
the original ~800-char segments for ASCII lines while keeping CJK chunks
within the token budget.
- Narrow surrogate-pair adjustment to CJK Extension B+ range (D840–D87E)
only, so emoji surrogate pairs are not affected. Mixed CJK+emoji text
is now handled consistently regardless of composition.
- Add tests: emoji handling (2), Latin backward-compat long-line (1).
Addresses Codex P1 (oversized CJK segments) and P2s (Latin over-splitting,
emoji surrogate inconsistency).
- Use code-point length instead of UTF-16 length in estimateStringChars()
so that CJK Extension B+ surrogate pairs (U+20000+) are counted as 1
character, not 2 (fixes ~25% overestimate for rare characters).
- Change long-line split step from maxChars to chunking.tokens so that
CJK lines are sliced into token-budget-sized segments instead of
char-budget-sized segments that produce ~4x oversized chunks.
- Add tests for both fixes: surrogate-pair handling and long CJK line
splitting.
Addresses review feedback from Greptile and Codex bots.
The QMD memory system uses a fixed 4:1 chars-to-tokens ratio for chunk
sizing, which severely underestimates CJK (Chinese/Japanese/Korean) text
where each character is roughly 1 token. This causes oversized chunks for
CJK users, degrading vector search quality and wasting context window space.
Changes:
- Add shared src/utils/cjk-chars.ts module with CJK-aware character
counting (estimateStringChars) and token estimation helpers
- Update chunkMarkdown() in src/memory/internal.ts to use weighted
character lengths for chunk boundary decisions and overlap calculation
- Replace hardcoded estimateTokensFromChars in the context report
command with the shared utility
- Add 13 unit tests for the CJK estimation module and 5 new tests for
CJK-aware memory chunking behavior
Backward compatible: pure ASCII/Latin text behavior is unchanged.
Closes#39965
Related: #40216