Commit Graph

86 Commits

Author SHA1 Message Date
jnMetaCode 7332e6d609
fix(failover): classify HTTP 422 as format and OpenRouter credits as billing (#43823)
Merged via squash.

Prepared head SHA: 4f48e977fe
Co-authored-by: jnMetaCode <12096460+jnMetaCode@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-13 00:50:28 +03:00
bwjoke fd568c4f74
fix(failover): classify ZenMux quota-refresh 402 as rate_limit (#43917)
Merged via squash.

Prepared head SHA: 1d58a36a77
Co-authored-by: bwjoke <1284814+bwjoke@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-13 00:06:43 +03:00
rabsef-bicrym ff47876e61
fix: carry observed overflow token counts into compaction (#40357)
Merged via squash.

Prepared head SHA: b99eed4329
Co-authored-by: rabsef-bicrym <52549148+rabsef-bicrym@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman
2026-03-12 06:58:42 -07:00
ademczuk 58634c9c65
fix(agents): check billing errors before context overflow heuristics (#40409)
Merged via squash.

Prepared head SHA: c88f89c462
Co-authored-by: ademczuk <5212682+ademczuk@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-11 21:08:55 +03:00
CryUshio 8bf64f219a
fix: recognize Poe 402 'used up your points' as billing for fallback (#42278)
Merged via squash.

Prepared head SHA: f3cdfa76dd
Co-authored-by: CryUshio <30655354+CryUshio@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 20:17:36 +03:00
alan blount c9a6c542ef
Add HTTP 499 to transient error codes for model fallback (#41468)
Merged via squash.

Prepared head SHA: 0053bae140
Co-authored-by: zeroasterisk <23422+zeroasterisk@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 01:55:10 +03:00
gambletan 8a20f51460
fix: add rate limit patterns for 'too many tokens' and 'tokens per day' (#39377)
Merged via squash.

Prepared head SHA: 132a457286
Co-authored-by: gambletan <266203672+gambletan@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-08 13:03:33 +03:00
Peter Lee 92648f9ba9
fix(agents): broaden 402 temporary-limit detection and allow billing cooldown probe (#38533)
Merged via squash.

Prepared head SHA: 282b9186c6
Co-authored-by: xialonglee <22994703+xialonglee@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-08 10:27:01 +03:00
Altay 6e962d8b9e
fix(agents): handle overloaded failover separately (#38301)
* fix(agents): skip auth-profile failure on overload

* fix(agents): note overload auth-profile fallback fix

* fix(agents): classify overloaded failures separately

* fix(agents): back off before overload failover

* fix(agents): tighten overload probe and backoff state

* fix(agents): persist overloaded cooldown across runs

* fix(agents): tighten overloaded status handling

* test(agents): add overload regression coverage

* fix(agents): restore runner imports after rebase

* test(agents): add overload fallback integration coverage

* fix(agents): harden overloaded failover abort handling

* test(agents): tighten overload classifier coverage

* test(agents): cover all-overloaded fallback exhaustion

* fix(cron): retry overloaded fallback summaries

* fix(cron): treat HTTP 529 as overloaded retry
2026-03-07 01:42:11 +03:00
Xinhua Gu 01b20172b8
fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484) (#36802)
* fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484)

Some providers (notably Anthropic Claude Max plan) surface temporary
usage/rate-limit failures as HTTP 402 instead of 429. Before this change,
all 402s were unconditionally mapped to 'billing', which produced a
misleading 'run out of credits' warning for Max plan users who simply
hit their usage window.

This follows the same pattern introduced for HTTP 400 in #36783: check
the error message for an explicit rate-limit signal before falling back
to the default status-code classification.

- classifyFailoverReasonFromHttpStatus now returns 'rate_limit' for 402
  when isRateLimitErrorMessage matches the payload text
- Added regression tests covering both the rate-limit and billing paths
  on 402

* fix: narrow 402 rate-limit matcher to prevent billing misclassification

The original implementation used isRateLimitErrorMessage(), which matches
phrases like 'quota exceeded' that legitimately appear in billing errors.

This commit replaces it with a narrow, 402-specific matcher that requires
BOTH retry language (try again/retry/temporary/cooldown) AND limit
terminology (usage limit/rate limit/organization usage).

Prevents misclassification of errors like:
'HTTP 402: exceeded quota, please add credits' -> billing (not rate_limit)

Added regression test for the ambiguous case.

---------

Co-authored-by: Val Alexander <bunsthedev@gmail.com>
2026-03-06 03:45:36 -06:00
zhouhe-xydt a65d70f84b
Fix failover for zhipuai 1310 Weekly/Monthly Limit Exhausted (#33813)
Merged via squash.

Prepared head SHA: 3dc441e58d
Co-authored-by: zhouhe-xydt <265407618+zhouhe-xydt@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-06 12:04:09 +03:00
Altay 49acb07f9f
fix(agents): classify insufficient_quota 400s as billing (#36783) 2026-03-06 01:17:48 +03:00
Altay f014e255df
refactor(agents): share failover HTTP status classification (#36615)
* fix(agents): classify transient failover statuses consistently

* fix(agents): preserve legacy failover status mapping
2026-03-05 23:50:36 +03:00
Kai 60a6d11116
fix(embedded): classify model_context_window_exceeded as context overflow, trigger compaction (#35934)
Merged via squash.

Prepared head SHA: 20fa77289c
Co-authored-by: RealKai42 <44634134+RealKai42@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman
2026-03-05 11:30:24 -08:00
Peter Steinberger 6472e03412 refactor(agents): share failover error matchers 2026-03-03 02:51:00 +00:00
AI南柯(KingMo) 30ab9b2068
fix(agents): recognize connection errors as retryable timeout failures (#31697)
* fix(agents): recognize connection errors as retryable timeout failures

## Problem

When a model endpoint becomes unreachable (e.g., local proxy down,
relay server offline), the failover system fails to switch to the
next candidate model. Errors like "Connection error." are not
classified as retryable, causing the session to hang on a broken
endpoint instead of falling back to healthy alternatives.

## Root Cause

Connection/network errors are not recognized by the current failover
classifier:
- Text patterns like "Connection error.", "fetch failed", "network error"
- Error codes like ECONNREFUSED, ENOTFOUND, EAI_AGAIN (in message text)

While `failover-error.ts` handles these as error codes (err.code),
it misses them when they appear as plain text in error messages.

## Solution

Extend timeout error patterns to include connection/network failures:

**In `errors.ts` (ERROR_PATTERNS.timeout):**
- Text: "connection error", "network error", "fetch failed", etc.
- Regex: /\beconn(?:refused|reset|aborted)\b/i, /\benotfound\b/i, /\beai_again\b/i

**In `failover-error.ts` (TIMEOUT_HINT_RE):**
- Same patterns for non-assistant error paths

## Testing

Added test cases covering:
- "Connection error."
- "fetch failed"
- "network error: ECONNREFUSED"
- "ENOTFOUND" / "EAI_AGAIN" in message text

## Impact

- **Compatibility:** High - only expands retryable error detection
- **Behavior:** Connection failures now trigger automatic fallback
- **Risk:** Low - changes are additive and well-tested

* style: fix code formatting for test file
2026-03-03 02:37:23 +00:00
Peter Steinberger 1bd20dbdb6 fix(failover): treat stop reason error as timeout 2026-03-03 01:05:24 +00:00
Peter Steinberger a2fdc3415f fix(failover): handle unhandled stop reason error 2026-03-03 01:05:24 +00:00
Sid 40e078a567
fix(auth): classify permission_error as auth_permanent for profile fallback (#31324)
When an OAuth auth profile returns HTTP 403 with permission_error
(e.g. expired plan), the error was not matched by the authPermanent
patterns. This caused the profile to receive only a short cooldown
instead of being disabled, so the gateway kept retrying the same
broken profile indefinitely.

Add "permission_error" and "not allowed for this organization" to
the authPermanent error patterns so these errors trigger the longer
billing/auth_permanent disable window and proper profile rotation.

Closes #31306

Made-with: Cursor

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-03-01 22:26:05 -08:00
Frank Yang ed86252aa5
fix: handle CLI session expired errors gracefully instead of crashing gateway (#31090)
* fix: handle CLI session expired errors gracefully

- Add session_expired to FailoverReason type
- Add isCliSessionExpiredErrorMessage to detect expired CLI sessions
- Modify runCliAgent to retry with new session when session expires
- Update agentCommand to clear expired session IDs from session store
- Add proper error handling to prevent gateway crashes on expired sessions

Fixes #30986

* fix: add session_expired to AuthProfileFailureReason and missing log import

* fix: type cli-runner usage field to match EmbeddedPiAgentMeta

* fix: harden CLI session-expiry recovery handling

* build: regenerate host env security policy swift

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-03-02 01:11:05 +00:00
Peter Steinberger 250f9e15f5 fix(agents): land #31007 from @HOYALIM
Co-authored-by: Ho Lim <subhoya@gmail.com>
2026-03-02 01:06:00 +00:00
Aleksandrs Tihenko c0026274d9
fix(auth): distinguish revoked API keys from transient auth errors (#25754)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 8f9c07a200
Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-25 19:47:16 -05:00
Peter Steinberger d2597d5ecf fix(agents): harden model fallback failover paths 2026-02-25 03:46:34 +00:00
Peter Steinberger 43f318cd9a fix(agents): reduce billing false positives on long text (#25680)
Land PR #25680 from @lairtonlelis.
Retain explicit status/code/http 402 detection for oversized structured payloads.

Co-authored-by: Ailton <lairton@telnyx.com>
2026-02-25 01:22:17 +00:00
Peter Machona 9ced64054f
fix(auth): classify missing OAuth scopes as auth failures (#24761) 2026-02-24 03:33:44 +00:00
Clawborn 544809b6f6
Add Chinese context overflow patterns to isContextOverflowError (#22855)
Proxy providers returning Chinese error messages (e.g. Chinese LLM
gateways) use patterns like '上下文过长' or '上下文超出' that are not
matched by the existing English-only patterns in isContextOverflowError.
This prevents auto-compaction from triggering, leaving the session stuck.

Add the most common Chinese proxy patterns:
- 上下文过长 (context too long)
- 上下文超出 (context exceeded)
- 上下文长度超 (context length exceeds)
- 超出最大上下文 (exceeds maximum context)
- 请压缩上下文 (please compress context)

Chinese characters are unaffected by toLowerCase() so check the
original message directly.

Closes #22849
2026-02-23 10:54:24 -05:00
Vincent Koc 4f340b8812
fix(agents): avoid classifying reasoning-required errors as context overflow (#24593)
* Agents: exclude reasoning-required errors from overflow detection

* Tests: cover reasoning-required overflow classification guard

* Tests: format reasoning-required endpoint errors
2026-02-23 10:38:49 -05:00
Alice Losasso 652099cd5c
fix: correctly identify Groq TPM limits as rate limits instead of context overflow (#16176)
Co-authored-by: Howard <dddabtc@users.noreply.github.com>
2026-02-23 10:32:53 -05:00
青雲 69692d0d3a
fix: detect additional context overflow error patterns to prevent leak to user (#20539)
* fix: detect additional context overflow error patterns to prevent leak to user

Fixes #9951

The error 'input length and max_tokens exceed context limit: 170636 +
34048 > 200000' was not caught by isContextOverflowError() and leaked
to users via formatAssistantErrorText()'s invalidRequest fallback.

Add three new patterns to isContextOverflowError():
- 'exceed context limit' (direct match)
- 'exceeds the model\'s maximum context'
- max_tokens/input length + exceed + context (compound match)

These are now rewritten to the friendly context overflow message.

* Overflow: add regression tests and changelog credits

* Update CHANGELOG.md

* Update pi-embedded-helpers.isbillingerrormessage.test.ts

---------

Co-authored-by: echoVic <AkiraVic@outlook.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-23 10:03:56 -05:00
Peter Steinberger 9bd04849ed fix(agents): detect Kimi model-token-limit overflows
Co-authored-by: Danilo Falcão <danilo@falcao.org>
2026-02-23 12:44:23 +00:00
taw0002 3c57bf4c85
fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017)
* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.

Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).

Fixes #20999

* Changelog: add failover 502/503/504 note with credits

* Failover: classify HTTP 504 as transient in message parser

* Changelog: credit taw0002 and vincentkoc for failover fix

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-23 03:01:57 -05:00
青雲 3dfee78d72
fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595) (#23698)
* fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595)

Mistral requires tool call IDs to be exactly 9 alphanumeric characters
([a-zA-Z0-9]{9}). The existing sanitizeToolCallIdsForCloudCodeAssist
mechanism only ran on historical messages at attempt start via
sanitizeSessionHistory, but the pi-agent-core agent loop's internal
tool call → tool result cycles bypassed that path entirely.

Changes:
- Wrap streamFn (like dropThinkingBlocks) so every outbound request
  sees sanitized tool call IDs when the transcript policy requires it
- Replace call_${Date.now()} in pendingToolCalls with a 9-char hex ID
  generated from crypto.randomBytes
- Add Mistral tool call ID error pattern to ERROR_PATTERNS.format so
  the error is correctly classified for retry/rotation

* Changelog: document Mistral strict9 tool-call ID fix

---------

Co-authored-by: echoVic <AkiraVic@outlook.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-22 13:37:12 -05:00
Vignesh Natarajan 35fe33aa90 Agents: classify Anthropic api_error internal server failures for fallback 2026-02-21 19:22:16 -08:00
Harry Cui Kepler ffa63173e0
refactor(agents): migrate console.warn/error/info to subsystem logger (#22906)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: a806c4cb27
Co-authored-by: Kepler2024 <166882517+Kepler2024@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-21 17:11:47 -05:00
niceysam 5e423b596c
fix: remove false-positive billing error rewrite on normal assistant text (openclaw#17834) thanks @niceysam
Verified:
- pnpm install --frozen-lockfile
- pnpm build
- pnpm check
- pnpm test:macmini

Co-authored-by: niceysam <256747835+niceysam@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-21 12:17:39 -06:00
mudrii 7ecfc1d93c
fix(auth): bidirectional mode/type compat + sync OAuth to all agents (#12692)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 2dee8e1174
Co-authored-by: mudrii <220262+mudrii@users.noreply.github.com>
Co-authored-by: obviyus <22031114+obviyus@users.noreply.github.com>
Reviewed-by: @obviyus
2026-02-20 16:01:09 +05:30
Protocol Zero 2af3415fac
fix: treat HTTP 503 as failover-eligible for LLM provider errors (#21086)
* fix: treat HTTP 503 as failover-eligible for LLM provider errors

When LLM SDKs wrap 503 responses, the leading "503" prefix is lost
(e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a
numeric prefix). The existing isTransientHttpError only matches
messages starting with "503 ...", so these wrapped errors silently
skip failover — no profile rotation, no model fallback.

This patch closes that gap:

- resolveFailoverReasonFromError: map HTTP status 503 → rate_limit
  (covers structured error objects with a status field)
- ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable",
  "high demand" (covers message-only classification when the leading
  status prefix is absent)

Existing isTransientHttpError behavior is unchanged; these additions
are complementary and only fire for errors that previously fell
through unclassified.

* fix: address review feedback — drop /\b503\b/ pattern, add test coverage

- Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the
  semantic inconsistency noted by reviewers: `isTransientHttpError`
  already handles messages prefixed with "503" (→ "timeout"), so a
  redundant overloaded pattern would classify the same class of errors
  differently depending on message formatting.

- Keep "service unavailable" and "high demand" patterns — these are the
  real gap-fillers for SDK-rewritten messages that lack a numeric prefix.

- Add test case for JSON-wrapped 503 error body containing "overloaded"
  to strengthen coverage.

* fix: unify 503 classification — status 503 → timeout (consistent with isTransientHttpError)

resolveFailoverReasonFromError previously mapped status 503 → "rate_limit",
while the string-based isTransientHttpError mapped "503 ..." → "timeout".

Align both paths: structured {status: 503} now also returns "timeout",
matching the existing transient-error convention. Both reasons are
failover-eligible, so runtime behavior is unchanged.

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-19 12:45:09 -08:00
青雲 3d4ef56044
fix: include provider and model name in billing error message (#20510)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 40dbdf62e8
Co-authored-by: echoVic <16428813+echoVic@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-18 21:56:00 -05:00
Peter Steinberger 1934eebbf0 refactor(agents): dedupe lifecycle send assertions and stable payload stringify 2026-02-18 14:15:14 +00:00
Peter Steinberger b8b43175c5 style: align formatting with oxfmt 0.33 2026-02-18 01:34:35 +00:00
Peter Steinberger 31f9be126c style: run oxfmt and fix gate failures 2026-02-18 01:29:02 +00:00
cpojer d0cb8c19b2
chore: wtf. 2026-02-17 13:36:48 +09:00
Sebastian ed11e93cf2 chore(format) 2026-02-16 23:20:16 -05:00
cpojer 90ef2d6bdf
chore: Update formatting. 2026-02-17 09:18:40 +09:00
Daniel Sauer 12ce358da5 fix(failover): recognize 'abort' stop reason as timeout for model fallback
When streaming providers (GLM, OpenRouter, etc.) return 'stop reason: abort'
due to stream interruption, OpenClaw's failover mechanism did not recognize
this as a timeout condition. This prevented fallback models from being
triggered, leaving users with failed requests instead of graceful failover.

Changes:
- Add abort patterns to ERROR_PATTERNS.timeout in pi-embedded-helpers/errors.ts
- Extend TIMEOUT_HINT_RE regex to include abort patterns in failover-error.ts

Fixes #18453

Co-authored-by: James <james@openclaw.ai>
2026-02-16 23:49:51 +01:00
Tyler Yust b8f66c260d
Agents: add nested subagent orchestration controls and reduce subagent token waste (#14447)
* Agents: add subagent orchestration controls

* Agents: add subagent orchestration controls (WIP uncommitted changes)

* feat(subagents): add depth-based spawn gating for sub-sub-agents

* feat(subagents): tool policy, registry, and announce chain for nested agents

* feat(subagents): system prompt, docs, changelog for nested sub-agents

* fix(subagents): prevent model fallback override, show model during active runs, and block context overflow fallback

Bug 1: When a session has an explicit model override (e.g., gpt/openai-codex),
the fallback candidate logic in resolveFallbackCandidates silently appended the
global primary model (opus) as a backstop. On reinjection/steer with a transient
error, the session could fall back to opus which has a smaller context window
and crash. Fix: when storedModelOverride is set, pass fallbacksOverride ?? []
instead of undefined, preventing the implicit primary backstop.

Bug 2: Active subagents showed 'model n/a' in /subagents list because
resolveModelDisplay only read entry.model/modelProvider (populated after run
completes). Fix: fall back to modelOverride/providerOverride fields which are
populated at spawn time via sessions.patch.

Bug 3: Context overflow errors (prompt too long, context_length_exceeded) could
theoretically escape runEmbeddedPiAgent and be treated as failover candidates
in runWithModelFallback, causing a switch to a model with a smaller context
window. Fix: in runWithModelFallback, detect context overflow errors via
isLikelyContextOverflowError and rethrow them immediately instead of trying the
next model candidate.

* fix(subagents): track spawn depth in session store and fix announce routing for nested agents

* Fix compaction status tracking and dedupe overflow compaction triggers

* fix(subagents): enforce depth block via session store and implement cascade kill

* fix: inject group chat context into system prompt

* fix(subagents): always write model to session store at spawn time

* Preserve spawnDepth when agent handler rewrites session entry

* fix(subagents): suppress announce on steer-restart

* fix(subagents): fallback spawned session model to runtime default

* fix(subagents): enforce spawn depth when caller key resolves by sessionId

* feat(subagents): implement active-first ordering for numeric targets and enhance task display

- Added a test to verify that subagents with numeric targets follow an active-first list ordering.
- Updated `resolveSubagentTarget` to sort subagent runs based on active status and recent activity.
- Enhanced task display in command responses to prevent truncation of long task descriptions.
- Introduced new utility functions for compacting task text and managing subagent run states.

* fix(subagents): show model for active runs via run record fallback

When the spawned model matches the agent's default model, the session
store's override fields are intentionally cleared (isDefault: true).
The model/modelProvider fields are only populated after the run
completes. This left active subagents showing 'model n/a'.

Fix: store the resolved model on SubagentRunRecord at registration
time, and use it as a fallback in both display paths (subagents tool
and /subagents command) when the session store entry has no model info.

Changes:
- SubagentRunRecord: add optional model field
- registerSubagentRun: accept and persist model param
- sessions-spawn-tool: pass resolvedModel to registerSubagentRun
- subagents-tool: pass run record model as fallback to resolveModelDisplay
- commands-subagents: pass run record model as fallback to resolveModelDisplay

* feat(chat): implement session key resolution and reset on sidebar navigation

- Added functions to resolve the main session key and reset chat state when switching sessions from the sidebar.
- Updated the `renderTab` function to handle session key changes when navigating to the chat tab.
- Introduced a test to verify that the session resets to "main" when opening chat from the sidebar navigation.

* fix: subagent timeout=0 passthrough and fallback prompt duplication

Bug 1: runTimeoutSeconds=0 now means 'no timeout' instead of applying 600s default
- sessions-spawn-tool: default to undefined (not 0) when neither timeout param
  is provided; use != null check so explicit 0 passes through to gateway
- agent.ts: accept 0 as valid timeout (resolveAgentTimeoutMs already handles
  0 → MAX_SAFE_TIMEOUT_MS)

Bug 2: model fallback no longer re-injects the original prompt as a duplicate
- agent.ts: track fallback attempt index; on retries use a short continuation
  message instead of the full original prompt since the session file already
  contains it from the first attempt
- Also skip re-sending images on fallback retries (already in session)

* feat(subagents): truncate long task descriptions in subagents command output

- Introduced a new utility function to format task previews, limiting their length to improve readability.
- Updated the command handler to use the new formatting function, ensuring task descriptions are truncated appropriately.
- Adjusted related tests to verify that long task descriptions are now truncated in the output.

* refactor(subagents): update subagent registry path resolution and improve command output formatting

- Replaced direct import of STATE_DIR with a utility function to resolve the state directory dynamically.
- Enhanced the formatting of command output for active and recent subagents, adding separators for better readability.
- Updated related tests to reflect changes in command output structure.

* fix(subagent): default sessions_spawn to no timeout when runTimeoutSeconds omitted

The previous fix (75a791106) correctly handled the case where
runTimeoutSeconds was explicitly set to 0 ("no timeout"). However,
when models omit the parameter entirely (which is common since the
schema marks it as optional), runTimeoutSeconds resolved to undefined.

undefined flowed through the chain as:
  sessions_spawn → timeout: undefined (since undefined != null is false)
  → gateway agent handler → agentCommand opts.timeout: undefined
  → resolveAgentTimeoutMs({ overrideSeconds: undefined })
  → DEFAULT_AGENT_TIMEOUT_SECONDS (600s = 10 minutes)

This caused subagents to be killed at exactly 10 minutes even though
the user's intent (via TOOLS.md) was for subagents to run without a
timeout.

Fix: default runTimeoutSeconds to 0 (no timeout) when neither
runTimeoutSeconds nor timeoutSeconds is provided by the caller.
Subagent spawns are long-running by design and should not inherit the
600s agent-command default timeout.

* fix(subagent): accept timeout=0 in agent-via-gateway path (second 600s default)

* fix: thread timeout override through getReplyFromConfig dispatch path

getReplyFromConfig called resolveAgentTimeoutMs({ cfg }) with no override,
always falling back to the config default (600s). Add timeoutOverrideSeconds
to GetReplyOptions and pass it through as overrideSeconds so callers of the
dispatch chain can specify a custom timeout (0 = no timeout).

This complements the existing timeout threading in agentCommand and the
cron isolated-agent runner, which already pass overrideSeconds correctly.

* feat(model-fallback): normalize OpenAI Codex model references and enhance fallback handling

- Added normalization for OpenAI Codex model references, specifically converting "gpt-5.3-codex" to "openai-codex" before execution.
- Updated the `resolveFallbackCandidates` function to utilize the new normalization logic.
- Enhanced tests to verify the correct behavior of model normalization and fallback mechanisms.
- Introduced a new test case to ensure that the normalization process works as expected for various input formats.

* feat(tests): add unit tests for steer failure behavior in openclaw-tools

- Introduced a new test file to validate the behavior of subagents when steer replacement dispatch fails.
- Implemented tests to ensure that the announce behavior is restored correctly and that the suppression reason is cleared as expected.
- Enhanced the subagent registry with a new function to clear steer restart suppression.
- Updated related components to support the new test scenarios.

* fix(subagents): replace stop command with kill in slash commands and documentation

- Updated the `/subagents` command to replace `stop` with `kill` for consistency in controlling sub-agent runs.
- Modified related documentation to reflect the change in command usage.
- Removed legacy timeoutSeconds references from the sessions-spawn-tool schema and tests to streamline timeout handling.
- Enhanced tests to ensure correct behavior of the updated commands and their interactions.

* feat(tests): add unit tests for readLatestAssistantReply function

- Introduced a new test file for the `readLatestAssistantReply` function to validate its behavior with various message scenarios.
- Implemented tests to ensure the function correctly retrieves the latest assistant message and handles cases where the latest message has no text.
- Mocked the gateway call to simulate different message histories for comprehensive testing.

* feat(tests): enhance subagent kill-all cascade tests and announce formatting

- Added a new test to verify that the `kill-all` command cascades through ended parents to active descendants in subagents.
- Updated the subagent announce formatting tests to reflect changes in message structure, including the replacement of "Findings:" with "Result:" and the addition of new expectations for message content.
- Improved the handling of long findings and stats in the announce formatting logic to ensure concise output.
- Refactored related functions to enhance clarity and maintainability in the subagent registry and tools.

* refactor(subagent): update announce formatting and remove unused constants

- Modified the subagent announce formatting to replace "Findings:" with "Result:" and adjusted related expectations in tests.
- Removed constants for maximum announce findings characters and summary words, simplifying the announcement logic.
- Updated the handling of findings to retain full content instead of truncating, ensuring more informative outputs.
- Cleaned up unused imports in the commands-subagents file to enhance code clarity.

* feat(tests): enhance billing error handling in user-facing text

- Added tests to ensure that normal text mentioning billing plans is not rewritten, preserving user context.
- Updated the `isBillingErrorMessage` and `sanitizeUserFacingText` functions to improve handling of billing-related messages.
- Introduced new test cases for various scenarios involving billing messages to ensure accurate processing and output.
- Enhanced the subagent announce flow to correctly manage active descendant runs, preventing premature announcements.

* feat(subagent): enhance workflow guidance and auto-announcement clarity

- Added a new guideline in the subagent system prompt to emphasize trust in push-based completion, discouraging busy polling for status updates.
- Updated documentation to clarify that sub-agents will automatically announce their results, improving user understanding of the workflow.
- Enhanced tests to verify the new guidance on avoiding polling loops and to ensure the accuracy of the updated prompts.

* fix(cron): avoid announcing interim subagent spawn acks

* chore: clean post-rebase imports

* fix(cron): fall back to child replies when parent stays interim

* fix(subagents): make active-run guidance advisory

* fix(subagents): update announce flow to handle active descendants and enhance test coverage

- Modified the announce flow to defer announcements when active descendant runs are present, ensuring accurate status reporting.
- Updated tests to verify the new behavior, including scenarios where no fallback requester is available and ensuring proper handling of finished subagents.
- Enhanced the announce formatting to include an `expectFinal` flag for better clarity in the announcement process.

* fix(subagents): enhance announce flow and formatting for user updates

- Updated the announce flow to provide clearer instructions for user updates based on active subagent runs and requester context.
- Refactored the announcement logic to improve clarity and ensure internal context remains private.
- Enhanced tests to verify the new message expectations and formatting, including updated prompts for user-facing updates.
- Introduced a new function to build reply instructions based on session context, improving the overall announcement process.

* fix: resolve prep blockers and changelog placement (#14447) (thanks @tyler6204)

* fix: restore cron delivery-plan import after rebase (#14447) (thanks @tyler6204)

* fix: resolve test failures from rebase conflicts (#14447) (thanks @tyler6204)

* fix: apply formatting after rebase (#14447) (thanks @tyler6204)
2026-02-14 22:03:45 -08:00
Vignesh Natarajan eb846c95bf fix (agents): classify empty-chunk stream failures as timeout 2026-02-14 18:54:03 -08:00
Peter Steinberger d714ac7797
refactor(agents): dedupe transient error copy (#16324) 2026-02-14 17:49:25 +01:00
Vincent 478af81706
Return user-facing message if API reuturn 429 API rate limit reached #2202 (#10415)
* Return user-facing message if API reuturn 429 API rate limit reached

* clarify the error message

* fix(agents): improve 429 user messaging (#10415) (thanks @vincenthsin)

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-02-14 17:40:02 +01:00
Peter Steinberger 50a6e0e69e
fix: strip leading empty lines in sanitizeUserFacingText (#16280)
* fix: strip leading empty lines in sanitizeUserFacingText (#16158) (thanks @mcinteerj)

* fix: strip leading empty lines in sanitizeUserFacingText (#16158) (thanks @mcinteerj)

* fix: strip leading empty lines in sanitizeUserFacingText (#16158) (thanks @mcinteerj)
2026-02-14 16:34:02 +01:00