openclaw

Commit Graph

Author	SHA1	Message	Date
jnMetaCode	7332e6d609	fix(failover): classify HTTP 422 as format and OpenRouter credits as billing (#43823 ) Merged via squash. Prepared head SHA: `4f48e977fe` Co-authored-by: jnMetaCode <12096460+jnMetaCode@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-13 00:50:28 +03:00
bwjoke	fd568c4f74	fix(failover): classify ZenMux quota-refresh 402 as rate_limit (#43917 ) Merged via squash. Prepared head SHA: `1d58a36a77` Co-authored-by: bwjoke <1284814+bwjoke@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-13 00:06:43 +03:00
rabsef-bicrym	ff47876e61	fix: carry observed overflow token counts into compaction (#40357 ) Merged via squash. Prepared head SHA: `b99eed4329` Co-authored-by: rabsef-bicrym <52549148+rabsef-bicrym@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman	2026-03-12 06:58:42 -07:00
ademczuk	58634c9c65	fix(agents): check billing errors before context overflow heuristics (#40409 ) Merged via squash. Prepared head SHA: `c88f89c462` Co-authored-by: ademczuk <5212682+ademczuk@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-11 21:08:55 +03:00
CryUshio	8bf64f219a	fix: recognize Poe 402 'used up your points' as billing for fallback (#42278 ) Merged via squash. Prepared head SHA: `f3cdfa76dd` Co-authored-by: CryUshio <30655354+CryUshio@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-10 20:17:36 +03:00
alan blount	c9a6c542ef	Add HTTP 499 to transient error codes for model fallback (#41468 ) Merged via squash. Prepared head SHA: `0053bae140` Co-authored-by: zeroasterisk <23422+zeroasterisk@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-10 01:55:10 +03:00
gambletan	8a20f51460	fix: add rate limit patterns for 'too many tokens' and 'tokens per day' (#39377 ) Merged via squash. Prepared head SHA: `132a457286` Co-authored-by: gambletan <266203672+gambletan@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-08 13:03:33 +03:00
Peter Lee	92648f9ba9	fix(agents): broaden 402 temporary-limit detection and allow billing cooldown probe (#38533 ) Merged via squash. Prepared head SHA: `282b9186c6` Co-authored-by: xialonglee <22994703+xialonglee@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-08 10:27:01 +03:00
Altay	6e962d8b9e	fix(agents): handle overloaded failover separately (#38301 ) * fix(agents): skip auth-profile failure on overload * fix(agents): note overload auth-profile fallback fix * fix(agents): classify overloaded failures separately * fix(agents): back off before overload failover * fix(agents): tighten overload probe and backoff state * fix(agents): persist overloaded cooldown across runs * fix(agents): tighten overloaded status handling * test(agents): add overload regression coverage * fix(agents): restore runner imports after rebase * test(agents): add overload fallback integration coverage * fix(agents): harden overloaded failover abort handling * test(agents): tighten overload classifier coverage * test(agents): cover all-overloaded fallback exhaustion * fix(cron): retry overloaded fallback summaries * fix(cron): treat HTTP 529 as overloaded retry	2026-03-07 01:42:11 +03:00
Xinhua Gu	01b20172b8	fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484 ) (#36802 ) * fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484) Some providers (notably Anthropic Claude Max plan) surface temporary usage/rate-limit failures as HTTP 402 instead of 429. Before this change, all 402s were unconditionally mapped to 'billing', which produced a misleading 'run out of credits' warning for Max plan users who simply hit their usage window. This follows the same pattern introduced for HTTP 400 in #36783: check the error message for an explicit rate-limit signal before falling back to the default status-code classification. - classifyFailoverReasonFromHttpStatus now returns 'rate_limit' for 402 when isRateLimitErrorMessage matches the payload text - Added regression tests covering both the rate-limit and billing paths on 402 * fix: narrow 402 rate-limit matcher to prevent billing misclassification The original implementation used isRateLimitErrorMessage(), which matches phrases like 'quota exceeded' that legitimately appear in billing errors. This commit replaces it with a narrow, 402-specific matcher that requires BOTH retry language (try again/retry/temporary/cooldown) AND limit terminology (usage limit/rate limit/organization usage). Prevents misclassification of errors like: 'HTTP 402: exceeded quota, please add credits' -> billing (not rate_limit) Added regression test for the ambiguous case. --------- Co-authored-by: Val Alexander <bunsthedev@gmail.com>	2026-03-06 03:45:36 -06:00
zhouhe-xydt	a65d70f84b	Fix failover for zhipuai 1310 Weekly/Monthly Limit Exhausted (#33813 ) Merged via squash. Prepared head SHA: `3dc441e58d` Co-authored-by: zhouhe-xydt <265407618+zhouhe-xydt@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-06 12:04:09 +03:00
Altay	49acb07f9f	fix(agents): classify insufficient_quota 400s as billing (#36783 )	2026-03-06 01:17:48 +03:00
Altay	f014e255df	refactor(agents): share failover HTTP status classification (#36615 ) * fix(agents): classify transient failover statuses consistently * fix(agents): preserve legacy failover status mapping	2026-03-05 23:50:36 +03:00
Kai	60a6d11116	fix(embedded): classify model_context_window_exceeded as context overflow, trigger compaction (#35934 ) Merged via squash. Prepared head SHA: `20fa77289c` Co-authored-by: RealKai42 <44634134+RealKai42@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman	2026-03-05 11:30:24 -08:00
Peter Steinberger	6472e03412	refactor(agents): share failover error matchers	2026-03-03 02:51:00 +00:00
AI南柯(KingMo)	30ab9b2068	fix(agents): recognize connection errors as retryable timeout failures (#31697 ) * fix(agents): recognize connection errors as retryable timeout failures ## Problem When a model endpoint becomes unreachable (e.g., local proxy down, relay server offline), the failover system fails to switch to the next candidate model. Errors like "Connection error." are not classified as retryable, causing the session to hang on a broken endpoint instead of falling back to healthy alternatives. ## Root Cause Connection/network errors are not recognized by the current failover classifier: - Text patterns like "Connection error.", "fetch failed", "network error" - Error codes like ECONNREFUSED, ENOTFOUND, EAI_AGAIN (in message text) While `failover-error.ts` handles these as error codes (err.code), it misses them when they appear as plain text in error messages. ## Solution Extend timeout error patterns to include connection/network failures: In `errors.ts` (ERROR_PATTERNS.timeout): - Text: "connection error", "network error", "fetch failed", etc. - Regex: /\beconn(?:refused|reset|aborted)\b/i, /\benotfound\b/i, /\beai_again\b/i In `failover-error.ts` (TIMEOUT_HINT_RE): - Same patterns for non-assistant error paths ## Testing Added test cases covering: - "Connection error." - "fetch failed" - "network error: ECONNREFUSED" - "ENOTFOUND" / "EAI_AGAIN" in message text ## Impact - Compatibility: High - only expands retryable error detection - Behavior: Connection failures now trigger automatic fallback - Risk: Low - changes are additive and well-tested * style: fix code formatting for test file	2026-03-03 02:37:23 +00:00
Peter Steinberger	1bd20dbdb6	fix(failover): treat stop reason error as timeout	2026-03-03 01:05:24 +00:00
Peter Steinberger	a2fdc3415f	fix(failover): handle unhandled stop reason error	2026-03-03 01:05:24 +00:00
Sid	40e078a567	fix(auth): classify permission_error as auth_permanent for profile fallback (#31324 ) When an OAuth auth profile returns HTTP 403 with permission_error (e.g. expired plan), the error was not matched by the authPermanent patterns. This caused the profile to receive only a short cooldown instead of being disabled, so the gateway kept retrying the same broken profile indefinitely. Add "permission_error" and "not allowed for this organization" to the authPermanent error patterns so these errors trigger the longer billing/auth_permanent disable window and proper profile rotation. Closes #31306 Made-with: Cursor Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-03-01 22:26:05 -08:00
Frank Yang	ed86252aa5	fix: handle CLI session expired errors gracefully instead of crashing gateway (#31090 ) * fix: handle CLI session expired errors gracefully - Add session_expired to FailoverReason type - Add isCliSessionExpiredErrorMessage to detect expired CLI sessions - Modify runCliAgent to retry with new session when session expires - Update agentCommand to clear expired session IDs from session store - Add proper error handling to prevent gateway crashes on expired sessions Fixes #30986 * fix: add session_expired to AuthProfileFailureReason and missing log import * fix: type cli-runner usage field to match EmbeddedPiAgentMeta * fix: harden CLI session-expiry recovery handling * build: regenerate host env security policy swift --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>	2026-03-02 01:11:05 +00:00
Peter Steinberger	250f9e15f5	fix(agents): land #31007 from @HOYALIM Co-authored-by: Ho Lim <subhoya@gmail.com>	2026-03-02 01:06:00 +00:00
Aleksandrs Tihenko	c0026274d9	fix(auth): distinguish revoked API keys from transient auth errors (#25754 ) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: `8f9c07a200` Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras	2026-02-25 19:47:16 -05:00
Peter Steinberger	d2597d5ecf	fix(agents): harden model fallback failover paths	2026-02-25 03:46:34 +00:00
Peter Steinberger	43f318cd9a	fix(agents): reduce billing false positives on long text (#25680 ) Land PR #25680 from @lairtonlelis. Retain explicit status/code/http 402 detection for oversized structured payloads. Co-authored-by: Ailton <lairton@telnyx.com>	2026-02-25 01:22:17 +00:00
Peter Machona	9ced64054f	fix(auth): classify missing OAuth scopes as auth failures (#24761 )	2026-02-24 03:33:44 +00:00
Clawborn	544809b6f6	Add Chinese context overflow patterns to isContextOverflowError (#22855 ) Proxy providers returning Chinese error messages (e.g. Chinese LLM gateways) use patterns like '上下文过长' or '上下文超出' that are not matched by the existing English-only patterns in isContextOverflowError. This prevents auto-compaction from triggering, leaving the session stuck. Add the most common Chinese proxy patterns: - 上下文过长 (context too long) - 上下文超出 (context exceeded) - 上下文长度超 (context length exceeds) - 超出最大上下文 (exceeds maximum context) - 请压缩上下文 (please compress context) Chinese characters are unaffected by toLowerCase() so check the original message directly. Closes #22849	2026-02-23 10:54:24 -05:00
Vincent Koc	4f340b8812	fix(agents): avoid classifying reasoning-required errors as context overflow (#24593 ) * Agents: exclude reasoning-required errors from overflow detection * Tests: cover reasoning-required overflow classification guard * Tests: format reasoning-required endpoint errors	2026-02-23 10:38:49 -05:00
Alice Losasso	652099cd5c	fix: correctly identify Groq TPM limits as rate limits instead of context overflow (#16176 ) Co-authored-by: Howard <dddabtc@users.noreply.github.com>	2026-02-23 10:32:53 -05:00
青雲	69692d0d3a	fix: detect additional context overflow error patterns to prevent leak to user (#20539 ) * fix: detect additional context overflow error patterns to prevent leak to user Fixes #9951 The error 'input length and max_tokens exceed context limit: 170636 + 34048 > 200000' was not caught by isContextOverflowError() and leaked to users via formatAssistantErrorText()'s invalidRequest fallback. Add three new patterns to isContextOverflowError(): - 'exceed context limit' (direct match) - 'exceeds the model\'s maximum context' - max_tokens/input length + exceed + context (compound match) These are now rewritten to the friendly context overflow message. * Overflow: add regression tests and changelog credits * Update CHANGELOG.md * Update pi-embedded-helpers.isbillingerrormessage.test.ts --------- Co-authored-by: echoVic <AkiraVic@outlook.com> Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-02-23 10:03:56 -05:00
Peter Steinberger	9bd04849ed	fix(agents): detect Kimi model-token-limit overflows Co-authored-by: Danilo Falcão <danilo@falcao.org>	2026-02-23 12:44:23 +00:00
taw0002	3c57bf4c85	fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017 ) * fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) When a model API returns 502 Bad Gateway, 503 Service Unavailable, or 504 Gateway Timeout, the error object carries the status code directly. resolveFailoverReasonFromError() only checked 402/429/401/403/408/400, so 5xx server errors fell through to message-based classification which requires the status code to appear at the start of the error message. Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing the message with '503', so the message classifier never matched and failover never triggered — the run retried the same broken model. Add 502/503/504 to the status-code branch, returning 'timeout' (matching the existing behavior of isTransientHttpError in the message classifier). Fixes #20999 * Changelog: add failover 502/503/504 note with credits * Failover: classify HTTP 504 as transient in message parser * Changelog: credit taw0002 and vincentkoc for failover fix --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-02-23 03:01:57 -05:00
青雲	3dfee78d72	fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595 ) (#23698 ) * fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595) Mistral requires tool call IDs to be exactly 9 alphanumeric characters ([a-zA-Z0-9]{9}). The existing sanitizeToolCallIdsForCloudCodeAssist mechanism only ran on historical messages at attempt start via sanitizeSessionHistory, but the pi-agent-core agent loop's internal tool call → tool result cycles bypassed that path entirely. Changes: - Wrap streamFn (like dropThinkingBlocks) so every outbound request sees sanitized tool call IDs when the transcript policy requires it - Replace call_${Date.now()} in pendingToolCalls with a 9-char hex ID generated from crypto.randomBytes - Add Mistral tool call ID error pattern to ERROR_PATTERNS.format so the error is correctly classified for retry/rotation * Changelog: document Mistral strict9 tool-call ID fix --------- Co-authored-by: echoVic <AkiraVic@outlook.com> Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-02-22 13:37:12 -05:00
Vignesh Natarajan	35fe33aa90	Agents: classify Anthropic api_error internal server failures for fallback	2026-02-21 19:22:16 -08:00
Harry Cui Kepler	ffa63173e0	refactor(agents): migrate console.warn/error/info to subsystem logger (#22906 ) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: `a806c4cb27` Co-authored-by: Kepler2024 <166882517+Kepler2024@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras	2026-02-21 17:11:47 -05:00
niceysam	5e423b596c	fix: remove false-positive billing error rewrite on normal assistant text (openclaw#17834) thanks @niceysam Verified: - pnpm install --frozen-lockfile - pnpm build - pnpm check - pnpm test:macmini Co-authored-by: niceysam <256747835+niceysam@users.noreply.github.com> Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>	2026-02-21 12:17:39 -06:00
mudrii	7ecfc1d93c	fix(auth): bidirectional mode/type compat + sync OAuth to all agents (#12692 ) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: `2dee8e1174` Co-authored-by: mudrii <220262+mudrii@users.noreply.github.com> Co-authored-by: obviyus <22031114+obviyus@users.noreply.github.com> Reviewed-by: @obviyus	2026-02-20 16:01:09 +05:30
Protocol Zero	2af3415fac	fix: treat HTTP 503 as failover-eligible for LLM provider errors (#21086 ) * fix: treat HTTP 503 as failover-eligible for LLM provider errors When LLM SDKs wrap 503 responses, the leading "503" prefix is lost (e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a numeric prefix). The existing isTransientHttpError only matches messages starting with "503 ...", so these wrapped errors silently skip failover — no profile rotation, no model fallback. This patch closes that gap: - resolveFailoverReasonFromError: map HTTP status 503 → rate_limit (covers structured error objects with a status field) - ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable", "high demand" (covers message-only classification when the leading status prefix is absent) Existing isTransientHttpError behavior is unchanged; these additions are complementary and only fire for errors that previously fell through unclassified. * fix: address review feedback — drop /\b503\b/ pattern, add test coverage - Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the semantic inconsistency noted by reviewers: `isTransientHttpError` already handles messages prefixed with "503" (→ "timeout"), so a redundant overloaded pattern would classify the same class of errors differently depending on message formatting. - Keep "service unavailable" and "high demand" patterns — these are the real gap-fillers for SDK-rewritten messages that lack a numeric prefix. - Add test case for JSON-wrapped 503 error body containing "overloaded" to strengthen coverage. * fix: unify 503 classification — status 503 → timeout (consistent with isTransientHttpError) resolveFailoverReasonFromError previously mapped status 503 → "rate_limit", while the string-based isTransientHttpError mapped "503 ..." → "timeout". Align both paths: structured {status: 503} now also returns "timeout", matching the existing transient-error convention. 
Both reasons are failover-eligible, so runtime behavior is unchanged. --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-02-19 12:45:09 -08:00
青雲	3d4ef56044	fix: include provider and model name in billing error message (#20510 ) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: `40dbdf62e8` Co-authored-by: echoVic <16428813+echoVic@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras	2026-02-18 21:56:00 -05:00
Peter Steinberger	1934eebbf0	refactor(agents): dedupe lifecycle send assertions and stable payload stringify	2026-02-18 14:15:14 +00:00
Peter Steinberger	b8b43175c5	style: align formatting with oxfmt 0.33	2026-02-18 01:34:35 +00:00
Peter Steinberger	31f9be126c	style: run oxfmt and fix gate failures	2026-02-18 01:29:02 +00:00
cpojer	d0cb8c19b2	chore: wtf.	2026-02-17 13:36:48 +09:00
Sebastian	ed11e93cf2	chore(format)	2026-02-16 23:20:16 -05:00
cpojer	90ef2d6bdf	chore: Update formatting.	2026-02-17 09:18:40 +09:00
Daniel Sauer	12ce358da5	fix(failover): recognize 'abort' stop reason as timeout for model fallback When streaming providers (GLM, OpenRouter, etc.) return 'stop reason: abort' due to stream interruption, OpenClaw's failover mechanism did not recognize this as a timeout condition. This prevented fallback models from being triggered, leaving users with failed requests instead of graceful failover. Changes: - Add abort patterns to ERROR_PATTERNS.timeout in pi-embedded-helpers/errors.ts - Extend TIMEOUT_HINT_RE regex to include abort patterns in failover-error.ts Fixes #18453 Co-authored-by: James <james@openclaw.ai>	2026-02-16 23:49:51 +01:00
Tyler Yust	b8f66c260d	Agents: add nested subagent orchestration controls and reduce subagent token waste (#14447 ) * Agents: add subagent orchestration controls * Agents: add subagent orchestration controls (WIP uncommitted changes) * feat(subagents): add depth-based spawn gating for sub-sub-agents * feat(subagents): tool policy, registry, and announce chain for nested agents * feat(subagents): system prompt, docs, changelog for nested sub-agents * fix(subagents): prevent model fallback override, show model during active runs, and block context overflow fallback Bug 1: When a session has an explicit model override (e.g., gpt/openai-codex), the fallback candidate logic in resolveFallbackCandidates silently appended the global primary model (opus) as a backstop. On reinjection/steer with a transient error, the session could fall back to opus which has a smaller context window and crash. Fix: when storedModelOverride is set, pass fallbacksOverride ?? [] instead of undefined, preventing the implicit primary backstop. Bug 2: Active subagents showed 'model n/a' in /subagents list because resolveModelDisplay only read entry.model/modelProvider (populated after run completes). Fix: fall back to modelOverride/providerOverride fields which are populated at spawn time via sessions.patch. Bug 3: Context overflow errors (prompt too long, context_length_exceeded) could theoretically escape runEmbeddedPiAgent and be treated as failover candidates in runWithModelFallback, causing a switch to a model with a smaller context window. Fix: in runWithModelFallback, detect context overflow errors via isLikelyContextOverflowError and rethrow them immediately instead of trying the next model candidate. 
* fix(subagents): track spawn depth in session store and fix announce routing for nested agents * Fix compaction status tracking and dedupe overflow compaction triggers * fix(subagents): enforce depth block via session store and implement cascade kill * fix: inject group chat context into system prompt * fix(subagents): always write model to session store at spawn time * Preserve spawnDepth when agent handler rewrites session entry * fix(subagents): suppress announce on steer-restart * fix(subagents): fallback spawned session model to runtime default * fix(subagents): enforce spawn depth when caller key resolves by sessionId * feat(subagents): implement active-first ordering for numeric targets and enhance task display - Added a test to verify that subagents with numeric targets follow an active-first list ordering. - Updated `resolveSubagentTarget` to sort subagent runs based on active status and recent activity. - Enhanced task display in command responses to prevent truncation of long task descriptions. - Introduced new utility functions for compacting task text and managing subagent run states. * fix(subagents): show model for active runs via run record fallback When the spawned model matches the agent's default model, the session store's override fields are intentionally cleared (isDefault: true). The model/modelProvider fields are only populated after the run completes. This left active subagents showing 'model n/a'. Fix: store the resolved model on SubagentRunRecord at registration time, and use it as a fallback in both display paths (subagents tool and /subagents command) when the session store entry has no model info. 
Changes: - SubagentRunRecord: add optional model field - registerSubagentRun: accept and persist model param - sessions-spawn-tool: pass resolvedModel to registerSubagentRun - subagents-tool: pass run record model as fallback to resolveModelDisplay - commands-subagents: pass run record model as fallback to resolveModelDisplay * feat(chat): implement session key resolution and reset on sidebar navigation - Added functions to resolve the main session key and reset chat state when switching sessions from the sidebar. - Updated the `renderTab` function to handle session key changes when navigating to the chat tab. - Introduced a test to verify that the session resets to "main" when opening chat from the sidebar navigation. * fix: subagent timeout=0 passthrough and fallback prompt duplication Bug 1: runTimeoutSeconds=0 now means 'no timeout' instead of applying 600s default - sessions-spawn-tool: default to undefined (not 0) when neither timeout param is provided; use != null check so explicit 0 passes through to gateway - agent.ts: accept 0 as valid timeout (resolveAgentTimeoutMs already handles 0 → MAX_SAFE_TIMEOUT_MS) Bug 2: model fallback no longer re-injects the original prompt as a duplicate - agent.ts: track fallback attempt index; on retries use a short continuation message instead of the full original prompt since the session file already contains it from the first attempt - Also skip re-sending images on fallback retries (already in session) * feat(subagents): truncate long task descriptions in subagents command output - Introduced a new utility function to format task previews, limiting their length to improve readability. - Updated the command handler to use the new formatting function, ensuring task descriptions are truncated appropriately. - Adjusted related tests to verify that long task descriptions are now truncated in the output. 
* refactor(subagents): update subagent registry path resolution and improve command output formatting - Replaced direct import of STATE_DIR with a utility function to resolve the state directory dynamically. - Enhanced the formatting of command output for active and recent subagents, adding separators for better readability. - Updated related tests to reflect changes in command output structure. * fix(subagent): default sessions_spawn to no timeout when runTimeoutSeconds omitted The previous fix (75a791106) correctly handled the case where runTimeoutSeconds was explicitly set to 0 ("no timeout"). However, when models omit the parameter entirely (which is common since the schema marks it as optional), runTimeoutSeconds resolved to undefined. undefined flowed through the chain as: sessions_spawn → timeout: undefined (since undefined != null is false) → gateway agent handler → agentCommand opts.timeout: undefined → resolveAgentTimeoutMs({ overrideSeconds: undefined }) → DEFAULT_AGENT_TIMEOUT_SECONDS (600s = 10 minutes) This caused subagents to be killed at exactly 10 minutes even though the user's intent (via TOOLS.md) was for subagents to run without a timeout. Fix: default runTimeoutSeconds to 0 (no timeout) when neither runTimeoutSeconds nor timeoutSeconds is provided by the caller. Subagent spawns are long-running by design and should not inherit the 600s agent-command default timeout. * fix(subagent): accept timeout=0 in agent-via-gateway path (second 600s default) * fix: thread timeout override through getReplyFromConfig dispatch path getReplyFromConfig called resolveAgentTimeoutMs({ cfg }) with no override, always falling back to the config default (600s). Add timeoutOverrideSeconds to GetReplyOptions and pass it through as overrideSeconds so callers of the dispatch chain can specify a custom timeout (0 = no timeout). This complements the existing timeout threading in agentCommand and the cron isolated-agent runner, which already pass overrideSeconds correctly. 
* feat(model-fallback): normalize OpenAI Codex model references and enhance fallback handling - Added normalization for OpenAI Codex model references, specifically converting "gpt-5.3-codex" to "openai-codex" before execution. - Updated the `resolveFallbackCandidates` function to utilize the new normalization logic. - Enhanced tests to verify the correct behavior of model normalization and fallback mechanisms. - Introduced a new test case to ensure that the normalization process works as expected for various input formats. * feat(tests): add unit tests for steer failure behavior in openclaw-tools - Introduced a new test file to validate the behavior of subagents when steer replacement dispatch fails. - Implemented tests to ensure that the announce behavior is restored correctly and that the suppression reason is cleared as expected. - Enhanced the subagent registry with a new function to clear steer restart suppression. - Updated related components to support the new test scenarios. * fix(subagents): replace stop command with kill in slash commands and documentation - Updated the `/subagents` command to replace `stop` with `kill` for consistency in controlling sub-agent runs. - Modified related documentation to reflect the change in command usage. - Removed legacy timeoutSeconds references from the sessions-spawn-tool schema and tests to streamline timeout handling. - Enhanced tests to ensure correct behavior of the updated commands and their interactions. * feat(tests): add unit tests for readLatestAssistantReply function - Introduced a new test file for the `readLatestAssistantReply` function to validate its behavior with various message scenarios. - Implemented tests to ensure the function correctly retrieves the latest assistant message and handles cases where the latest message has no text. - Mocked the gateway call to simulate different message histories for comprehensive testing. 
* feat(tests): enhance subagent kill-all cascade tests and announce formatting - Added a new test to verify that the `kill-all` command cascades through ended parents to active descendants in subagents. - Updated the subagent announce formatting tests to reflect changes in message structure, including the replacement of "Findings:" with "Result:" and the addition of new expectations for message content. - Improved the handling of long findings and stats in the announce formatting logic to ensure concise output. - Refactored related functions to enhance clarity and maintainability in the subagent registry and tools. * refactor(subagent): update announce formatting and remove unused constants - Modified the subagent announce formatting to replace "Findings:" with "Result:" and adjusted related expectations in tests. - Removed constants for maximum announce findings characters and summary words, simplifying the announcement logic. - Updated the handling of findings to retain full content instead of truncating, ensuring more informative outputs. - Cleaned up unused imports in the commands-subagents file to enhance code clarity. * feat(tests): enhance billing error handling in user-facing text - Added tests to ensure that normal text mentioning billing plans is not rewritten, preserving user context. - Updated the `isBillingErrorMessage` and `sanitizeUserFacingText` functions to improve handling of billing-related messages. - Introduced new test cases for various scenarios involving billing messages to ensure accurate processing and output. - Enhanced the subagent announce flow to correctly manage active descendant runs, preventing premature announcements. * feat(subagent): enhance workflow guidance and auto-announcement clarity - Added a new guideline in the subagent system prompt to emphasize trust in push-based completion, discouraging busy polling for status updates. 
- Updated documentation to clarify that sub-agents will automatically announce their results, improving user understanding of the workflow. - Enhanced tests to verify the new guidance on avoiding polling loops and to ensure the accuracy of the updated prompts. * fix(cron): avoid announcing interim subagent spawn acks * chore: clean post-rebase imports * fix(cron): fall back to child replies when parent stays interim * fix(subagents): make active-run guidance advisory * fix(subagents): update announce flow to handle active descendants and enhance test coverage - Modified the announce flow to defer announcements when active descendant runs are present, ensuring accurate status reporting. - Updated tests to verify the new behavior, including scenarios where no fallback requester is available and ensuring proper handling of finished subagents. - Enhanced the announce formatting to include an `expectFinal` flag for better clarity in the announcement process. * fix(subagents): enhance announce flow and formatting for user updates - Updated the announce flow to provide clearer instructions for user updates based on active subagent runs and requester context. - Refactored the announcement logic to improve clarity and ensure internal context remains private. - Enhanced tests to verify the new message expectations and formatting, including updated prompts for user-facing updates. - Introduced a new function to build reply instructions based on session context, improving the overall announcement process. * fix: resolve prep blockers and changelog placement (#14447) (thanks @tyler6204) * fix: restore cron delivery-plan import after rebase (#14447) (thanks @tyler6204) * fix: resolve test failures from rebase conflicts (#14447) (thanks @tyler6204) * fix: apply formatting after rebase (#14447) (thanks @tyler6204)	2026-02-14 22:03:45 -08:00
Vignesh Natarajan	eb846c95bf	fix (agents): classify empty-chunk stream failures as timeout	2026-02-14 18:54:03 -08:00
Peter Steinberger	d714ac7797	refactor(agents): dedupe transient error copy (#16324 )	2026-02-14 17:49:25 +01:00
Vincent	478af81706	Return user-facing message if API returns 429 API rate limit reached #2202 (#10415 ) * Return user-facing message if API returns 429 API rate limit reached * clarify the error message * fix(agents): improve 429 user messaging (#10415) (thanks @vincenthsin) --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>	2026-02-14 17:40:02 +01:00
Peter Steinberger	50a6e0e69e	fix: strip leading empty lines in sanitizeUserFacingText (#16280 ) * fix: strip leading empty lines in sanitizeUserFacingText (#16158) (thanks @mcinteerj) * fix: strip leading empty lines in sanitizeUserFacingText (#16158) (thanks @mcinteerj) * fix: strip leading empty lines in sanitizeUserFacingText (#16158) (thanks @mcinteerj)	2026-02-14 16:34:02 +01:00


86 Commits
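Several commits above change how HTTP status codes map to failover reasons, and 01b20172b8 spells out a narrow rule for 402: treat it as a temporary usage limit only when the payload contains BOTH retry language (try again/retry/temporary/cooldown) AND limit terminology (usage limit/rate limit/organization usage). The sketch below collects those stated mappings into one function. It is hypothetical: `FailoverReason`, `classifyFailoverStatus`, and the regex constants are illustrative names, not openclaw's `classifyFailoverReasonFromHttpStatus`, and later commits (for example 6e962d8b9e, which splits out overloaded handling) may have changed individual mappings.

```typescript
// Illustrative reason labels, named after those used in the commit messages.
// This is NOT openclaw's FailoverReason type; it is a hypothetical stand-in.
type FailoverReason = "billing" | "rate_limit" | "timeout" | "overloaded" | "format";

// Narrow 402 matcher as described in 01b20172b8: a 402 counts as a temporary
// usage limit only if the payload has BOTH retry language AND limit
// terminology. Either signal alone is not enough.
const RETRY_LANGUAGE = /try again|retry|temporary|cooldown/i;
const LIMIT_TERMINOLOGY = /usage limit|rate limit|organization usage/i;

function isTemporaryUsageLimit402(message: string): boolean {
  return RETRY_LANGUAGE.test(message) && LIMIT_TERMINOLOGY.test(message);
}

// Hypothetical status classifier. Mappings follow the commit messages:
// 402 billing unless the narrow matcher fires (01b20172b8); 429 rate_limit
// (standard Too Many Requests); 499 and 502/503/504 as transient "timeout"
// (c9a6c542ef, 3c57bf4c85, 2af3415fac); 529 overloaded (6e962d8b9e);
// 422 format (7332e6d609). Other statuses return undefined so that
// message-based checks can take over.
function classifyFailoverStatus(
  status: number,
  message = "",
): FailoverReason | undefined {
  switch (status) {
    case 402:
      return isTemporaryUsageLimit402(message) ? "rate_limit" : "billing";
    case 429:
      return "rate_limit";
    case 499:
    case 502:
    case 503:
    case 504:
      return "timeout";
    case 529:
      return "overloaded";
    case 422:
      return "format";
    default:
      return undefined;
  }
}

// The ambiguous payload quoted in 01b20172b8 stays "billing".
console.log(classifyFailoverStatus(402, "HTTP 402: exceeded quota, please add credits")); // prints: billing
console.log(classifyFailoverStatus(402, "Usage limit reached. Please try again later.")); // prints: rate_limit
```

Requiring both signals is the design point of that commit: a broad rate-limit matcher also fires on phrases like "quota exceeded", which legitimately appear in billing errors.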
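Context-overflow detection in these commits is message-based: new English phrasings (69692d0d3a, 60a6d11116), Chinese proxy phrasings matched against the original string because `toLowerCase()` leaves CJK characters unchanged (544809b6f6), and token-throughput errors that must be treated as rate limits rather than overflow (652099cd5c, 8a20f51460). The following is a minimal sketch under those assumptions. The phrase lists come from the commit messages, but `looksLikeContextOverflow` and its constants are hypothetical names, the `TOKEN_RATE_LIMIT` regex is an assumed wording, and none of this reproduces openclaw's `isContextOverflowError`.

```typescript
// English phrasings taken from the commit messages; matched against a
// lowercased copy of the error message.
const OVERFLOW_PHRASES = [
  "context_length_exceeded",
  "model_context_window_exceeded",
  "prompt too long",
  "exceed context limit",
  "exceeds the model's maximum context",
];

// Chinese proxy phrasings listed in 544809b6f6. toLowerCase() does not change
// CJK characters, so these are matched against the original text.
const OVERFLOW_PHRASES_ZH = [
  "上下文过长",
  "上下文超出",
  "上下文长度超",
  "超出最大上下文",
  "请压缩上下文",
];

// Compound form from 69692d0d3a: input length / max_tokens + exceed + context.
const OVERFLOW_COMPOUND = /(?:max_tokens|input length).*exceed.*context/i;

// Token-throughput limits (Groq TPM, "too many tokens", "tokens per day") are
// rate limits, not overflow. This exact regex is an assumption.
const TOKEN_RATE_LIMIT = /tokens per (?:minute|day)|too many tokens|\(tpm\)/i;

function looksLikeContextOverflow(message: string): boolean {
  // Rule out throughput limits first so they never trigger compaction.
  if (TOKEN_RATE_LIMIT.test(message)) return false;
  const lower = message.toLowerCase();
  if (OVERFLOW_PHRASES.some((p) => lower.includes(p))) return true;
  if (OVERFLOW_PHRASES_ZH.some((p) => message.includes(p))) return true;
  return OVERFLOW_COMPOUND.test(message);
}

// The leaked error quoted in 69692d0d3a is now recognized.
console.log(
  looksLikeContextOverflow("input length and max_tokens exceed context limit: 170636 + 34048 > 200000"),
); // prints: true
```

Checking the rate-limit phrasing before the overflow phrases mirrors the intent of 652099cd5c: a TPM error that mentions token counts should back off and retry rather than compact the session.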