Commit Graph

72 Commits

Author SHA1 Message Date
Peter Steinberger 6472e03412 refactor(agents): share failover error matchers 2026-03-03 02:51:00 +00:00
AI南柯(KingMo) 30ab9b2068
fix(agents): recognize connection errors as retryable timeout failures (#31697)
* fix(agents): recognize connection errors as retryable timeout failures

## Problem

When a model endpoint becomes unreachable (e.g., local proxy down,
relay server offline), the failover system fails to switch to the
next candidate model. Errors like "Connection error." are not
classified as retryable, causing the session to hang on a broken
endpoint instead of falling back to healthy alternatives.

## Root Cause

Connection/network errors are not recognized by the current failover
classifier:
- Text patterns like "Connection error.", "fetch failed", "network error"
- Error codes like ECONNREFUSED, ENOTFOUND, EAI_AGAIN (in message text)

While `failover-error.ts` handles these as error codes (err.code),
it misses them when they appear as plain text in error messages.

## Solution

Extend timeout error patterns to include connection/network failures:

**In `errors.ts` (ERROR_PATTERNS.timeout):**
- Text: "connection error", "network error", "fetch failed", etc.
- Regex: /\beconn(?:refused|reset|aborted)\b/i, /\benotfound\b/i, /\beai_again\b/i

**In `failover-error.ts` (TIMEOUT_HINT_RE):**
- Same patterns for non-assistant error paths

## Testing

Added test cases covering:
- "Connection error."
- "fetch failed"
- "network error: ECONNREFUSED"
- "ENOTFOUND" / "EAI_AGAIN" in message text

## Impact

- **Compatibility:** High - only expands retryable error detection
- **Behavior:** Connection failures now trigger automatic fallback
- **Risk:** Low - changes are additive and well-tested

* style: fix code formatting for test file
2026-03-03 02:37:23 +00:00
Peter Steinberger 1bd20dbdb6 fix(failover): treat stop reason error as timeout 2026-03-03 01:05:24 +00:00
Peter Steinberger a2fdc3415f fix(failover): handle unhandled stop reason error 2026-03-03 01:05:24 +00:00
Sid 40e078a567
fix(auth): classify permission_error as auth_permanent for profile fallback (#31324)
When an OAuth auth profile returns HTTP 403 with permission_error
(e.g. expired plan), the error was not matched by the authPermanent
patterns. This caused the profile to receive only a short cooldown
instead of being disabled, so the gateway kept retrying the same
broken profile indefinitely.

Add "permission_error" and "not allowed for this organization" to
the authPermanent error patterns so these errors trigger the longer
billing/auth_permanent disable window and proper profile rotation.

Closes #31306

Made-with: Cursor

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-03-01 22:26:05 -08:00
Frank Yang ed86252aa5
fix: handle CLI session expired errors gracefully instead of crashing gateway (#31090)
* fix: handle CLI session expired errors gracefully

- Add session_expired to FailoverReason type
- Add isCliSessionExpiredErrorMessage to detect expired CLI sessions
- Modify runCliAgent to retry with new session when session expires
- Update agentCommand to clear expired session IDs from session store
- Add proper error handling to prevent gateway crashes on expired sessions

Fixes #30986

* fix: add session_expired to AuthProfileFailureReason and missing log import

* fix: type cli-runner usage field to match EmbeddedPiAgentMeta

* fix: harden CLI session-expiry recovery handling

* build: regenerate host env security policy swift

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-03-02 01:11:05 +00:00
Peter Steinberger 250f9e15f5 fix(agents): land #31007 from @HOYALIM
Co-authored-by: Ho Lim <subhoya@gmail.com>
2026-03-02 01:06:00 +00:00
Aleksandrs Tihenko c0026274d9
fix(auth): distinguish revoked API keys from transient auth errors (#25754)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 8f9c07a200
Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-25 19:47:16 -05:00
Peter Steinberger d2597d5ecf fix(agents): harden model fallback failover paths 2026-02-25 03:46:34 +00:00
Peter Steinberger 43f318cd9a fix(agents): reduce billing false positives on long text (#25680)
Land PR #25680 from @lairtonlelis.
Retain explicit status/code/http 402 detection for oversized structured payloads.

Co-authored-by: Ailton <lairton@telnyx.com>
2026-02-25 01:22:17 +00:00
Peter Machona 9ced64054f
fix(auth): classify missing OAuth scopes as auth failures (#24761) 2026-02-24 03:33:44 +00:00
Clawborn 544809b6f6
Add Chinese context overflow patterns to isContextOverflowError (#22855)
Proxy providers returning Chinese error messages (e.g. Chinese LLM
gateways) use patterns like '上下文过长' or '上下文超出' that are not
matched by the existing English-only patterns in isContextOverflowError.
This prevents auto-compaction from triggering, leaving the session stuck.

Add the most common Chinese proxy patterns:
- 上下文过长 (context too long)
- 上下文超出 (context exceeded)
- 上下文长度超 (context length exceeds)
- 超出最大上下文 (exceeds maximum context)
- 请压缩上下文 (please compress context)

Chinese characters are unaffected by toLowerCase() so check the
original message directly.

Closes #22849
2026-02-23 10:54:24 -05:00
Vincent Koc 4f340b8812
fix(agents): avoid classifying reasoning-required errors as context overflow (#24593)
* Agents: exclude reasoning-required errors from overflow detection

* Tests: cover reasoning-required overflow classification guard

* Tests: format reasoning-required endpoint errors
2026-02-23 10:38:49 -05:00
Alice Losasso 652099cd5c
fix: correctly identify Groq TPM limits as rate limits instead of context overflow (#16176)
Co-authored-by: Howard <dddabtc@users.noreply.github.com>
2026-02-23 10:32:53 -05:00
青雲 69692d0d3a
fix: detect additional context overflow error patterns to prevent leak to user (#20539)
* fix: detect additional context overflow error patterns to prevent leak to user

Fixes #9951

The error 'input length and max_tokens exceed context limit: 170636 +
34048 > 200000' was not caught by isContextOverflowError() and leaked
to users via formatAssistantErrorText()'s invalidRequest fallback.

Add three new patterns to isContextOverflowError():
- 'exceed context limit' (direct match)
- 'exceeds the model\'s maximum context'
- max_tokens/input length + exceed + context (compound match)

These are now rewritten to the friendly context overflow message.

* Overflow: add regression tests and changelog credits

* Update CHANGELOG.md

* Update pi-embedded-helpers.isbillingerrormessage.test.ts

---------

Co-authored-by: echoVic <AkiraVic@outlook.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-23 10:03:56 -05:00
Peter Steinberger 9bd04849ed fix(agents): detect Kimi model-token-limit overflows
Co-authored-by: Danilo Falcão <danilo@falcao.org>
2026-02-23 12:44:23 +00:00
taw0002 3c57bf4c85
fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017)
* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.

Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).

Fixes #20999

* Changelog: add failover 502/503/504 note with credits

* Failover: classify HTTP 504 as transient in message parser

* Changelog: credit taw0002 and vincentkoc for failover fix

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-23 03:01:57 -05:00
青雲 3dfee78d72
fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595) (#23698)
* fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595)

Mistral requires tool call IDs to be exactly 9 alphanumeric characters
([a-zA-Z0-9]{9}). The existing sanitizeToolCallIdsForCloudCodeAssist
mechanism only ran on historical messages at attempt start via
sanitizeSessionHistory, but the pi-agent-core agent loop's internal
tool call → tool result cycles bypassed that path entirely.

Changes:
- Wrap streamFn (like dropThinkingBlocks) so every outbound request
  sees sanitized tool call IDs when the transcript policy requires it
- Replace call_${Date.now()} in pendingToolCalls with a 9-char hex ID
  generated from crypto.randomBytes
- Add Mistral tool call ID error pattern to ERROR_PATTERNS.format so
  the error is correctly classified for retry/rotation

* Changelog: document Mistral strict9 tool-call ID fix

---------

Co-authored-by: echoVic <AkiraVic@outlook.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-22 13:37:12 -05:00
Vignesh Natarajan 35fe33aa90 Agents: classify Anthropic api_error internal server failures for fallback 2026-02-21 19:22:16 -08:00
Harry Cui Kepler ffa63173e0
refactor(agents): migrate console.warn/error/info to subsystem logger (#22906)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: a806c4cb27
Co-authored-by: Kepler2024 <166882517+Kepler2024@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-21 17:11:47 -05:00
niceysam 5e423b596c
fix: remove false-positive billing error rewrite on normal assistant text (openclaw#17834) thanks @niceysam
Verified:
- pnpm install --frozen-lockfile
- pnpm build
- pnpm check
- pnpm test:macmini

Co-authored-by: niceysam <256747835+niceysam@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-21 12:17:39 -06:00
mudrii 7ecfc1d93c
fix(auth): bidirectional mode/type compat + sync OAuth to all agents (#12692)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 2dee8e1174
Co-authored-by: mudrii <220262+mudrii@users.noreply.github.com>
Co-authored-by: obviyus <22031114+obviyus@users.noreply.github.com>
Reviewed-by: @obviyus
2026-02-20 16:01:09 +05:30
Protocol Zero 2af3415fac
fix: treat HTTP 503 as failover-eligible for LLM provider errors (#21086)
* fix: treat HTTP 503 as failover-eligible for LLM provider errors

When LLM SDKs wrap 503 responses, the leading "503" prefix is lost
(e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a
numeric prefix). The existing isTransientHttpError only matches
messages starting with "503 ...", so these wrapped errors silently
skip failover — no profile rotation, no model fallback.

This patch closes that gap:

- resolveFailoverReasonFromError: map HTTP status 503 → rate_limit
  (covers structured error objects with a status field)
- ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable",
  "high demand" (covers message-only classification when the leading
  status prefix is absent)

Existing isTransientHttpError behavior is unchanged; these additions
are complementary and only fire for errors that previously fell
through unclassified.

* fix: address review feedback — drop /\b503\b/ pattern, add test coverage

- Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the
  semantic inconsistency noted by reviewers: `isTransientHttpError`
  already handles messages prefixed with "503" (→ "timeout"), so a
  redundant overloaded pattern would classify the same class of errors
  differently depending on message formatting.

- Keep "service unavailable" and "high demand" patterns — these are the
  real gap-fillers for SDK-rewritten messages that lack a numeric prefix.

- Add test case for JSON-wrapped 503 error body containing "overloaded"
  to strengthen coverage.

* fix: unify 503 classification — status 503 → timeout (consistent with isTransientHttpError)

resolveFailoverReasonFromError previously mapped status 503 → "rate_limit",
while the string-based isTransientHttpError mapped "503 ..." → "timeout".

Align both paths: structured {status: 503} now also returns "timeout",
matching the existing transient-error convention. Both reasons are
failover-eligible, so runtime behavior is unchanged.

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-19 12:45:09 -08:00
青雲 3d4ef56044
fix: include provider and model name in billing error message (#20510)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 40dbdf62e8
Co-authored-by: echoVic <16428813+echoVic@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-18 21:56:00 -05:00
Peter Steinberger 1934eebbf0 refactor(agents): dedupe lifecycle send assertions and stable payload stringify 2026-02-18 14:15:14 +00:00
Peter Steinberger b8b43175c5 style: align formatting with oxfmt 0.33 2026-02-18 01:34:35 +00:00
Peter Steinberger 31f9be126c style: run oxfmt and fix gate failures 2026-02-18 01:29:02 +00:00
cpojer d0cb8c19b2
chore: wtf. 2026-02-17 13:36:48 +09:00
Sebastian ed11e93cf2 chore(format) 2026-02-16 23:20:16 -05:00
cpojer 90ef2d6bdf
chore: Update formatting. 2026-02-17 09:18:40 +09:00
Daniel Sauer 12ce358da5 fix(failover): recognize 'abort' stop reason as timeout for model fallback
When streaming providers (GLM, OpenRouter, etc.) return 'stop reason: abort'
due to stream interruption, OpenClaw's failover mechanism did not recognize
this as a timeout condition. This prevented fallback models from being
triggered, leaving users with failed requests instead of graceful failover.

Changes:
- Add abort patterns to ERROR_PATTERNS.timeout in pi-embedded-helpers/errors.ts
- Extend TIMEOUT_HINT_RE regex to include abort patterns in failover-error.ts

Fixes #18453

Co-authored-by: James <james@openclaw.ai>
2026-02-16 23:49:51 +01:00
Tyler Yust b8f66c260d
Agents: add nested subagent orchestration controls and reduce subagent token waste (#14447)
* Agents: add subagent orchestration controls

* Agents: add subagent orchestration controls (WIP uncommitted changes)

* feat(subagents): add depth-based spawn gating for sub-sub-agents

* feat(subagents): tool policy, registry, and announce chain for nested agents

* feat(subagents): system prompt, docs, changelog for nested sub-agents

* fix(subagents): prevent model fallback override, show model during active runs, and block context overflow fallback

Bug 1: When a session has an explicit model override (e.g., gpt/openai-codex),
the fallback candidate logic in resolveFallbackCandidates silently appended the
global primary model (opus) as a backstop. On reinjection/steer with a transient
error, the session could fall back to opus which has a smaller context window
and crash. Fix: when storedModelOverride is set, pass fallbacksOverride ?? []
instead of undefined, preventing the implicit primary backstop.

Bug 2: Active subagents showed 'model n/a' in /subagents list because
resolveModelDisplay only read entry.model/modelProvider (populated after run
completes). Fix: fall back to modelOverride/providerOverride fields which are
populated at spawn time via sessions.patch.

Bug 3: Context overflow errors (prompt too long, context_length_exceeded) could
theoretically escape runEmbeddedPiAgent and be treated as failover candidates
in runWithModelFallback, causing a switch to a model with a smaller context
window. Fix: in runWithModelFallback, detect context overflow errors via
isLikelyContextOverflowError and rethrow them immediately instead of trying the
next model candidate.

* fix(subagents): track spawn depth in session store and fix announce routing for nested agents

* Fix compaction status tracking and dedupe overflow compaction triggers

* fix(subagents): enforce depth block via session store and implement cascade kill

* fix: inject group chat context into system prompt

* fix(subagents): always write model to session store at spawn time

* Preserve spawnDepth when agent handler rewrites session entry

* fix(subagents): suppress announce on steer-restart

* fix(subagents): fallback spawned session model to runtime default

* fix(subagents): enforce spawn depth when caller key resolves by sessionId

* feat(subagents): implement active-first ordering for numeric targets and enhance task display

- Added a test to verify that subagents with numeric targets follow an active-first list ordering.
- Updated `resolveSubagentTarget` to sort subagent runs based on active status and recent activity.
- Enhanced task display in command responses to prevent truncation of long task descriptions.
- Introduced new utility functions for compacting task text and managing subagent run states.

* fix(subagents): show model for active runs via run record fallback

When the spawned model matches the agent's default model, the session
store's override fields are intentionally cleared (isDefault: true).
The model/modelProvider fields are only populated after the run
completes. This left active subagents showing 'model n/a'.

Fix: store the resolved model on SubagentRunRecord at registration
time, and use it as a fallback in both display paths (subagents tool
and /subagents command) when the session store entry has no model info.

Changes:
- SubagentRunRecord: add optional model field
- registerSubagentRun: accept and persist model param
- sessions-spawn-tool: pass resolvedModel to registerSubagentRun
- subagents-tool: pass run record model as fallback to resolveModelDisplay
- commands-subagents: pass run record model as fallback to resolveModelDisplay

* feat(chat): implement session key resolution and reset on sidebar navigation

- Added functions to resolve the main session key and reset chat state when switching sessions from the sidebar.
- Updated the `renderTab` function to handle session key changes when navigating to the chat tab.
- Introduced a test to verify that the session resets to "main" when opening chat from the sidebar navigation.

* fix: subagent timeout=0 passthrough and fallback prompt duplication

Bug 1: runTimeoutSeconds=0 now means 'no timeout' instead of applying 600s default
- sessions-spawn-tool: default to undefined (not 0) when neither timeout param
  is provided; use != null check so explicit 0 passes through to gateway
- agent.ts: accept 0 as valid timeout (resolveAgentTimeoutMs already handles
  0 → MAX_SAFE_TIMEOUT_MS)

Bug 2: model fallback no longer re-injects the original prompt as a duplicate
- agent.ts: track fallback attempt index; on retries use a short continuation
  message instead of the full original prompt since the session file already
  contains it from the first attempt
- Also skip re-sending images on fallback retries (already in session)

* feat(subagents): truncate long task descriptions in subagents command output

- Introduced a new utility function to format task previews, limiting their length to improve readability.
- Updated the command handler to use the new formatting function, ensuring task descriptions are truncated appropriately.
- Adjusted related tests to verify that long task descriptions are now truncated in the output.

* refactor(subagents): update subagent registry path resolution and improve command output formatting

- Replaced direct import of STATE_DIR with a utility function to resolve the state directory dynamically.
- Enhanced the formatting of command output for active and recent subagents, adding separators for better readability.
- Updated related tests to reflect changes in command output structure.

* fix(subagent): default sessions_spawn to no timeout when runTimeoutSeconds omitted

The previous fix (75a791106) correctly handled the case where
runTimeoutSeconds was explicitly set to 0 ("no timeout"). However,
when models omit the parameter entirely (which is common since the
schema marks it as optional), runTimeoutSeconds resolved to undefined.

undefined flowed through the chain as:
  sessions_spawn → timeout: undefined (since undefined != null is false)
  → gateway agent handler → agentCommand opts.timeout: undefined
  → resolveAgentTimeoutMs({ overrideSeconds: undefined })
  → DEFAULT_AGENT_TIMEOUT_SECONDS (600s = 10 minutes)

This caused subagents to be killed at exactly 10 minutes even though
the user's intent (via TOOLS.md) was for subagents to run without a
timeout.

Fix: default runTimeoutSeconds to 0 (no timeout) when neither
runTimeoutSeconds nor timeoutSeconds is provided by the caller.
Subagent spawns are long-running by design and should not inherit the
600s agent-command default timeout.

* fix(subagent): accept timeout=0 in agent-via-gateway path (second 600s default)

* fix: thread timeout override through getReplyFromConfig dispatch path

getReplyFromConfig called resolveAgentTimeoutMs({ cfg }) with no override,
always falling back to the config default (600s). Add timeoutOverrideSeconds
to GetReplyOptions and pass it through as overrideSeconds so callers of the
dispatch chain can specify a custom timeout (0 = no timeout).

This complements the existing timeout threading in agentCommand and the
cron isolated-agent runner, which already pass overrideSeconds correctly.

* feat(model-fallback): normalize OpenAI Codex model references and enhance fallback handling

- Added normalization for OpenAI Codex model references, specifically converting "gpt-5.3-codex" to "openai-codex" before execution.
- Updated the `resolveFallbackCandidates` function to utilize the new normalization logic.
- Enhanced tests to verify the correct behavior of model normalization and fallback mechanisms.
- Introduced a new test case to ensure that the normalization process works as expected for various input formats.

* feat(tests): add unit tests for steer failure behavior in openclaw-tools

- Introduced a new test file to validate the behavior of subagents when steer replacement dispatch fails.
- Implemented tests to ensure that the announce behavior is restored correctly and that the suppression reason is cleared as expected.
- Enhanced the subagent registry with a new function to clear steer restart suppression.
- Updated related components to support the new test scenarios.

* fix(subagents): replace stop command with kill in slash commands and documentation

- Updated the `/subagents` command to replace `stop` with `kill` for consistency in controlling sub-agent runs.
- Modified related documentation to reflect the change in command usage.
- Removed legacy timeoutSeconds references from the sessions-spawn-tool schema and tests to streamline timeout handling.
- Enhanced tests to ensure correct behavior of the updated commands and their interactions.

* feat(tests): add unit tests for readLatestAssistantReply function

- Introduced a new test file for the `readLatestAssistantReply` function to validate its behavior with various message scenarios.
- Implemented tests to ensure the function correctly retrieves the latest assistant message and handles cases where the latest message has no text.
- Mocked the gateway call to simulate different message histories for comprehensive testing.

* feat(tests): enhance subagent kill-all cascade tests and announce formatting

- Added a new test to verify that the `kill-all` command cascades through ended parents to active descendants in subagents.
- Updated the subagent announce formatting tests to reflect changes in message structure, including the replacement of "Findings:" with "Result:" and the addition of new expectations for message content.
- Improved the handling of long findings and stats in the announce formatting logic to ensure concise output.
- Refactored related functions to enhance clarity and maintainability in the subagent registry and tools.

* refactor(subagent): update announce formatting and remove unused constants

- Modified the subagent announce formatting to replace "Findings:" with "Result:" and adjusted related expectations in tests.
- Removed constants for maximum announce findings characters and summary words, simplifying the announcement logic.
- Updated the handling of findings to retain full content instead of truncating, ensuring more informative outputs.
- Cleaned up unused imports in the commands-subagents file to enhance code clarity.

* feat(tests): enhance billing error handling in user-facing text

- Added tests to ensure that normal text mentioning billing plans is not rewritten, preserving user context.
- Updated the `isBillingErrorMessage` and `sanitizeUserFacingText` functions to improve handling of billing-related messages.
- Introduced new test cases for various scenarios involving billing messages to ensure accurate processing and output.
- Enhanced the subagent announce flow to correctly manage active descendant runs, preventing premature announcements.

* feat(subagent): enhance workflow guidance and auto-announcement clarity

- Added a new guideline in the subagent system prompt to emphasize trust in push-based completion, discouraging busy polling for status updates.
- Updated documentation to clarify that sub-agents will automatically announce their results, improving user understanding of the workflow.
- Enhanced tests to verify the new guidance on avoiding polling loops and to ensure the accuracy of the updated prompts.

* fix(cron): avoid announcing interim subagent spawn acks

* chore: clean post-rebase imports

* fix(cron): fall back to child replies when parent stays interim

* fix(subagents): make active-run guidance advisory

* fix(subagents): update announce flow to handle active descendants and enhance test coverage

- Modified the announce flow to defer announcements when active descendant runs are present, ensuring accurate status reporting.
- Updated tests to verify the new behavior, including scenarios where no fallback requester is available and ensuring proper handling of finished subagents.
- Enhanced the announce formatting to include an `expectFinal` flag for better clarity in the announcement process.

* fix(subagents): enhance announce flow and formatting for user updates

- Updated the announce flow to provide clearer instructions for user updates based on active subagent runs and requester context.
- Refactored the announcement logic to improve clarity and ensure internal context remains private.
- Enhanced tests to verify the new message expectations and formatting, including updated prompts for user-facing updates.
- Introduced a new function to build reply instructions based on session context, improving the overall announcement process.

* fix: resolve prep blockers and changelog placement (#14447) (thanks @tyler6204)

* fix: restore cron delivery-plan import after rebase (#14447) (thanks @tyler6204)

* fix: resolve test failures from rebase conflicts (#14447) (thanks @tyler6204)

* fix: apply formatting after rebase (#14447) (thanks @tyler6204)
2026-02-14 22:03:45 -08:00
Vignesh Natarajan eb846c95bf fix (agents): classify empty-chunk stream failures as timeout 2026-02-14 18:54:03 -08:00
Peter Steinberger d714ac7797
refactor(agents): dedupe transient error copy (#16324) 2026-02-14 17:49:25 +01:00
Vincent 478af81706
Return user-facing message if API reuturn 429 API rate limit reached #2202 (#10415)
* Return user-facing message if API reuturn 429 API rate limit reached

* clarify the error message

* fix(agents): improve 429 user messaging (#10415) (thanks @vincenthsin)

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-02-14 17:40:02 +01:00
Peter Steinberger 50a6e0e69e
fix: strip leading empty lines in sanitizeUserFacingText (#16280)
* fix: strip leading empty lines in sanitizeUserFacingText (#16158) (thanks @mcinteerj)

* fix: strip leading empty lines in sanitizeUserFacingText (#16158) (thanks @mcinteerj)

* fix: strip leading empty lines in sanitizeUserFacingText (#16158) (thanks @mcinteerj)
2026-02-14 16:34:02 +01:00
Jake 3881af5b37
fix: strip leading whitespace from sanitizeUserFacingText output (#16158)
* fix: strip leading whitespace from sanitizeUserFacingText output

LLM responses frequently begin with \n\n, which survives through
sanitizeUserFacingText and reaches the channel as visible blank lines.

Root cause: the function used trimmed text for empty-checks but returned
the untrimmed 'stripped' variable. Two one-line fixes:
1. Return empty string (not whitespace-only 'stripped') for blank input
2. Apply trimStart() to the final return value

Fixes the same issue as #8052 and #10612 but at the root cause
(sanitizeUserFacingText) rather than scattering trimStart across
multiple delivery paths.

* Changelog: note sanitizeUserFacingText whitespace normalization

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-14 09:23:05 -06:00
Vladimir Peshekhonov 957b883082
fix(agents): stabilize overflow compaction retries and session context accounting (openclaw#14102) thanks @vpesh
Verified:
- CI checks for commit 86a7ecb45e
- Rebase conflict resolution for compatibility with latest main

Co-authored-by: vpesh <9496634+vpesh@users.noreply.github.com>
2026-02-12 17:53:13 -06:00
fagemx bdd0c12329
fix(providers): include provider name in billing error messages (#14697)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 774e0b6605
Co-authored-by: fagemx <117356295+fagemx@users.noreply.github.com>
Co-authored-by: shakkernerd <165377636+shakkernerd@users.noreply.github.com>
Reviewed-by: @shakkernerd
2026-02-12 18:23:27 +00:00
0xRain 4f329f923c
fix(agents): narrow billing error 402 regex to avoid false positives on issue IDs (#13827)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: b0501bbab7
Co-authored-by: 0xRaini <190923101+0xRaini@users.noreply.github.com>
Co-authored-by: sebslight <19554889+sebslight@users.noreply.github.com>
Reviewed-by: @sebslight
2026-02-12 09:18:06 -05:00
Rodrigo Uroz b912d3992d
(fix): handle Cloudflare 521 and transient 5xx errors gracefully (#13500)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: a8347e95c5
Co-authored-by: rodrigouroz <384037+rodrigouroz@users.noreply.github.com>
Co-authored-by: Takhoffman <781889+Takhoffman@users.noreply.github.com>
Reviewed-by: @Takhoffman
2026-02-11 21:42:33 -06:00
0xRain 729181bd06
fix(agents): exclude rate limit errors from context overflow classification (#13747)
Co-authored-by: 0xRaini <rain@0xRaini.dev>
2026-02-12 08:40:09 +09:00
Rami Abdelrazzaq c2b2d535fb
fix: suggest /clear in context overflow error message (#12973)
* fix: suggest /reset in context overflow error message

When the context window overflows, the error message now suggests
using /reset to clear session history, giving users an actionable
recovery path instead of a dead-end error.

Closes #12940

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: suggest /reset in context overflow error message (#12973) (thanks @RamiNoodle733)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Rami Abdelrazzaq <RamiNoodle733@users.noreply.github.com>
2026-02-09 20:44:37 -06:00
Tak Hoffman 54315aeacf
Agents: scope sanitizeUserFacingText rewrites to errorContext
Squash-merge #12988.

Refs: #12889 #12309 #3594 #7483 #10094 #10368 #11317 #11359 #11649 #12022 #12432 #12676 #12711
2026-02-09 19:52:24 -06:00
Shadow 8e607d927c Docs: require labeler + label updates for channels/extensions 2026-02-09 17:08:18 -08:00
Jabez Borja 8c73dbe705
fix(telegram): prevent false-positive billing error detection in conversation text (#12946) thanks @jabezborja 2026-02-09 19:49:31 -05:00
max 79c2466662
refactor: consolidate throwIfAborted + fix isCompactionFailureError (#12463)
* refactor: consolidate throwIfAborted in outbound module

- Create abort.ts with shared throwIfAborted helper

- Update deliver.ts, message-action-runner.ts, outbound-send-service.ts

* fix: handle context overflow in isCompactionFailureError without requiring colon
2026-02-09 00:32:57 -08:00
max f0924d3c4e
refactor: consolidate PNG encoder and safeParseJson utilities (#12457)
- Create shared PNG encoder module (src/media/png-encode.ts)

- Refactor qr-image.ts and live-image-probe.ts to use shared encoder

- Add safeParseJson to utils.ts and plugin-sdk exports

- Update msteams and pairing-store to use centralized safeParseJson
2026-02-09 00:21:54 -08:00
Tyler Yust e4651d6afa
Memory/QMD: reuse default model cache and skip ENOENT warnings (#12114)
* Memory/QMD: symlink default model cache into custom XDG_CACHE_HOME

QmdMemoryManager overrides XDG_CACHE_HOME to isolate the qmd index
per-agent, but this also moves where qmd looks for its ML models
(~2.1GB). Since models are installed at the default location
(~/.cache/qmd/models/), every qmd invocation would attempt to
re-download them from HuggingFace and time out.

Fix: on initialization, symlink ~/.cache/qmd/models/ into the custom
XDG_CACHE_HOME path so the index stays isolated per-agent while the
shared models are reused. The symlink is only created when the default
models directory exists and the target path does not already exist.

Includes tests for the three key scenarios: symlink creation, existing
directory preservation, and graceful skip when no default models exist.

* Memory/QMD: skip model symlink warning on ENOENT

* test: stabilize warning-filter visibility assertion (#12114) (thanks @tyler6204)

* fix: add changelog entry for QMD cache reuse (#12114) (thanks @tyler6204)

* fix: handle plain context-overflow strings in compaction detection (#12114) (thanks @tyler6204)
2026-02-08 23:43:08 -08:00
Stephen Brian King c984e6d8df
fix: prevent false positive context overflow detection in conversation text (#2078) 2026-02-08 23:22:57 -08:00