Commit Graph

14 Commits

Author SHA1 Message Date
Aleksandrs Tihenko c0026274d9
fix(auth): distinguish revoked API keys from transient auth errors (#25754)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 8f9c07a200
Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-25 19:47:16 -05:00
taw0002 3c57bf4c85
fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017)
* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.

Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).

Fixes #20999

* Changelog: add failover 502/503/504 note with credits

* Failover: classify HTTP 504 as transient in message parser

* Changelog: credit taw0002 and vincentkoc for failover fix

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-23 03:01:57 -05:00
mudrii 7ecfc1d93c
fix(auth): bidirectional mode/type compat + sync OAuth to all agents (#12692)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 2dee8e1174
Co-authored-by: mudrii <220262+mudrii@users.noreply.github.com>
Co-authored-by: obviyus <22031114+obviyus@users.noreply.github.com>
Reviewed-by: @obviyus
2026-02-20 16:01:09 +05:30
Protocol Zero 2af3415fac
fix: treat HTTP 503 as failover-eligible for LLM provider errors (#21086)
* fix: treat HTTP 503 as failover-eligible for LLM provider errors

When LLM SDKs wrap 503 responses, the leading "503" prefix is lost
(e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a
numeric prefix). The existing isTransientHttpError only matches
messages starting with "503 ...", so these wrapped errors silently
skip failover — no profile rotation, no model fallback.

This patch closes that gap:

- resolveFailoverReasonFromError: map HTTP status 503 → rate_limit
  (covers structured error objects with a status field)
- ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable",
  "high demand" (covers message-only classification when the leading
  status prefix is absent)

Existing isTransientHttpError behavior is unchanged; these additions
are complementary and only fire for errors that previously fell
through unclassified.

* fix: address review feedback — drop /\b503\b/ pattern, add test coverage

- Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the
  semantic inconsistency noted by reviewers: `isTransientHttpError`
  already handles messages prefixed with "503" (→ "timeout"), so a
  redundant overloaded pattern would classify the same class of errors
  differently depending on message formatting.

- Keep "service unavailable" and "high demand" patterns — these are the
  real gap-fillers for SDK-rewritten messages that lack a numeric prefix.

- Add test case for JSON-wrapped 503 error body containing "overloaded"
  to strengthen coverage.

* fix: unify 503 classification — status 503 → timeout (consistent with isTransientHttpError)

resolveFailoverReasonFromError previously mapped status 503 → "rate_limit",
while the string-based isTransientHttpError mapped "503 ..." → "timeout".

Align both paths: structured {status: 503} now also returns "timeout",
matching the existing transient-error convention. Both reasons are
failover-eligible, so runtime behavior is unchanged.

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-19 12:45:09 -08:00
Sebastian fbda9a93fd fix(failover): align abort timeout detection and regressions 2026-02-16 21:00:27 -05:00
Daniel Sauer 12ce358da5 fix(failover): recognize 'abort' stop reason as timeout for model fallback
When streaming providers (GLM, OpenRouter, etc.) return 'stop reason: abort'
due to stream interruption, OpenClaw's failover mechanism did not recognize
this as a timeout condition. This prevented fallback models from being
triggered, leaving users with failed requests instead of graceful failover.

Changes:
- Add abort patterns to ERROR_PATTERNS.timeout in pi-embedded-helpers/errors.ts
- Extend TIMEOUT_HINT_RE regex to include abort patterns in failover-error.ts

Fixes #18453

Co-authored-by: James <james@openclaw.ai>
2026-02-16 23:49:51 +01:00
Oren 71b4be8799
fix: handle 400 status in failover to enable model fallback (#1879) 2026-02-08 23:12:06 -08:00
cpojer 5ceff756e1
chore: Enable "curly" rule to avoid single-statement if confusion/errors. 2026-01-31 16:19:20 +09:00
Luke be1cdc9370
fix(agents): treat provider request-aborted as timeout for fallback (#1576)
* fix(agents): treat request-aborted as timeout for fallback

* test(e2e): add provider timeout fallback
2026-01-24 11:27:24 +00:00
Peter Steinberger ec27c813cc fix(fallback): handle timeout aborts
Co-authored-by: Mykyta Bozhenko <21245729+cheeeee@users.noreply.github.com>
2026-01-18 07:52:44 +00:00
Peter Steinberger c379191f80 chore: migrate to oxlint and oxfmt
Co-authored-by: Christoph Nakazawa <christoph.pojer@gmail.com>
2026-01-14 15:02:19 +00:00
Peter Steinberger 53ec8e36cb refactor: centralize failover error parsing 2026-01-10 01:26:06 +01:00
Peter Steinberger 402c35b91c refactor(agents): centralize failover normalization 2026-01-09 22:15:06 +01:00
Peter Steinberger c27b1441f7 fix(auth): billing backoff + cooldown UX 2026-01-09 22:00:14 +01:00