docs: expand model fallback guide

2026-04-04 09:13:33 +09:00 · 2026-04-04 09:13:33 +09:00 · d01cb5ecc6
parent 5bea93fd63
commit d01cb5ecc6
2 changed files with 137 additions and 0 deletions
--- a/docs/concepts/model-failover.md
+++ b/docs/concepts/model-failover.md
@ -3,6 +3,7 @@ summary: "How OpenClaw rotates auth profiles and falls back across models"
 read_when:
  - Diagnosing auth profile rotation, cooldowns, or model fallback behavior
  - Updating failover rules for auth profiles or models
+  - Understanding how session model overrides interact with fallback retries
 title: "Model Failover"
 ---

@ -15,6 +16,44 @@ OpenClaw handles failures in two stages:

 This doc explains the runtime rules and the data that backs them.

+## Runtime flow
+
+For a normal text run, OpenClaw evaluates candidates in this order:
+
+1. The currently selected session model.
+2. Configured `agents.defaults.model.fallbacks` in order.
+3. The configured primary model at the end when the run started from an override.
+
+Inside each candidate, OpenClaw tries auth-profile failover before advancing to
+the next model candidate.
+
+High-level sequence:
+
+1. Resolve the active session model and auth-profile preference.
+2. Build the model candidate chain.
+3. Try the current provider with auth-profile rotation/cooldown rules.
+4. If that provider is exhausted with a failover-worthy error, move to the next
+   model candidate.
+5. Persist the selected fallback override before the retry starts so other
+   session readers see the same provider/model the runner is about to use.
+6. If the fallback candidate fails, roll back only the fallback-owned session
+   override fields when they still match that failed candidate.
+7. If every candidate fails, throw a `FallbackSummaryError` with per-attempt
+   detail and the soonest cooldown expiry when one is known.
+
+This is intentionally narrower than "save and restore the whole session". The
+reply runner only persists the model-selection fields it owns for fallback:
+
+- `providerOverride`
+- `modelOverride`
+- `authProfileOverride`
+- `authProfileOverrideSource`
+- `authProfileOverrideCompactionCount`
+
+That prevents a failed fallback retry from overwriting newer unrelated session
+mutations such as manual `/model` changes or session rotation updates that
+happened while the attempt was running.
+
 ## Auth storage (keys + OAuth)

 OpenClaw uses **auth profiles** for both API keys and OAuth tokens.
@ -148,6 +187,102 @@ with `auth.cooldowns.overloadedProfileRotations`,
 When a run starts with a model override (hooks or CLI), fallbacks still end at
 `agents.defaults.model.primary` after trying any configured fallbacks.

+### Candidate chain rules
+
+OpenClaw builds the candidate list from the currently requested `provider/model`
+plus configured fallbacks.
+
+Rules:
+
+- The requested model is always first.
+- Explicit configured fallbacks are deduplicated but not filtered by the model
+  allowlist. They are treated as explicit operator intent.
+- If the current run is already on a configured fallback in the same provider
+  family, OpenClaw keeps using the full configured chain.
+- If the current run is on a different provider than config and that current
+  model is not already part of the configured fallback chain, OpenClaw does not
+  append unrelated configured fallbacks from another provider.
+- When the run started from an override, the configured primary is appended at
+  the end so the chain can settle back onto the normal default once earlier
+  candidates are exhausted.
+
+### Which errors advance fallback
+
+Model fallback continues on:
+
+- auth failures
+- rate limits and cooldown exhaustion
+- overloaded/provider-busy errors
+- timeout-shaped failover errors
+- billing disables
+- `LiveSessionModelSwitchError`, which is normalized into a failover path so a
+  stale persisted model does not create an outer retry loop
+- other unrecognized errors when there are still remaining candidates
+
+Model fallback does not continue on:
+
+- explicit aborts that are not timeout/failover-shaped
+- context overflow errors that should stay inside compaction/retry logic
+- a final unknown error when there are no candidates left
+
+### Cooldown skip vs probe behavior
+
+When every auth profile for a provider is already in cooldown, OpenClaw does
+not automatically skip that provider forever. It makes a per-candidate decision:
+
+- Persistent auth failures skip the whole provider immediately.
+- Billing disables usually skip, but the primary candidate can still be probed
+  on a throttle so recovery is possible without restarting.
+- The primary candidate may be probed near cooldown expiry, with a per-provider
+  throttle.
+- Same-provider fallback siblings can be attempted despite cooldown when the
+  failure looks transient (`rate_limit`, `overloaded`, or unknown).
+- Transient cooldown probes are limited to one per provider per fallback run so
+  a single provider does not stall cross-provider fallback.
+
+## Session overrides and live model switching
+
+Session model changes are shared state. The active runner, `/model` command,
+compaction/session updates, and live-session reconciliation all read or write
+parts of the same session entry.
+
+That means fallback retries have to coordinate with live model switching:
+
+- Before a fallback retry starts, the reply runner persists the selected
+  fallback override fields to the session entry.
+- Live-session reconciliation prefers persisted session overrides over stale
+  runtime model fields.
+- If the fallback attempt fails, the runner rolls back only the override fields
+  it wrote, and only if they still match that failed candidate.
+
+This prevents the classic race:
+
+1. Primary fails.
+2. Fallback candidate is chosen in memory.
+3. Session store still says the old primary.
+4. Live-session reconciliation reads the stale session state.
+5. The retry gets snapped back to the old model before the fallback attempt
+   starts.
+
+The persisted fallback override closes that window, and the narrow rollback
+keeps newer manual or runtime session changes intact.
+
+## Observability and failure summaries
+
+`runWithModelFallback(...)` records per-attempt details that feed logs and
+user-facing cooldown messaging:
+
+- provider/model attempted
+- reason (`rate_limit`, `overloaded`, `billing`, `auth`, `model_not_found`, and
+  similar failover reasons)
+- optional status/code
+- human-readable error summary
+
+When every candidate fails, OpenClaw throws `FallbackSummaryError`. The outer
+reply runner can use that to build a more specific message such as "all models
+are temporarily rate-limited" and include the soonest cooldown expiry when one
+is known.
+
 ## Related config

 See [Gateway configuration](/gateway/configuration) for:
--- a/docs/concepts/model-providers.md
+++ b/docs/concepts/model-providers.md
@ -16,6 +16,8 @@ For model selection rules, see [/concepts/models](/concepts/models).
 - Model refs use `provider/model` (example: `opencode/claude-opus-4-6`).
 - If you set `agents.defaults.models`, it becomes the allowlist.
 - CLI helpers: `openclaw onboard`, `openclaw models list`, `openclaw models set <provider/model>`.
+- Fallback runtime rules, cooldown probes, and session-override persistence are
+  documented in [/concepts/model-failover](/concepts/model-failover).
 - Provider plugins can inject model catalogs via `registerProvider({ catalog })`;
  OpenClaw merges that output into `models.providers` before writing
  `models.json`.