docs: refresh media auto-detect refs

Peter Steinberger 2026-04-04 10:05:30 +01:00
parent 459ede5a7e
commit 2a5da613f4
3 changed files with 31 additions and 14 deletions


@@ -23,12 +23,15 @@ title: "Audio and Voice Notes"
 If you **don't configure models** and `tools.media.audio.enabled` is **not** set to `false`,
 OpenClaw auto-detects in this order and stops at the first working option:
-1. **Local CLIs** (if installed)
+1. **Active reply model** when its provider supports audio understanding.
+2. **Local CLIs** (if installed)
    - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
    - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
    - `whisper` (Python CLI; downloads models automatically)
-2. **Gemini CLI** (`gemini`) using `read_many_files`
-3. **Provider keys** (OpenAI → Groq → Deepgram → Google)
+3. **Gemini CLI** (`gemini`) using `read_many_files`
+4. **Provider auth**
+   - Configured `models.providers.*` entries that support audio are tried first
+   - Bundled fallback order: OpenAI → Groq → Deepgram → Google → Mistral
 To disable auto-detection, set `tools.media.audio.enabled: false`.
 To customize, set `tools.media.audio.models`.
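
The two knobs in the closing lines of this hunk can be sketched in config form. A minimal sketch, assuming YAML-style configuration; the `openai/whisper-1` ref is a hypothetical placeholder, not confirmed by this diff:

```yaml
tools:
  media:
    audio:
      enabled: false   # opt out of audio auto-detection entirely
```

or, to pin transcription to explicit models rather than rely on auto-detect:

```yaml
tools:
  media:
    audio:
      models:
        - openai/whisper-1   # hypothetical model ref
```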


@@ -48,6 +48,7 @@ The WhatsApp channel runs via **Baileys Web**. This document captures the curren
 - Media understanding (if configured via `tools.media.*` or shared `tools.media.models`) runs before templating and can insert `[Image]`, `[Audio]`, and `[Video]` blocks into `Body`.
 - Audio sets `{{Transcript}}` and uses the transcript for command parsing so slash commands still work.
 - Video and image descriptions preserve any caption text for command parsing.
+- If the active primary image model already supports vision natively, OpenClaw skips the `[Image]` summary block and passes the original image to the model instead.
 - By default only the first matching image/audio/video attachment is processed; set `tools.media.<cap>.attachments` to process multiple attachments.
 ## Limits & Errors


@@ -133,6 +133,9 @@ Rules:
 - Audio files smaller than **1024 bytes** are treated as empty/corrupt and skipped before provider/CLI transcription.
 - If the model returns more than `maxChars`, output is trimmed.
 - `prompt` defaults to a simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).
+- If the active primary image model already supports vision natively, OpenClaw
+  skips the `[Image]` summary block and passes the original image into the
+  model instead.
 - If `<capability>.enabled: true` but no models are configured, OpenClaw tries the
   **active reply model** when its provider supports the capability.
@@ -142,15 +145,22 @@ If `tools.media.<capability>.enabled` is **not** set to `false` and you haven't
 configured models, OpenClaw auto-detects in this order and **stops at the first
 working option**:
-1. **Local CLIs** (audio only; if installed)
+1. **Active reply model** when its provider supports the capability.
+2. **`agents.defaults.imageModel`** primary/fallback refs (image only).
+3. **Local CLIs** (audio only; if installed)
    - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
    - `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
    - `whisper` (Python CLI; downloads models automatically)
-2. **Gemini CLI** (`gemini`) using `read_many_files`
-3. **Provider keys**
-   - Audio: OpenAI → Groq → Deepgram → Google → Mistral
-   - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
-   - Video: Google
+4. **Gemini CLI** (`gemini`) using `read_many_files`
+5. **Provider auth**
+   - Configured `models.providers.*` entries that support the capability are
+     tried before the bundled fallback order.
+   - Image-only config providers with an image-capable model auto-register for
+     media understanding even when they are not a bundled vendor plugin.
+   - Bundled fallback order:
+     - Audio: OpenAI → Groq → Deepgram → Google → Mistral
+     - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
+     - Video: Google → Moonshot
 To disable auto-detection, set:
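
The config block that "To disable auto-detection, set:" introduces sits outside this hunk and is not shown. Generalizing the audio example from the first file, its per-capability shape would presumably be (a sketch, not the verbatim file contents):

```yaml
tools:
  media:
    image:
      enabled: false
    audio:
      enabled: false
    video:
      enabled: false
```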
@@ -190,22 +200,25 @@ lists, OpenClaw can infer defaults:
 - `openai`, `anthropic`, `minimax`: **image**
 - `minimax-portal`: **image**
 - `moonshot`: **image + video**
+- `openrouter`: **image**
 - `google` (Gemini API): **image + audio + video**
 - `mistral`: **audio**
 - `zai`: **image**
 - `groq`: **audio**
 - `deepgram`: **audio**
+- Any `models.providers.<id>.models[]` catalog with an image-capable model:
+  **image**
 For CLI entries, **set `capabilities` explicitly** to avoid surprising matches.
 If you omit `capabilities`, the entry is eligible for the list it appears in.
 ## Provider support matrix (OpenClaw integrations)
-| Capability | Provider integration                               | Notes                                                                   |
-| ---------- | -------------------------------------------------- | ----------------------------------------------------------------------- |
-| Image      | OpenAI, Anthropic, Google, MiniMax, Moonshot, Z.AI | Vendor plugins register image support against core media understanding. |
-| Audio      | OpenAI, Groq, Deepgram, Google, Mistral            | Provider transcription (Whisper/Deepgram/Gemini/Voxtral).               |
-| Video      | Google, Moonshot                                   | Provider video understanding via vendor plugins.                        |
+| Capability | Provider integration                                                             | Notes                                                                                |
+| ---------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
+| Image      | OpenAI, OpenRouter, Anthropic, Google, MiniMax, Moonshot, Z.AI, config providers | Vendor plugins register image support; image-capable config providers auto-register. |
+| Audio      | OpenAI, Groq, Deepgram, Google, Mistral                                          | Provider transcription (Whisper/Deepgram/Gemini/Voxtral).                            |
+| Video      | Google, Moonshot                                                                 | Provider video understanding via vendor plugins.                                     |
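
The "set `capabilities` explicitly" advice above can be illustrated with a sketch. The entry id, model name, and nesting beyond `models.providers.*` and `capabilities` are hypothetical placeholders; only those two key paths appear in this diff:

```yaml
models:
  providers:
    local-transcriber:         # hypothetical config-provider id
      capabilities: [audio]    # declared explicitly so the entry never matches image/video
      models:
        - whisper-large-v3     # hypothetical catalog entry
```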
## Model selection guidance