From 2a5da613f4293ed01356703f520c230eb7719fb2 Mon Sep 17 00:00:00 2001 From: Peter Steinberger Date: Sat, 4 Apr 2026 10:05:30 +0100 Subject: [PATCH] docs: refresh media auto-detect refs --- docs/nodes/audio.md | 9 +++++--- docs/nodes/images.md | 1 + docs/nodes/media-understanding.md | 35 +++++++++++++++++++++---------- 3 files changed, 31 insertions(+), 14 deletions(-) diff --git a/docs/nodes/audio.md b/docs/nodes/audio.md index 19bf814fb32..8a989e2ccb7 100644 --- a/docs/nodes/audio.md +++ b/docs/nodes/audio.md @@ -23,12 +23,15 @@ title: "Audio and Voice Notes" If you **don’t configure models** and `tools.media.audio.enabled` is **not** set to `false`, OpenClaw auto-detects in this order and stops at the first working option: -1. **Local CLIs** (if installed) +1. **Active reply model** when its provider supports audio understanding. +2. **Local CLIs** (if installed) - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens) - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model) - `whisper` (Python CLI; downloads models automatically) -2. **Gemini CLI** (`gemini`) using `read_many_files` -3. **Provider keys** (OpenAI → Groq → Deepgram → Google) +3. **Gemini CLI** (`gemini`) using `read_many_files` +4. **Provider auth** + - Configured `models.providers.*` entries that support audio are tried first + - Bundled fallback order: OpenAI → Groq → Deepgram → Google → Mistral To disable auto-detection, set `tools.media.audio.enabled: false`. To customize, set `tools.media.audio.models`. diff --git a/docs/nodes/images.md b/docs/nodes/images.md index ab3d228fa74..a4f0a051da0 100644 --- a/docs/nodes/images.md +++ b/docs/nodes/images.md @@ -48,6 +48,7 @@ The WhatsApp channel runs via **Baileys Web**. 
This document captures the current behavior:
- Media understanding (if configured via `tools.media.*` or shared `tools.media.models`) runs before templating and can insert `[Image]`, `[Audio]`, and `[Video]` blocks into `Body`.
- Audio sets `{{Transcript}}` and uses the transcript for command parsing so slash commands still work.
- Video and image descriptions preserve any caption text for command parsing.
+ - If the active primary image model already supports vision natively, OpenClaw skips the `[Image]` summary block and passes the original image to the model instead.
- By default only the first matching image/audio/video attachment is processed; set `tools.media.<type>.attachments` to process multiple attachments.

## Limits & Errors
diff --git a/docs/nodes/media-understanding.md b/docs/nodes/media-understanding.md
index 01956134914..186c5f6a70f 100644
--- a/docs/nodes/media-understanding.md
+++ b/docs/nodes/media-understanding.md
@@ -133,6 +133,9 @@ Rules:
- Audio files smaller than **1024 bytes** are treated as empty/corrupt and skipped before provider/CLI transcription.
- If the model returns more than `maxChars`, output is trimmed.
- `prompt` defaults to simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).
+- If the active primary image model already supports vision natively, OpenClaw
+  skips the `[Image]` summary block and passes the original image into the
+  model instead.
- If `.enabled: true` but no models are configured, OpenClaw tries the
  **active reply model** when its provider supports the capability.

@@ -142,15 +145,22 @@
If `tools.media.<type>.enabled` is **not** set to `false` and you haven’t
configured models, OpenClaw auto-detects in this order and **stops at the
first working option**:

-1. **Local CLIs** (audio only; if installed)
+1. **Active reply model** when its provider supports the capability.
+2. **`agents.defaults.imageModel`** primary/fallback refs (image only).
+3. 
**Local CLIs** (audio only; if installed)
   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   - `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   - `whisper` (Python CLI; downloads models automatically)
-2. **Gemini CLI** (`gemini`) using `read_many_files`
-3. **Provider keys**
-   - Audio: OpenAI → Groq → Deepgram → Google → Mistral
-   - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
-   - Video: Google
+4. **Gemini CLI** (`gemini`) using `read_many_files`
+5. **Provider auth**
+   - Configured `models.providers.*` entries that support the capability are
+     tried before the bundled fallback order.
+   - Image-only config providers with an image-capable model auto-register for
+     media understanding even when they are not a bundled vendor plugin.
+   - Bundled fallback order:
+     - Audio: OpenAI → Groq → Deepgram → Google → Mistral
+     - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
+     - Video: Google → Moonshot

To disable auto-detection, set:

@@ -190,22 +200,25 @@ lists, OpenClaw can infer defaults:
- `openai`, `anthropic`, `minimax`: **image**
- `minimax-portal`: **image**
- `moonshot`: **image + video**
+- `openrouter`: **image**
- `google` (Gemini API): **image + audio + video**
- `mistral`: **audio**
- `zai`: **image**
- `groq`: **audio**
- `deepgram`: **audio**
+- Any `models.providers.<id>.models[]` catalog with an image-capable model:
+  **image**

For CLI entries, **set `capabilities` explicitly** to avoid surprising matches.
If you omit `capabilities`, the entry is eligible for the list it appears in.

## Provider support matrix (OpenClaw integrations) -| Capability | Provider integration | Notes | -| ---------- | -------------------------------------------------- | ----------------------------------------------------------------------- | -| Image | OpenAI, Anthropic, Google, MiniMax, Moonshot, Z.AI | Vendor plugins register image support against core media understanding. | -| Audio | OpenAI, Groq, Deepgram, Google, Mistral | Provider transcription (Whisper/Deepgram/Gemini/Voxtral). | -| Video | Google, Moonshot | Provider video understanding via vendor plugins. | +| Capability | Provider integration | Notes | +| ---------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ | +| Image | OpenAI, OpenRouter, Anthropic, Google, MiniMax, Moonshot, Z.AI, config providers | Vendor plugins register image support; image-capable config providers auto-register. | +| Audio | OpenAI, Groq, Deepgram, Google, Mistral | Provider transcription (Whisper/Deepgram/Gemini/Voxtral). | +| Video | Google, Moonshot | Provider video understanding via vendor plugins. | ## Model selection guidance
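Taken together, the toggles this patch documents map to a small config surface. A minimal sketch, assuming a standard YAML config file — the key names come from the docs above, but the nesting and all values here are illustrative, not authoritative:

```yaml
# Illustrative sketch only — key names are taken from this patch's docs;
# exact nesting and values may differ in a real OpenClaw config.
tools:
  media:
    audio:
      enabled: false       # opt out of audio auto-detection entirely
      # models: [...]      # or pin an explicit transcription chain instead
    image:
      enabled: true        # leave image auto-detection on
agents:
  defaults:
    imageModel: gemini-2.5-flash   # hypothetical ref; consulted for image understanding (step 2)
```

With `tools.media.audio.enabled: false` the audio auto-detect chain (reply model → local CLIs → Gemini CLI → provider auth) is skipped; with `models` set, only the configured chain runs.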