From 2a5da613f4293ed01356703f520c230eb7719fb2 Mon Sep 17 00:00:00 2001 From: Peter Steinberger Date: Sat, 4 Apr 2026 10:05:30 +0100 Subject: [PATCH] docs: refresh media auto-detect refs --- docs/nodes/audio.md | 9 +++++--- docs/nodes/images.md | 1 + docs/nodes/media-understanding.md | 35 +++++++++++++++++++++---------- 3 files changed, 31 insertions(+), 14 deletions(-) diff --git a/docs/nodes/audio.md b/docs/nodes/audio.md index 19bf814fb32..8a989e2ccb7 100644 --- a/docs/nodes/audio.md +++ b/docs/nodes/audio.md @@ -23,12 +23,15 @@ title: "Audio and Voice Notes" If you **don’t configure models** and `tools.media.audio.enabled` is **not** set to `false`, OpenClaw auto-detects in this order and stops at the first working option: -1. **Local CLIs** (if installed) +1. **Active reply model** when its provider supports audio understanding. +2. **Local CLIs** (if installed) - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens) - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model) - `whisper` (Python CLI; downloads models automatically) -2. **Gemini CLI** (`gemini`) using `read_many_files` -3. **Provider keys** (OpenAI → Groq → Deepgram → Google) +3. **Gemini CLI** (`gemini`) using `read_many_files` +4. **Provider auth** + - Configured `models.providers.*` entries that support audio are tried first + - Bundled fallback order: OpenAI → Groq → Deepgram → Google → Mistral To disable auto-detection, set `tools.media.audio.enabled: false`. To customize, set `tools.media.audio.models`. diff --git a/docs/nodes/images.md b/docs/nodes/images.md index ab3d228fa74..a4f0a051da0 100644 --- a/docs/nodes/images.md +++ b/docs/nodes/images.md @@ -48,6 +48,7 @@ The WhatsApp channel runs via **Baileys Web**. 
This document captures the current behavior:
- Media understanding (if configured via `tools.media.*` or shared `tools.media.models`) runs before templating and can insert `[Image]`, `[Audio]`, and `[Video]` blocks into `Body`.
- Audio sets `{{Transcript}}` and uses the transcript for command parsing so slash commands still work.
- Video and image descriptions preserve any caption text for command parsing.
+ - If the active primary image model already supports vision natively, OpenClaw skips the `[Image]` summary block and passes the original image to the model instead.
- By default only the first matching image/audio/video attachment is processed; set `tools.media.<type>.attachments` to process multiple attachments.

## Limits & Errors
diff --git a/docs/nodes/media-understanding.md b/docs/nodes/media-understanding.md
index 01956134914..186c5f6a70f 100644
--- a/docs/nodes/media-understanding.md
+++ b/docs/nodes/media-understanding.md
@@ -133,6 +133,9 @@ Rules:
- Audio files smaller than **1024 bytes** are treated as empty/corrupt and skipped before provider/CLI transcription.
- If the model returns more than `maxChars`, output is trimmed.
- `prompt` defaults to simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).
+- If the active primary image model already supports vision natively, OpenClaw
+  skips the `[Image]` summary block and passes the original image into the
+  model instead.
- If `.enabled: true` but no models are configured, OpenClaw tries the
  **active reply model** when its provider supports the capability.

@@ -142,15 +145,22 @@
If `tools.media.<type>.enabled` is **not** set to `false` and you haven’t
configured models, OpenClaw auto-detects in this order and **stops at the
first working option**:

-1. **Local CLIs** (audio only; if installed)
+1. **Active reply model** when its provider supports the capability.
+2. **`agents.defaults.imageModel`** primary/fallback refs (image only).
+3. 
**Local CLIs** (audio only; if installed)
   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   - `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   - `whisper` (Python CLI; downloads models automatically)
-2. **Gemini CLI** (`gemini`) using `read_many_files`
-3. **Provider keys**
-   - Audio: OpenAI → Groq → Deepgram → Google → Mistral
-   - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
-   - Video: Google
+4. **Gemini CLI** (`gemini`) using `read_many_files`
+5. **Provider auth**
+   - Configured `models.providers.*` entries that support the capability are
+     tried before the bundled fallback order.
+   - Image-only config providers with an image-capable model auto-register for
+     media understanding even when they are not a bundled vendor plugin.
+   - Bundled fallback order:
+     - Audio: OpenAI → Groq → Deepgram → Google → Mistral
+     - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
+     - Video: Google → Moonshot

To disable auto-detection, set:

@@ -190,22 +200,25 @@ lists, OpenClaw can infer defaults:
- `openai`, `anthropic`, `minimax`: **image**
- `minimax-portal`: **image**
- `moonshot`: **image + video**
+- `openrouter`: **image**
- `google` (Gemini API): **image + audio + video**
- `mistral`: **audio**
- `zai`: **image**
- `groq`: **audio**
- `deepgram`: **audio**
+- Any `models.providers.<id>.models[]` catalog with an image-capable model:
+  **image**

For CLI entries, **set `capabilities` explicitly** to avoid surprising matches.
If you omit `capabilities`, the entry is eligible for the list it appears in.

## Provider support matrix (OpenClaw integrations) -| Capability | Provider integration | Notes | -| ---------- | -------------------------------------------------- | ----------------------------------------------------------------------- | -| Image | OpenAI, Anthropic, Google, MiniMax, Moonshot, Z.AI | Vendor plugins register image support against core media understanding. | -| Audio | OpenAI, Groq, Deepgram, Google, Mistral | Provider transcription (Whisper/Deepgram/Gemini/Voxtral). | -| Video | Google, Moonshot | Provider video understanding via vendor plugins. | +| Capability | Provider integration | Notes | +| ---------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ | +| Image | OpenAI, OpenRouter, Anthropic, Google, MiniMax, Moonshot, Z.AI, config providers | Vendor plugins register image support; image-capable config providers auto-register. | +| Audio | OpenAI, Groq, Deepgram, Google, Mistral | Provider transcription (Whisper/Deepgram/Gemini/Voxtral). | +| Video | Google, Moonshot | Provider video understanding via vendor plugins. | ## Model selection guidance
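Taken together, the toggles this patch documents map to a small config surface. A minimal sketch, assuming a standard YAML config file — the key names come from the docs above, but the nesting and all values here are illustrative, not authoritative:

```yaml
# Illustrative sketch only — key names are taken from this patch's docs;
# exact nesting and values may differ in a real OpenClaw config.
tools:
  media:
    audio:
      enabled: false       # opt out of audio auto-detection entirely
      # models: [...]      # or pin an explicit transcription chain instead
    image:
      enabled: true        # leave image auto-detection on
agents:
  defaults:
    imageModel: gemini-2.5-flash   # hypothetical ref; consulted for image understanding (step 2)
```

With `tools.media.audio.enabled: false` the audio auto-detect chain (reply model → local CLIs → Gemini CLI → provider auth) is skipped; with `models` set, only the configured chain runs.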