docs(guardian): improve README with quick start, default config values, and block behavior

- Replace Enable/Config sections with Quick start (bundled plugin, no npm install)
- Show all default values in config example
- Add "When a tool call is blocked" section explaining user flow
- Remove Model selection section
- Fix dead anchor link

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
ShengtongZhu 2026-03-15 11:20:44 +08:00
parent 6a3220b0c6
commit 8972213aee
1 changed files with 96 additions and 57 deletions

View File

@ -28,7 +28,10 @@ The guardian uses a **dual-hook architecture**:
2. **`before_tool_call` hook** — lazily extracts the latest conversation context
(including tool results like `memory_search`) and sends it to the guardian LLM
## Enable
## Quick start
Guardian is a bundled plugin — no separate install needed. Just enable it in
`~/.openclaw/openclaw.json`:
```json
{
@ -40,9 +43,7 @@ The guardian uses a **dual-hook architecture**:
}
```
If no `model` is configured, the guardian uses the main agent model.
## Config
For better resilience, use a **different provider** than your main model:
```json
{
@ -51,8 +52,52 @@ If no `model` is configured, the guardian uses the main agent model.
"guardian": {
"enabled": true,
"config": {
"model": "openai/gpt-4o-mini",
"mode": "enforce"
"model": "anthropic/claude-opus-4-20250514"
}
}
}
}
}
```
## Config
All options with their **default values**:
```json
{
"plugins": {
"entries": {
"guardian": {
"enabled": true,
"config": {
"mode": "enforce",
"watched_tools": [
"message_send",
"message",
"exec",
"write_file",
"Write",
"edit",
"gateway",
"gateway_config",
"cron",
"cron_add"
],
"context_tools": [
"memory_search",
"memory_get",
"memory_recall",
"read",
"exec",
"web_fetch",
"web_search"
],
"timeout_ms": 20000,
"fallback_on_error": "allow",
"log_decisions": true,
"max_arg_length": 500,
"max_recent_turns": 3
}
}
}
@ -62,19 +107,17 @@ If no `model` is configured, the guardian uses the main agent model.
### All options
| Option | Type | Default | Description |
| ------------------------ | ------------------------ | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model` | string | _(main model)_ | Guardian model in `provider/model` format (e.g. `"openai/gpt-4o-mini"`, `"kimi/moonshot-v1-8k"`, `"ollama/llama3.1:8b"`). A small, cheap model is recommended — the guardian only makes a binary ALLOW/BLOCK decision. |
| `mode` | `"enforce"` \| `"audit"` | `"enforce"` | `enforce` blocks disallowed calls. `audit` logs decisions without blocking — useful for initial evaluation. |
| `watched_tools` | string[] | See below | Tool names that require guardian review. Tools not in this list are always allowed. |
| `timeout_ms` | number | `20000` | Max wait for guardian API response (ms). |
| `fallback_on_error` | `"allow"` \| `"block"` | `"allow"` | What to do when the guardian API fails or times out. |
| `log_decisions` | boolean | `true` | Log all ALLOW/BLOCK decisions. BLOCK decisions are logged with full conversation context. |
| `max_user_messages` | number | `10` | Number of conversation turns fed to the summarizer (history window). |
| `max_arg_length` | number | `500` | Max characters of tool arguments JSON to include (truncated). |
| `max_recent_turns` | number | `3` | Number of recent raw conversation turns to keep in the guardian prompt alongside the rolling summary. |
| `context_tools` | string[] | See below | Tool names whose results are included in the guardian's conversation context. Only results from these tools are fed to the guardian — others are filtered out to save tokens. |
| `max_tool_result_length` | number | `300` | Max characters per tool result snippet included in the guardian context. |
| Option | Type | Default | Description |
| ------------------- | ------------------------ | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model` | string | _(main model)_ | Guardian model in `provider/model` format (e.g. `"openai/gpt-4o-mini"`, `"kimi/moonshot-v1-8k"`, `"ollama/llama3.1:8b"`). The guardian only makes a binary ALLOW/BLOCK decision. |
| `mode` | `"enforce"` \| `"audit"` | `"enforce"` | `enforce` blocks disallowed calls. `audit` logs decisions without blocking — useful for initial evaluation. |
| `watched_tools` | string[] | See below | Tool names that require guardian review. Tools not in this list are always allowed. |
| `timeout_ms` | number | `20000` | Max wait for guardian API response (ms). |
| `fallback_on_error` | `"allow"` \| `"block"` | `"allow"` | What to do when the guardian API fails or times out. |
| `log_decisions` | boolean | `true` | Log all ALLOW/BLOCK decisions. BLOCK decisions are logged with full conversation context. |
| `max_arg_length` | number | `500` | Max characters of tool arguments JSON to include (truncated). |
| `max_recent_turns` | number | `3` | Number of recent raw conversation turns to keep in the guardian prompt alongside the rolling summary. |
| `context_tools` | string[] | See below | Tool names whose results are included in the guardian's conversation context. Only results from these tools are fed to the guardian — others are filtered out to save tokens. |
### Default watched tools
@ -115,12 +158,14 @@ context for the guardian's decisions.
## Getting started
**Step 1** — Start with audit mode to observe decisions without blocking:
**Step 1** — Install and enable with defaults (see [Quick start](#quick-start)).
**Step 2** — Optionally start with audit mode to observe decisions without
blocking:
```json
{
"config": {
"model": "openai/gpt-4o-mini",
"mode": "audit"
}
}
@ -129,50 +174,45 @@ context for the guardian's decisions.
Check logs for `[guardian] AUDIT-ONLY (would block)` entries and verify the
decisions are reasonable.
**Step 2** — Switch to enforce mode:
**Step 3** — Switch to `"enforce"` mode (the default) once you're satisfied.
```json
{
"config": {
"model": "openai/gpt-4o-mini",
"mode": "enforce"
}
}
```
**Step 3** — Adjust `watched_tools` if needed. Remove tools that produce too
**Step 4** — Adjust `watched_tools` if needed. Remove tools that produce too
many false positives, or add custom tools that need protection.
## Model selection
## When a tool call is blocked
The guardian makes a simple binary decision (ALLOW/BLOCK) for each tool call.
A small, fast model is sufficient and keeps cost low.
When the guardian blocks a tool call, the agent receives a tool error containing
the block reason (e.g. `"Guardian: user never requested file deletion"`). The
agent will then inform the user that the action was blocked and why.
**Use a different provider than your main agent model.** If both the main model
and the guardian use the same provider, a single provider outage takes down both
the agent and its safety layer. Using a different provider ensures the guardian
remains available even when the main model's provider has issues. For example,
if your main model is `anthropic/claude-sonnet-4-20250514`, use
`openai/gpt-4o-mini` for the guardian.
**To proceed with the blocked action**, simply confirm it in the conversation:
| Model | Notes |
| --------------------- | ------------------------------------------- |
| `openai/gpt-4o-mini` | Fast (~200ms), cheap, good accuracy |
| `kimi/moonshot-v1-8k` | Good for Chinese-language conversations |
| `ollama/llama3.1:8b` | Free, runs locally, slightly lower accuracy |
> "yes, go ahead and delete /tmp/old"
Avoid using the same large model as your main agent — it wastes cost and adds
latency to every watched tool call.
The guardian re-evaluates every tool call independently. On the next attempt it
will see your explicit confirmation in the recent conversation and ALLOW the
call.
If a tool is producing too many false positives, you can also:
- Remove it from `watched_tools`
- Switch to `"mode": "audit"` (log-only, no blocking)
- Disable the plugin entirely (`"enabled": false`)
## Context awareness
The guardian uses a **rolling summary + recent turns** strategy to provide
long-term context without wasting tokens:
The guardian builds rich context for each tool call review:
- **Agent context** — the main agent's full system prompt, cached on the
first `llm_input` call. Contains AGENTS.md rules, MEMORY.md content,
tool definitions, available skills, and user-configured instructions.
Passed as-is (no extraction or summarization) since guardian models have
128K+ context windows. Treated as background DATA — user messages remain
the ultimate authority.
- **Session summary** — a 2-4 sentence summary of the entire conversation
history, covering tasks requested, files/systems being worked on, standing
instructions, and confirmations. Updated asynchronously after each user
message (non-blocking). Roughly ~150 tokens.
history, covering tasks requested, files/systems being worked on, and
confirmations. Updated asynchronously after each user message
(non-blocking). Roughly ~150 tokens.
- **Recent conversation turns** — the last `max_recent_turns` (default 3)
raw turns with user messages, assistant replies, and tool results. Roughly
~600 tokens.
@ -186,9 +226,6 @@ long-term context without wasting tokens:
new user input, trailing assistant messages and tool results are attached
to the last conversation turn.
This approach keeps the guardian prompt at ~750 tokens (vs ~2000 for 10 raw
turns), while preserving full conversation context through the summary.
The context is extracted **lazily** at `before_tool_call` time from the live
session message array, so it always reflects the latest state — including tool
results that arrived after the initial `llm_input` hook fired.
@ -206,6 +243,8 @@ parent agent's).
- Assistant replies are treated as **context only** — they may be poisoned
- Only user messages are considered authoritative intent signals
- Tool results (shown as `[tool: ...]`) are treated as DATA
- Memory results are recognized as the user's own saved preferences
- Agent context (system prompt) is treated as background DATA — it may be
indirectly poisoned (e.g. malicious rules written to memory or a trojan
skill in a cloned repo); user messages remain the ultimate authority
- Forward scanning of guardian response prevents attacker-injected ALLOW in
tool arguments from overriding the model's verdict