| title | summary | read_when |
|---|---|---|
| Prompt Caching | Prompt caching knobs, merge order, provider behavior, and tuning patterns | |
# Prompt caching
Prompt caching means the model provider can reuse unchanged prompt prefixes (usually system/developer instructions and other stable context) across turns instead of re-processing them every time. OpenClaw normalizes provider usage into `cacheRead` and `cacheWrite` where the upstream API exposes those counters directly.
Why this matters: lower token cost, faster responses, and more predictable performance for long-running sessions. Without caching, repeated prompts pay the full prompt cost on every turn even when most input did not change.
This page covers all cache-related knobs that affect prompt reuse and token cost.
Provider references:
- Anthropic prompt caching: https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- OpenAI prompt caching: https://developers.openai.com/api/docs/guides/prompt-caching
- OpenAI API headers and request IDs: https://developers.openai.com/api/reference/overview
- Anthropic request IDs and errors: https://platform.claude.com/docs/en/api/errors
## Primary knobs
### `cacheRetention` (global default, model, and per-agent)
Set cache retention as a global default for all models:
```yaml
agents:
  defaults:
    params:
      cacheRetention: "long" # none | short | long
```
Override per-model:
```yaml
agents:
  defaults:
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "short" # none | short | long
```
Per-agent override:
```yaml
agents:
  list:
    - id: "alerts"
      params:
        cacheRetention: "none"
```
Config merge order:

1. `agents.defaults.params` (global default — applies to all models)
2. `agents.defaults.models["provider/model"].params` (per-model override)
3. `agents.list[].params` (matching agent id; overrides by key)
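As a combined sketch of that merge order, the following config (reusing the model ref and agent id from the examples above) resolves `cacheRetention` to `"short"` for the default agent on that model and to `"none"` for the `alerts` agent:

```yaml
agents:
  defaults:
    params:
      cacheRetention: "long"      # 1. global default
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "short" # 2. per-model override beats the global default
  list:
    - id: "alerts"
      params:
        cacheRetention: "none"    # 3. agent-level override beats both
```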
### `contextPruning.mode: "cache-ttl"`
Prunes old tool-result context after cache TTL windows so post-idle requests do not re-cache oversized history.
```yaml
agents:
  defaults:
    contextPruning:
      mode: "cache-ttl"
      ttl: "1h"
```
See Session Pruning for full behavior.
### Heartbeat keep-warm
Heartbeat can keep cache windows warm and reduce repeated cache writes after idle gaps.
```yaml
agents:
  defaults:
    heartbeat:
      every: "55m"
```
Per-agent heartbeat is supported at `agents.list[].heartbeat`.
## Provider behavior

### Anthropic (direct API)
- `cacheRetention` is supported.
- With Anthropic API-key auth profiles, OpenClaw seeds `cacheRetention: "short"` for Anthropic model refs when unset.
- Anthropic native Messages responses expose both `cache_read_input_tokens` and `cache_creation_input_tokens`, so OpenClaw can show both `cacheRead` and `cacheWrite`.
- For native Anthropic requests, `cacheRetention: "short"` maps to the default 5-minute ephemeral cache, and `cacheRetention: "long"` upgrades to the 1-hour TTL only on direct `api.anthropic.com` hosts.
### OpenAI (direct API)
- Prompt caching is automatic on supported recent models. OpenClaw does not need to inject block-level cache markers.
- OpenClaw uses `prompt_cache_key` to keep cache routing stable across turns and uses `prompt_cache_retention: "24h"` only when `cacheRetention: "long"` is selected on direct OpenAI hosts.
- OpenAI responses expose cached prompt tokens via `usage.prompt_tokens_details.cached_tokens` (or `input_tokens_details.cached_tokens` on Responses API events). OpenClaw maps that to `cacheRead`.
- OpenAI does not expose a separate cache-write token counter, so `cacheWrite` stays `0` on OpenAI paths even when the provider is warming a cache.
- OpenAI returns useful tracing and rate-limit headers such as `x-request-id`, `openai-processing-ms`, and `x-ratelimit-*`, but cache-hit accounting should come from the usage payload, not from headers.
- In practice, OpenAI often behaves like an initial-prefix cache rather than Anthropic-style moving full-history reuse. Stable long-prefix text turns can land near a `4864` cached-token plateau in current live probes, while tool-heavy or MCP-style transcripts often plateau near `4608` cached tokens even on exact repeats.
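The normalization described in these two provider sections can be sketched as follows. The usage field names match the provider payloads quoted above; `normalizeCacheUsage` itself is an illustrative helper, not OpenClaw's actual internal API:

```typescript
// Sketch: map provider usage payloads onto normalized cacheRead / cacheWrite.
// Field names follow the Anthropic and OpenAI usage shapes described above;
// normalizeCacheUsage is a hypothetical helper for illustration only.
type CacheUsage = { cacheRead: number; cacheWrite: number };

function normalizeCacheUsage(provider: "anthropic" | "openai", usage: any): CacheUsage {
  if (provider === "anthropic") {
    // Anthropic Messages exposes both counters directly.
    return {
      cacheRead: usage.cache_read_input_tokens ?? 0,
      cacheWrite: usage.cache_creation_input_tokens ?? 0,
    };
  }
  // OpenAI exposes cached prompt tokens only (Chat Completions or Responses
  // shape); there is no cache-write counter, so cacheWrite is always 0 here.
  const cached =
    usage.prompt_tokens_details?.cached_tokens ??
    usage.input_tokens_details?.cached_tokens ??
    0;
  return { cacheRead: cached, cacheWrite: 0 };
}
```

This is why dashboards built on these numbers should treat `cacheWrite: 0` on OpenAI paths as "not reported" rather than "no cache warming happened".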
### Amazon Bedrock
- Anthropic Claude model refs (`amazon-bedrock/*anthropic.claude*`) support explicit `cacheRetention` pass-through.
- Non-Anthropic Bedrock models are forced to `cacheRetention: "none"` at runtime.
### OpenRouter Anthropic models

For `openrouter/anthropic/*` model refs, OpenClaw injects Anthropic `cache_control` on system/developer prompt blocks to improve prompt-cache reuse.

### Other providers

If the provider does not support this cache mode, `cacheRetention` has no effect.
## Tuning patterns

### Mixed traffic (recommended default)
Keep a long-lived baseline on your main agent, disable caching on bursty notifier agents:
```yaml
agents:
  defaults:
    model:
      primary: "anthropic/claude-opus-4-6"
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "long"
  list:
    - id: "research"
      default: true
      heartbeat:
        every: "55m"
    - id: "alerts"
      params:
        cacheRetention: "none"
```
### Cost-first baseline
- Set baseline `cacheRetention: "short"`.
- Enable `contextPruning.mode: "cache-ttl"`.
- Keep heartbeat below your TTL only for agents that benefit from warm caches.
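A sketch combining those three settings in one config. The `"55m"` heartbeat and the `research` agent id are assumed example values, chosen so the heartbeat sits under the `"1h"` TTL:

```yaml
agents:
  defaults:
    params:
      cacheRetention: "short"
    contextPruning:
      mode: "cache-ttl"
      ttl: "1h"
  list:
    - id: "research"
      heartbeat:
        every: "55m" # under the 1h TTL, so this agent's cache stays warm
```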
## Cache diagnostics
OpenClaw exposes dedicated cache-trace diagnostics for embedded agent runs.
### Live regression tests
OpenClaw keeps one combined live cache regression gate for repeated prefixes, tool turns, image turns, MCP-style tool transcripts, and an Anthropic no-cache control.
- `src/agents/live-cache-regression.live.test.ts`
- `src/agents/live-cache-regression-baseline.ts`
Run the narrow live gate with:
```sh
OPENCLAW_LIVE_TEST=1 OPENCLAW_LIVE_CACHE_TEST=1 pnpm test:live:cache
```
The baseline file stores the most recent observed live numbers plus the provider-specific regression floors used by the test. The runner also uses fresh per-run session IDs and prompt namespaces so previous cache state does not pollute the current regression sample.
These tests intentionally do not use identical success criteria across providers.
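The per-run isolation mentioned above can be sketched like this. The helper name `freshRunContext` and the `live-cache/...` namespace format are assumptions for illustration, not the runner's actual code:

```typescript
import { randomUUID } from "node:crypto";

// Sketch: give each live run a fresh session ID and prompt namespace so cache
// state left over from a previous run cannot inflate this run's hit rates.
// freshRunContext and the "live-cache" prefix are illustrative assumptions.
function freshRunContext(gate: string) {
  const runId = randomUUID();
  return {
    sessionId: `${gate}-${runId}`,
    // A unique namespace folded into the prompt changes the cacheable prefix,
    // so the provider cannot serve hits left over from an older run.
    promptNamespace: `live-cache/${gate}/${runId}`,
  };
}

// Two runs of the same gate never share a session or a prompt namespace.
const a = freshRunContext("stable-prefix");
const b = freshRunContext("stable-prefix");
```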
### Anthropic live expectations
- Expect explicit warmup writes via `cacheWrite`.
- Expect near-full history reuse on repeated turns because Anthropic cache control advances the cache breakpoint through the conversation.
- Current live assertions still use high hit-rate thresholds for stable, tool, and image paths.
### OpenAI live expectations
- Expect `cacheRead` only. `cacheWrite` remains `0`.
- Treat repeated-turn cache reuse as a provider-specific plateau, not as Anthropic-style moving full-history reuse.
- Current live assertions use conservative floor checks derived from observed live behavior on `gpt-5.4-mini`:
  - stable prefix: `cacheRead >= 4608`, hit rate `>= 0.90`
  - tool transcript: `cacheRead >= 4096`, hit rate `>= 0.85`
  - image transcript: `cacheRead >= 3840`, hit rate `>= 0.82`
  - MCP-style transcript: `cacheRead >= 4096`, hit rate `>= 0.85`
Fresh combined live verification on 2026-04-04 landed at:

- stable prefix: `cacheRead=4864`, hit rate `0.966`
- tool transcript: `cacheRead=4608`, hit rate `0.896`
- image transcript: `cacheRead=4864`, hit rate `0.954`
- MCP-style transcript: `cacheRead=4608`, hit rate `0.891`
Recent local wall-clock time for the combined gate was about 88s.
Why the assertions differ:
- Anthropic exposes explicit cache breakpoints and moving conversation-history reuse.
- OpenAI prompt caching is still exact-prefix sensitive, but the effective reusable prefix in live Responses traffic can plateau earlier than the full prompt.
- Because of that, comparing Anthropic and OpenAI by a single cross-provider percentage threshold creates false regressions.
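A minimal sketch of the provider-specific floor check this implies. The floor values are the OpenAI stable-prefix numbers quoted above; `checkCacheFloor`, the hit-rate formula, and the prompt-token sample size are assumptions rather than the live test's real code:

```typescript
// Sketch: per-provider regression floors instead of one cross-provider
// percentage threshold. checkCacheFloor is an illustrative helper; hit rate
// is assumed here to mean cached tokens over total prompt tokens.
type Floors = { minCacheRead: number; minHitRate: number };

function checkCacheFloor(
  observed: { cacheRead: number; promptTokens: number },
  floors: Floors,
): boolean {
  const hitRate =
    observed.promptTokens > 0 ? observed.cacheRead / observed.promptTokens : 0;
  return observed.cacheRead >= floors.minCacheRead && hitRate >= floors.minHitRate;
}

// OpenAI stable-prefix floors from the assertions listed above.
const openaiStablePrefix: Floors = { minCacheRead: 4608, minHitRate: 0.9 };

// promptTokens here is chosen so the ratio matches the reported 0.966 sample;
// the real per-run total is not part of this doc.
const warm = checkCacheFloor({ cacheRead: 4864, promptTokens: 5034 }, openaiStablePrefix);
const cold = checkCacheFloor({ cacheRead: 0, promptTokens: 5034 }, openaiStablePrefix);
```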
### `diagnostics.cacheTrace` config
```yaml
diagnostics:
  cacheTrace:
    enabled: true
    filePath: "~/.openclaw/logs/cache-trace.jsonl" # optional
    includeMessages: false # default true
    includePrompt: false # default true
    includeSystem: false # default true
```
Defaults:

- `filePath`: `$OPENCLAW_STATE_DIR/logs/cache-trace.jsonl`
- `includeMessages`: `true`
- `includePrompt`: `true`
- `includeSystem`: `true`
### Env toggles (one-off debugging)

- `OPENCLAW_CACHE_TRACE=1` enables cache tracing.
- `OPENCLAW_CACHE_TRACE_FILE=/path/to/cache-trace.jsonl` overrides the output path.
- `OPENCLAW_CACHE_TRACE_MESSAGES=0|1` toggles full message payload capture.
- `OPENCLAW_CACHE_TRACE_PROMPT=0|1` toggles prompt text capture.
- `OPENCLAW_CACHE_TRACE_SYSTEM=0|1` toggles system prompt capture.
### What to inspect
- Cache trace events are JSONL and include staged snapshots like `session:loaded`, `prompt:before`, `stream:context`, and `session:after`.
- Per-turn cache token impact is visible in normal usage surfaces via `cacheRead` and `cacheWrite` (for example `/usage full` and session usage summaries).
- For Anthropic, expect both `cacheRead` and `cacheWrite` when caching is active.
- For OpenAI, expect `cacheRead` on cache hits and `cacheWrite` to remain `0`; OpenAI does not publish a separate cache-write token field.
- If you need request tracing, log request IDs and rate-limit headers separately from cache metrics. OpenClaw's current cache-trace output is focused on prompt/session shape and normalized token usage rather than raw provider response headers.
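Since the trace is plain JSONL, per-run totals are easy to tally offline. A sketch, assuming each event carries a `usage` object with the normalized `cacheRead`/`cacheWrite` counters (the exact event schema is an assumption for illustration):

```typescript
// Sketch: tally normalized cache usage from a cache-trace JSONL stream.
// The event shape ({ stage, usage: { cacheRead, cacheWrite } }) is an
// assumed schema for illustration, not the documented trace format.
function summarizeCacheTrace(jsonl: string) {
  let cacheRead = 0;
  let cacheWrite = 0;
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // skip blank lines between events
    const event = JSON.parse(line);
    cacheRead += event.usage?.cacheRead ?? 0;
    cacheWrite += event.usage?.cacheWrite ?? 0;
  }
  return { cacheRead, cacheWrite };
}

// Two synthetic events: a warmup write, then a cache hit on the next turn.
const sample = [
  '{"stage":"stream:context","usage":{"cacheRead":0,"cacheWrite":4096}}',
  '{"stage":"stream:context","usage":{"cacheRead":4096,"cacheWrite":0}}',
].join("\n");
const totals = summarizeCacheTrace(sample);
```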
## Quick troubleshooting
- High `cacheWrite` on most turns: check for volatile system-prompt inputs and verify the model/provider supports your cache settings.
- High `cacheWrite` on Anthropic: often means the cache breakpoint is landing on content that changes every request.
- Low OpenAI `cacheRead`: verify the stable prefix is at the front, the repeated prefix is at least 1024 tokens, and the same `prompt_cache_key` is reused for turns that should share a cache.
- No effect from `cacheRetention`: confirm the model key matches `agents.defaults.models["provider/model"]`.
- Bedrock Nova/Mistral requests with cache settings: the runtime force to `none` is expected.
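For the 1024-token minimum in the OpenAI bullet, a rough sanity check can be sketched like this. The 4-characters-per-token ratio is a common heuristic, not an exact tokenizer, and `prefixLikelyCacheable` is an illustrative helper:

```typescript
// Sketch: rough check that a stable prompt prefix clears OpenAI's ~1024-token
// caching minimum. chars/4 is a coarse heuristic; use a real tokenizer for
// precise counts before relying on this in assertions.
function roughTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

function prefixLikelyCacheable(stablePrefix: string, minTokens = 1024): boolean {
  return roughTokenCount(stablePrefix) >= minTokens;
}

// A one-line prefix is clearly under the minimum; ~8000 characters is roughly
// 2000 tokens under this heuristic and comfortably clears it.
const tooShort = prefixLikelyCacheable("You are a helpful assistant.");
const longEnough = prefixLikelyCacheable("x".repeat(8000));
```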
Related docs: