The heartbeat scheduler silently dies after the first batch because
run() has multiple exit paths that skip scheduleNext(). Once any path
returns without re-arming the timer, heartbeats stop permanently.
The most common trigger is the requests-in-flight early return, but
any unhandled rejection in the async chain between runOnce() and the
manual scheduleNext() calls can also kill the timer.
Fix: wrap the run() body in try/finally so scheduleNext() is called
on every exit path. All manual scheduleNext() calls inside are removed
— the finally block is the single re-arm point.
The requests-in-flight case is excluded via a flag because the wake
layer (heartbeat-wake.ts) already handles retry scheduling for this
case with DEFAULT_RETRY_MS. Calling scheduleNext() here would register
a 0ms timer that races with the wake layer's 1s retry.
Fixes#31139Fixes#45772
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HTTP 500 (Internal Server Error) was not triggering model fallback,
causing agents to fail outright instead of trying the next candidate.
This is inconsistent with TRANSIENT_HTTP_ERROR_CODES which already
includes 500. Aligns the direct status check with that constant.
Co-authored-by: Craig McWilliams <craigamcw@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Claude CLI backend uses `--output-format json`, which produces no
stdout until the entire request completes. When session context is large
(100K+ tokens) or API response is slow, the no-output watchdog timer
(max 180s for resume sessions) kills the process before it finishes,
resulting in "CLI produced no output for 180s and was terminated" errors.
Switch to `--output-format stream-json --verbose` so Claude CLI emits
NDJSON events throughout processing (init, assistant, rate_limit, result).
Each event resets the watchdog timer, which is the intended behavior —
the watchdog detects truly stuck processes, not slow-but-progressing ones.
Changes:
- cli-backends.ts: `json` → `stream-json --verbose`, `output: "jsonl"`
- helpers.ts: teach parseCliJsonl to extract text from Claude's
`{"type":"result","result":"..."}` NDJSON line
Note: `--verbose` is required for stream-json in `-p` (print) mode.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the CLI backend output mode is "text", sessionId was hardcoded to
undefined. This caused the fallback chain to store the OpenClaw internal
UUID as the CLI session ID. On resume, --resume was called with the
wrong UUID, resulting in "No conversation found with session ID".
Return resolvedSessionId instead of undefined so the correct CLI session
ID is persisted and resume works correctly.
codex-cli lacks systemPromptArg, so the system prompt is never
serialized into argv — making the not-toContain assertion pass
vacuously even on pre-fix code. Switch to claude-cli which defines
systemPromptArg ("--append-system-prompt") and add a positive
assertion that the user-supplied prompt IS present in argv.
Co-Authored-By: Dhiman's Agentic Suite <dhiman.seal@hotmail.com>
`runCliAgent()` unconditionally appended "Tools are disabled in this
session. Do not call tools." to `extraSystemPrompt` for every CLI
backend session. The intent was to prevent the LLM from calling
OpenClaw's embedded API tools (since CLI backends manage their own
tools natively). However, CLI agents like Claude Code interpret this
text as a blanket prohibition on ALL tools, including their own native
Bash, Read, and Write tools.
This caused silent failures across cron jobs, group chats, and DM
sessions when using any CLI backend: the agent would see the injected
text in the system prompt and refuse to execute tools, returning text
responses instead. Cron jobs reported `lastStatus: "ok"` despite the
agent failing to run scripts.
The fix removes the hardcoded string entirely. CLI backends already
receive `tools: []` (no OpenClaw embedded tools in the API call), so
the text was redundant at best.
Closes#44135
Co-Authored-By: Dhiman's Agentic Suite <dhiman.seal@hotmail.com>