From 79be1e126aba89f66d1101df3eff9d53f7ee9665 Mon Sep 17 00:00:00 2001
From: Peter Steinberger
Date: Sat, 4 Apr 2026 14:18:07 +0100
Subject: [PATCH] fix: harden parallels smoke harness

---
 .../skills/openclaw-parallels-smoke/SKILL.md  |  4 ++
 scripts/e2e/parallels-linux-smoke.sh          | 14 +++++
 scripts/e2e/parallels-macos-smoke.sh          | 58 ++++++++++++++++++-
 scripts/e2e/parallels-npm-update-smoke.sh     |  9 +++
 scripts/e2e/parallels-windows-smoke.sh        | 24 +++++++-
 5 files changed, 107 insertions(+), 2 deletions(-)

diff --git a/.agents/skills/openclaw-parallels-smoke/SKILL.md b/.agents/skills/openclaw-parallels-smoke/SKILL.md
index 9d504796257..0e0a33ba374 100644
--- a/.agents/skills/openclaw-parallels-smoke/SKILL.md
+++ b/.agents/skills/openclaw-parallels-smoke/SKILL.md
@@ -30,6 +30,7 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 - Flow: fresh snapshot -> install npm package baseline -> smoke -> install current main tgz on the same guest -> smoke again.
 - Same-guest update verification should set the default model explicitly to `openai/gpt-5.4` before the agent turn and use a fresh explicit `--session-id` so old session model state does not leak into the check.
 - The aggregate npm-update wrapper must resolve the Linux VM with the same Ubuntu fallback policy as `parallels-linux-smoke.sh` before both fresh and update lanes. Treat any Ubuntu guest with major version `>= 24` as acceptable when the exact default VM is missing, preferring the closest version match. On Peter's current host today, missing `Ubuntu 24.04.3 ARM64` should fall back to `Ubuntu 25.10`.
+- On macOS same-guest update checks, restart the gateway after the npm upgrade before `gateway status` / `agent`; launchd can otherwise report a loaded service while the old process has exited and the fresh process is not RPC-ready yet.
 - On Windows same-guest update checks, restart the gateway after the npm upgrade before `gateway status` / `agent`; in-place global npm updates can otherwise leave stale hashed `dist/*` module imports alive in the running service.
 - For Windows same-guest update checks, prefer the done-file/log-drain PowerShell runner pattern over one long-lived `prlctl exec ... powershell -EncodedCommand ...` transport. The guest can finish successfully while the outer `prlctl exec` still hangs.
 - The Windows same-guest update helper should write stage markers to its log before long steps like tgz download and `npm install -g` so the outer progress monitor does not sit on `waiting for first log line` during healthy but quiet installs.
@@ -46,6 +47,7 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 - Preferred entrypoint: `pnpm test:parallels:macos`
 - Default to the snapshot closest to `macOS 26.3.1 latest`.
 - On Peter's Tahoe VM, `fresh-latest-march-2026` can hang in `prlctl snapshot-switch`; if restore times out there, rerun with `--snapshot-hint 'macOS 26.3.1 latest'` before blaming auth or the harness.
+- `parallels-macos-smoke.sh` now retries `snapshot-switch` once after force-stopping a stuck running/suspended guest. If Tahoe still times out after that recovery path, then treat it as a real Parallels/host issue and rerun manually.
 - The macOS smoke should include a dashboard load phase after gateway health: resolve the tokenized URL with `openclaw dashboard --no-open`, verify the served HTML contains the Control UI title/root shell, then open Safari and require an established localhost TCP connection from Safari to the gateway port.
 - If a packaged install regresses with `500` on `/`, `/healthz`, or `__openclaw/control-ui-config.json` after `fresh.install-main` or `upgrade.install-main`, suspect bundled plugin runtime deps resolving from the package root `node_modules` rather than `dist/extensions/*/node_modules`. Repro quickly with a real `npm pack`/global install lane before blaming dashboard auth or Safari.
 - `prlctl exec` is fine for deterministic repo commands, but use the guest Terminal or `prlctl enter` when installer parity or shell-sensitive behavior matters.
@@ -64,6 +66,7 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 - Use PowerShell only as the transport with `-ExecutionPolicy Bypass`, then call the `.cmd` shims from inside it.
 - Multi-word `openclaw agent --message ...` checks should call `& $openclaw ...` inside PowerShell, not `Start-Process ... -ArgumentList` against `openclaw.cmd`, or Commander can see split argv and throw `too many arguments for 'agent'`.
 - Windows installer/tgz phases now retry once after guest-ready recheck; keep new Windows smoke steps idempotent so a transport-flake retry is safe.
+- If a Windows retry sees the VM become `suspended` or `stopped`, resume/start it before the next `prlctl exec`; otherwise the second attempt just repeats the same `rc=255`.
 - Windows global `npm install -g` phases can stay quiet for a minute or more even when healthy; inspect the phase log before calling it hung, and only treat it as a regression once the retry wrapper or timeout trips.
 - Fresh Windows ref-mode onboard should use the same background PowerShell runner plus done-file/log-drain pattern as the npm-update helper, including startup materialization checks, host-side timeouts on short poll `prlctl exec` calls, and retry-on-poll-failure behavior for transient transport flakes.
 - Fresh Windows ref-mode agent verification should set `OPENAI_API_KEY` in the PowerShell environment before invoking `openclaw.cmd agent`, for the same pairing-required fallback reason as macOS.
@@ -82,6 +85,7 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 - Fresh `main` tgz smoke still needs the latest-release installer first because the snapshot has no Node or npm before bootstrap.
 - This snapshot does not have a usable `systemd --user` session; managed daemon install is unsupported.
 - The Linux smoke now falls back to a manual `setsid openclaw gateway run --bind loopback --port 18789 --force` launch with `HOME=/root` and the provider secret exported, then verifies `gateway status --deep --require-rpc` when available.
+- The Linux manual gateway launch should wait for `gateway status --deep --require-rpc` inside the `gateway-start` phase; otherwise the first status probe can race the background bind and fail a healthy lane.
 - If Linux gateway bring-up fails, inspect `/tmp/openclaw-parallels-linux-gateway.log` in the guest phase logs first; the common failure mode is a missing provider secret in the launched gateway environment.
 
 ## Discord roundtrip
diff --git a/scripts/e2e/parallels-linux-smoke.sh b/scripts/e2e/parallels-linux-smoke.sh
index 0f267144f67..86535b3fbf2 100644
--- a/scripts/e2e/parallels-linux-smoke.sh
+++ b/scripts/e2e/parallels-linux-smoke.sh
@@ -634,6 +634,20 @@ setsid sh -lc 'exec env OPENCLAW_HOME=/root OPENCLAW_STATE_DIR=/root/.openclaw O
 EOF
 )"
   guest_exec bash -lc "$cmd"
+
+  # On the Ubuntu guest the backgrounded process can bind a few seconds after
+  # the launch command returns. Keep the race inside gateway-start instead of
+  # failing the next phase with a false-negative RPC probe.
+  local deadline
+  deadline=$((SECONDS + TIMEOUT_GATEWAY_S))
+  while (( SECONDS < deadline )); do
+    if show_gateway_status_compat >/dev/null 2>&1; then
+      return 0
+    fi
+    sleep 2
+  done
+
+  return 1
 }
 
 show_gateway_status_compat() {
diff --git a/scripts/e2e/parallels-macos-smoke.sh b/scripts/e2e/parallels-macos-smoke.sh
index 7895c4c15e7..f29a0300d81 100644
--- a/scripts/e2e/parallels-macos-smoke.sh
+++ b/scripts/e2e/parallels-macos-smoke.sh
@@ -474,6 +474,62 @@ wait_for_current_user() {
   return 1
 }
 
+host_timeout_exec() {
+  local timeout_s="$1"
+  shift
+  HOST_TIMEOUT_S="$timeout_s" python3 - "$@" <<'PY'
+import os
+import subprocess
+import sys
+
+timeout = int(os.environ["HOST_TIMEOUT_S"])
+args = sys.argv[1:]
+
+try:
+    completed = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
+except subprocess.TimeoutExpired as exc:
+    if exc.stdout:
+        sys.stdout.buffer.write(exc.stdout)
+    if exc.stderr:
+        sys.stderr.buffer.write(exc.stderr)
+    sys.stderr.write(f"host timeout after {timeout}s\n")
+    raise SystemExit(124)
+
+if completed.stdout:
+    sys.stdout.buffer.write(completed.stdout)
+if completed.stderr:
+    sys.stderr.buffer.write(completed.stderr)
+raise SystemExit(completed.returncode)
+PY
+}
+
+snapshot_switch_with_retry() {
+  local snapshot_id="$1"
+  local attempt rc status
+  rc=0
+  for attempt in 1 2; do
+    set +e
+    host_timeout_exec "$TIMEOUT_SNAPSHOT_S" prlctl snapshot-switch "$VM_NAME" --id "$snapshot_id" >/dev/null
+    rc=$?
+    set -e
+    if [[ $rc -eq 0 ]]; then
+      return 0
+    fi
+    # Tahoe occasionally gets stuck mid snapshot-switch and leaves the guest
+    # running or suspended. Reset that state and try once more before failing
+    # the whole lane.
+    warn "snapshot-switch attempt $attempt failed (rc=$rc)"
+    status="$(prlctl status "$VM_NAME" 2>/dev/null || true)"
+    [[ -n "$status" ]] && warn "vm status after snapshot-switch failure: $status"
+    if [[ "$status" == *" running" || "$status" == *" suspended" ]]; then
+      prlctl stop "$VM_NAME" --kill >/dev/null 2>&1 || true
+      wait_for_vm_status "stopped" || true
+    fi
+    sleep 3
+  done
+  return "$rc"
+}
+
 guest_current_user_exec() {
   prlctl exec "$VM_NAME" --current-user /usr/bin/env \
     PATH=/opt/homebrew/bin:/opt/homebrew/opt/node/bin:/opt/homebrew/sbin:/usr/bin:/bin:/usr/sbin:/sbin \
@@ -551,7 +607,7 @@ guest_current_user_sh() {
 restore_snapshot() {
   local snapshot_id="$1"
   say "Restore snapshot $SNAPSHOT_HINT ($snapshot_id)"
-  prlctl snapshot-switch "$VM_NAME" --id "$snapshot_id" >/dev/null
+  snapshot_switch_with_retry "$snapshot_id" || die "snapshot switch failed for $VM_NAME"
   if [[ "$SNAPSHOT_STATE" == "poweroff" ]]; then
     wait_for_vm_status "stopped" || die "restored poweroff snapshot did not reach stopped state in $VM_NAME"
     say "Start restored poweroff snapshot $SNAPSHOT_NAME"
diff --git a/scripts/e2e/parallels-npm-update-smoke.sh b/scripts/e2e/parallels-npm-update-smoke.sh
index 4c30da197b1..5145f53f175 100755
--- a/scripts/e2e/parallels-npm-update-smoke.sh
+++ b/scripts/e2e/parallels-npm-update-smoke.sh
@@ -700,6 +700,15 @@ case "\$version" in
     ;;
 esac
 /opt/homebrew/bin/openclaw models set "$MODEL_ID"
+# Same-guest npm upgrades can leave launchd holding the old gateway process or
+# module graph briefly; wait for a fresh RPC-ready restart before the agent turn.
+/opt/homebrew/bin/openclaw gateway restart
+for _ in 1 2 3 4 5 6 7 8; do
+  if /opt/homebrew/bin/openclaw gateway status --deep --require-rpc >/dev/null 2>&1; then
+    break
+  fi
+  sleep 2
+done
 /opt/homebrew/bin/openclaw gateway status --deep --require-rpc
 /usr/bin/env "$API_KEY_ENV=$API_KEY_VALUE" /opt/homebrew/bin/openclaw agent --agent main --session-id parallels-npm-update-macos-$head_short --message "Reply with exact ASCII text OK only." --json
 EOF
diff --git a/scripts/e2e/parallels-windows-smoke.sh b/scripts/e2e/parallels-windows-smoke.sh
index 624d7b2781f..3488ba8392a 100644
--- a/scripts/e2e/parallels-windows-smoke.sh
+++ b/scripts/e2e/parallels-windows-smoke.sh
@@ -445,6 +445,23 @@ EOF
 )"
 }
 
+ensure_vm_running_for_retry() {
+  local status
+  status="$(prlctl status "$VM_NAME" 2>/dev/null || true)"
+  case "$status" in
+    *" suspended")
+      # Some Windows guest transport drops leave the VM suspended between retry
+      # attempts; wake it before the next prlctl exec.
+      warn "VM suspended during retry path; resuming $VM_NAME"
+      prlctl resume "$VM_NAME" >/dev/null
+      ;;
+    *" stopped")
+      warn "VM stopped during retry path; starting $VM_NAME"
+      prlctl start "$VM_NAME" >/dev/null
+      ;;
+  esac
+}
+
 run_windows_retry() {
   local label="$1"
   local max_attempts="$2"
@@ -463,7 +480,12 @@ run_windows_retry() {
     fi
     warn "$label attempt $attempt failed (rc=$rc)"
     if (( attempt < max_attempts )); then
-      wait_for_guest_ready >/dev/null 2>&1 || true
+      if ! ensure_vm_running_for_retry >/dev/null 2>&1; then
+        :
+      fi
+      if ! wait_for_guest_ready >/dev/null 2>&1; then
+        :
+      fi
       sleep 5
     fi
   done
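
Review note: the readiness waits in this patch (the `SECONDS`-based deadline loop in `parallels-linux-smoke.sh` and the fixed eight-attempt loop in `parallels-npm-update-smoke.sh`) share one shape: poll a probe until it succeeds or a time budget runs out. A minimal sketch of that shape as a reusable helper follows. `wait_until` and its arguments are hypothetical names chosen for illustration; they are not part of this patch or, as far as the diff shows, of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT part of the patch: poll a command until it
# succeeds or a deadline passes.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
wait_until() {
  local timeout_s="$1" interval_s="$2"
  shift 2
  local deadline=$((SECONDS + timeout_s))
  # Like the Linux loop in the patch, a timeout of 0 never runs the probe.
  while (( SECONDS < deadline )); do
    # Probe output is discarded; only the exit status matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval_s"
  done
  return 1
}
```

Under those assumptions, the Linux wait could be written as `wait_until "$TIMEOUT_GATEWAY_S" 2 show_gateway_status_compat`, which returns 0 as soon as the probe passes and 1 once the budget is spent. Whether sharing a helper across the per-OS scripts is worth it is a judgment call, since each script is currently self-contained.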