mirror of https://github.com/openclaw/openclaw.git
fix: harden parallels smoke harness
parent 99e45eb3ba
commit 79be1e126a
@@ -30,6 +30,7 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 - Flow: fresh snapshot -> install npm package baseline -> smoke -> install current main tgz on the same guest -> smoke again.
 - Same-guest update verification should set the default model explicitly to `openai/gpt-5.4` before the agent turn and use a fresh explicit `--session-id` so old session model state does not leak into the check.
 - The aggregate npm-update wrapper must resolve the Linux VM with the same Ubuntu fallback policy as `parallels-linux-smoke.sh` before both fresh and update lanes. Treat any Ubuntu guest with major version `>= 24` as acceptable when the exact default VM is missing, preferring the closest version match. On Peter's current host today, missing `Ubuntu 24.04.3 ARM64` should fall back to `Ubuntu 25.10`.
+- On macOS same-guest update checks, restart the gateway after the npm upgrade before `gateway status` / `agent`; launchd can otherwise report a loaded service while the old process has exited and the fresh process is not RPC-ready yet.
 - On Windows same-guest update checks, restart the gateway after the npm upgrade before `gateway status` / `agent`; in-place global npm updates can otherwise leave stale hashed `dist/*` module imports alive in the running service.
 - For Windows same-guest update checks, prefer the done-file/log-drain PowerShell runner pattern over one long-lived `prlctl exec ... powershell -EncodedCommand ...` transport. The guest can finish successfully while the outer `prlctl exec` still hangs.
 - The Windows same-guest update helper should write stage markers to its log before long steps like tgz download and `npm install -g` so the outer progress monitor does not sit on `waiting for first log line` during healthy but quiet installs.
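The Ubuntu fallback rule in the bullets above (accept any Ubuntu guest with major version `>= 24` when the exact default VM is missing, preferring the closest version match) could be sketched roughly as follows. This is an illustration only, not the wrapper's actual code: `pick_ubuntu_vm`, `parse_version`, the name-parsing regex, and the "closest" metric (distance in major version first, then minor) are all assumptions.

```python
import re

# Hypothetical helper, not taken from the harness: choose a Linux VM name.
_VERSION_RE = re.compile(r"^Ubuntu (\d+)\.(\d+)")


def parse_version(name):
    """Return (major, minor) for names like 'Ubuntu 24.04.3 ARM64', else None."""
    m = _VERSION_RE.match(name)
    return (int(m.group(1)), int(m.group(2))) if m else None


def pick_ubuntu_vm(default_name, available, min_major=24):
    """Prefer the exact default; otherwise the closest Ubuntu >= min_major."""
    if default_name in available:
        return default_name
    want = parse_version(default_name)
    if want is None:
        return None
    candidates = []
    for name in available:
        v = parse_version(name)
        if v is not None and v[0] >= min_major:
            # Assumed distance metric: compare major first, then minor.
            distance = (abs(v[0] - want[0]), abs(v[1] - want[1]))
            candidates.append((distance, name))
    return min(candidates)[1] if candidates else None


vms = ["Windows 11", "Ubuntu 22.04 ARM64", "Ubuntu 25.10", "macOS Tahoe"]
print(pick_ubuntu_vm("Ubuntu 24.04.3 ARM64", vms))
```

With the VM list from the example, the missing `Ubuntu 24.04.3 ARM64` resolves to `Ubuntu 25.10`, matching the fallback described in the skill text; `Ubuntu 22.04 ARM64` is rejected because its major version is below 24.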
@@ -46,6 +47,7 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 - Preferred entrypoint: `pnpm test:parallels:macos`
 - Default to the snapshot closest to `macOS 26.3.1 latest`.
 - On Peter's Tahoe VM, `fresh-latest-march-2026` can hang in `prlctl snapshot-switch`; if restore times out there, rerun with `--snapshot-hint 'macOS 26.3.1 latest'` before blaming auth or the harness.
+- `parallels-macos-smoke.sh` now retries `snapshot-switch` once after force-stopping a stuck running/suspended guest. If Tahoe still times out after that recovery path, then treat it as a real Parallels/host issue and rerun manually.
 - The macOS smoke should include a dashboard load phase after gateway health: resolve the tokenized URL with `openclaw dashboard --no-open`, verify the served HTML contains the Control UI title/root shell, then open Safari and require an established localhost TCP connection from Safari to the gateway port.
 - If a packaged install regresses with `500` on `/`, `/healthz`, or `__openclaw/control-ui-config.json` after `fresh.install-main` or `upgrade.install-main`, suspect bundled plugin runtime deps resolving from the package root `node_modules` rather than `dist/extensions/*/node_modules`. Repro quickly with a real `npm pack`/global install lane before blaming dashboard auth or Safari.
 - `prlctl exec` is fine for deterministic repo commands, but use the guest Terminal or `prlctl enter` when installer parity or shell-sensitive behavior matters.
@@ -64,6 +66,7 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 - Use PowerShell only as the transport with `-ExecutionPolicy Bypass`, then call the `.cmd` shims from inside it.
 - Multi-word `openclaw agent --message ...` checks should call `& $openclaw ...` inside PowerShell, not `Start-Process ... -ArgumentList` against `openclaw.cmd`, or Commander can see split argv and throw `too many arguments for 'agent'`.
 - Windows installer/tgz phases now retry once after guest-ready recheck; keep new Windows smoke steps idempotent so a transport-flake retry is safe.
+- If a Windows retry sees the VM become `suspended` or `stopped`, resume/start it before the next `prlctl exec`; otherwise the second attempt just repeats the same `rc=255`.
 - Windows global `npm install -g` phases can stay quiet for a minute or more even when healthy; inspect the phase log before calling it hung, and only treat it as a regression once the retry wrapper or timeout trips.
 - Fresh Windows ref-mode onboard should use the same background PowerShell runner plus done-file/log-drain pattern as the npm-update helper, including startup materialization checks, host-side timeouts on short poll `prlctl exec` calls, and retry-on-poll-failure behavior for transient transport flakes.
 - Fresh Windows ref-mode agent verification should set `OPENAI_API_KEY` in the PowerShell environment before invoking `openclaw.cmd agent`, for the same pairing-required fallback reason as macOS.
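The done-file/log-drain pattern named in the Windows bullets above can be illustrated with a small sketch: a background worker appends to a log and creates a marker file when it finishes, while the outer monitor makes short polls, drains any new log bytes, and stops once the marker appears or a deadline passes. Everything here is an assumption for illustration: `drain_until_done`, the file names, and the use of local files in place of short `prlctl exec` polls against the guest.

```python
import os
import tempfile
import time


def drain_until_done(log_path, done_path, timeout_s=10.0, poll_s=0.05):
    """Poll a log file and a done marker; return (finished, collected_text)."""
    offset = 0
    chunks = []
    deadline = time.monotonic() + timeout_s
    while True:
        # Check the marker first: the worker writes its last log lines before
        # creating it, so the drain below is then guaranteed to be complete.
        done = os.path.exists(done_path)
        if os.path.exists(log_path):
            with open(log_path, "rb") as fh:
                fh.seek(offset)
                data = fh.read()
            offset += len(data)
            if data:
                chunks.append(data.decode("utf-8", errors="replace"))
        if done:
            return True, "".join(chunks)
        if time.monotonic() >= deadline:
            return False, "".join(chunks)
        time.sleep(poll_s)


# Simulate a worker that logged two stage markers and then finished.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "update.log")
    done_path = os.path.join(tmp, "update.done")
    with open(log_path, "w") as fh:
        fh.write("stage: download tgz\nstage: npm install -g\n")
    with open(done_path, "w") as fh:
        fh.write("0\n")
    finished, text = drain_until_done(log_path, done_path)
    print(finished, text.splitlines())
```

Because each poll is short and independent, a hung transport call costs one poll rather than the whole phase, which is the motivation the bullets give for avoiding a single long-lived `prlctl exec`.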
@@ -82,6 +85,7 @@ Use this skill for Parallels guest workflows and smoke interpretation. Do not lo
 - Fresh `main` tgz smoke still needs the latest-release installer first because the snapshot has no Node or npm before bootstrap.
 - This snapshot does not have a usable `systemd --user` session; managed daemon install is unsupported.
 - The Linux smoke now falls back to a manual `setsid openclaw gateway run --bind loopback --port 18789 --force` launch with `HOME=/root` and the provider secret exported, then verifies `gateway status --deep --require-rpc` when available.
+- The Linux manual gateway launch should wait for `gateway status --deep --require-rpc` inside the `gateway-start` phase; otherwise the first status probe can race the background bind and fail a healthy lane.
 - If Linux gateway bring-up fails, inspect `/tmp/openclaw-parallels-linux-gateway.log` in the guest phase logs first; the common failure mode is a missing provider secret in the launched gateway environment.
 
 ## Discord roundtrip
@@ -634,6 +634,20 @@ setsid sh -lc 'exec env OPENCLAW_HOME=/root OPENCLAW_STATE_DIR=/root/.openclaw O
 EOF
   )"
   guest_exec bash -lc "$cmd"
+
+  # On the Ubuntu guest the backgrounded process can bind a few seconds after
+  # the launch command returns. Keep the race inside gateway-start instead of
+  # failing the next phase with a false-negative RPC probe.
+  local deadline
+  deadline=$((SECONDS + TIMEOUT_GATEWAY_S))
+  while (( SECONDS < deadline )); do
+    if show_gateway_status_compat >/dev/null 2>&1; then
+      return 0
+    fi
+    sleep 2
+  done
+
+  return 1
 }
 
 show_gateway_status_compat() {
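The wait loop added to the gateway start in the hunk above is a deadline-bounded poll: retry a cheap readiness probe until it succeeds or the time budget runs out. A rough Python analogue follows. The harness itself does this in bash with `SECONDS` and `show_gateway_status_compat`; `wait_until` and `fake_probe` are hypothetical names used only for illustration.

```python
import time


def wait_until(probe, timeout_s, interval_s=2.0):
    """Call probe() until it returns True or timeout_s elapses. Returns bool."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False


# Simulated gateway that only becomes RPC-ready on the third probe.
attempts = {"n": 0}


def fake_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3


print(wait_until(fake_probe, timeout_s=5, interval_s=0.01))
```

Keeping the loop inside the start phase means a slow bind shows up as a start failure with its own timeout, instead of as a false-negative status probe in the following phase.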
@@ -474,6 +474,62 @@ wait_for_current_user() {
   return 1
 }
 
+host_timeout_exec() {
+  local timeout_s="$1"
+  shift
+  HOST_TIMEOUT_S="$timeout_s" python3 - "$@" <<'PY'
+import os
+import subprocess
+import sys
+
+timeout = int(os.environ["HOST_TIMEOUT_S"])
+args = sys.argv[1:]
+
+try:
+    completed = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
+except subprocess.TimeoutExpired as exc:
+    if exc.stdout:
+        sys.stdout.buffer.write(exc.stdout)
+    if exc.stderr:
+        sys.stderr.buffer.write(exc.stderr)
+    sys.stderr.write(f"host timeout after {timeout}s\n")
+    raise SystemExit(124)
+
+if completed.stdout:
+    sys.stdout.buffer.write(completed.stdout)
+if completed.stderr:
+    sys.stderr.buffer.write(completed.stderr)
+raise SystemExit(completed.returncode)
+PY
+}
+
+snapshot_switch_with_retry() {
+  local snapshot_id="$1"
+  local attempt rc status
+  rc=0
+  for attempt in 1 2; do
+    set +e
+    host_timeout_exec "$TIMEOUT_SNAPSHOT_S" prlctl snapshot-switch "$VM_NAME" --id "$snapshot_id" >/dev/null
+    rc=$?
+    set -e
+    if [[ $rc -eq 0 ]]; then
+      return 0
+    fi
+    # Tahoe occasionally gets stuck mid snapshot-switch and leaves the guest
+    # running or suspended. Reset that state and try once more before failing
+    # the whole lane.
+    warn "snapshot-switch attempt $attempt failed (rc=$rc)"
+    status="$(prlctl status "$VM_NAME" 2>/dev/null || true)"
+    [[ -n "$status" ]] && warn "vm status after snapshot-switch failure: $status"
+    if [[ "$status" == *" running" || "$status" == *" suspended" ]]; then
+      prlctl stop "$VM_NAME" --kill >/dev/null 2>&1 || true
+      wait_for_vm_status "stopped" || true
+    fi
+    sleep 3
+  done
+  return "$rc"
+}
+
 guest_current_user_exec() {
   prlctl exec "$VM_NAME" --current-user /usr/bin/env \
     PATH=/opt/homebrew/bin:/opt/homebrew/opt/node/bin:/opt/homebrew/sbin:/usr/bin:/bin:/usr/sbin:/sbin \
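The `host_timeout_exec` helper in the hunk above reports a host-side timeout as exit code 124 (the value GNU `timeout` also uses) and otherwise passes the child's return code through. The sketch below shows that convention in isolation so it can be run without Parallels. `run_with_timeout` is a hypothetical name, and the two child commands are stand-ins for `prlctl snapshot-switch`.

```python
import subprocess
import sys

TIMEOUT_RC = 124  # same value the helper above exits with on timeout


def run_with_timeout(args, timeout_s):
    """Run args; return (rc, stdout_bytes). rc is 124 if the deadline was hit."""
    try:
        done = subprocess.run(
            args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising; output may be None.
        return TIMEOUT_RC, exc.stdout or b""
    return done.returncode, done.stdout


# Stand-in children: one exits quickly with rc=3, one sleeps past the deadline.
fast = [sys.executable, "-c", "import sys; print('ok'); sys.exit(3)"]
slow = [sys.executable, "-c", "import time; time.sleep(30)"]

print(run_with_timeout(fast, timeout_s=10))
print(run_with_timeout(slow, timeout_s=0.5)[0])
```

A distinct timeout code lets a caller such as `snapshot_switch_with_retry` log `rc=124` and tell a hung `snapshot-switch` apart from one that failed on its own.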
@@ -551,7 +607,7 @@ guest_current_user_sh() {
 restore_snapshot() {
   local snapshot_id="$1"
   say "Restore snapshot $SNAPSHOT_HINT ($snapshot_id)"
-  prlctl snapshot-switch "$VM_NAME" --id "$snapshot_id" >/dev/null
+  snapshot_switch_with_retry "$snapshot_id" || die "snapshot switch failed for $VM_NAME"
   if [[ "$SNAPSHOT_STATE" == "poweroff" ]]; then
     wait_for_vm_status "stopped" || die "restored poweroff snapshot did not reach stopped state in $VM_NAME"
     say "Start restored poweroff snapshot $SNAPSHOT_NAME"
@@ -700,6 +700,15 @@ case "\$version" in
     ;;
 esac
 /opt/homebrew/bin/openclaw models set "$MODEL_ID"
+# Same-guest npm upgrades can leave launchd holding the old gateway process or
+# module graph briefly; wait for a fresh RPC-ready restart before the agent turn.
+/opt/homebrew/bin/openclaw gateway restart
+for _ in 1 2 3 4 5 6 7 8; do
+  if /opt/homebrew/bin/openclaw gateway status --deep --require-rpc >/dev/null 2>&1; then
+    break
+  fi
+  sleep 2
+done
 /opt/homebrew/bin/openclaw gateway status --deep --require-rpc
 /usr/bin/env "$API_KEY_ENV=$API_KEY_VALUE" /opt/homebrew/bin/openclaw agent --agent main --session-id parallels-npm-update-macos-$head_short --message "Reply with exact ASCII text OK only." --json
 EOF
@@ -445,6 +445,23 @@ EOF
   )"
 }
 
+ensure_vm_running_for_retry() {
+  local status
+  status="$(prlctl status "$VM_NAME" 2>/dev/null || true)"
+  case "$status" in
+    *" suspended")
+      # Some Windows guest transport drops leave the VM suspended between retry
+      # attempts; wake it before the next prlctl exec.
+      warn "VM suspended during retry path; resuming $VM_NAME"
+      prlctl resume "$VM_NAME" >/dev/null
+      ;;
+    *" stopped")
+      warn "VM stopped during retry path; starting $VM_NAME"
+      prlctl start "$VM_NAME" >/dev/null
+      ;;
+  esac
+}
+
 run_windows_retry() {
   local label="$1"
   local max_attempts="$2"
@@ -463,7 +480,12 @@ run_windows_retry() {
     fi
     warn "$label attempt $attempt failed (rc=$rc)"
     if (( attempt < max_attempts )); then
-      wait_for_guest_ready >/dev/null 2>&1 || true
+      if ! ensure_vm_running_for_retry >/dev/null 2>&1; then
+        :
+      fi
+      if ! wait_for_guest_ready >/dev/null 2>&1; then
+        :
+      fi
       sleep 5
     fi
   done
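The decision made by `ensure_vm_running_for_retry` amounts to a small mapping from the trailing state word of the `prlctl status` line to a recovery command. A sketch of that mapping as a pure function follows, which makes it easy to check without a VM. `recovery_command` is a hypothetical name, and the sample status lines assume a `VM <name> exist <state>` format; the harness only relies on the line ending in ` suspended` or ` stopped`.

```python
def recovery_command(status_line, vm_name):
    """Map a `prlctl status` line to the prlctl argv to run before a retry.

    Returns None when no recovery step is needed (for example, running).
    """
    # Shell command substitution strips trailing newlines; mirror that here.
    line = status_line.rstrip("\n")
    if line.endswith(" suspended"):
        return ["prlctl", "resume", vm_name]
    if line.endswith(" stopped"):
        return ["prlctl", "start", vm_name]
    return None


# Assumed sample lines; the exact wording of prlctl output is not confirmed here.
samples = [
    "VM Windows 11 exist suspended",
    "VM Windows 11 exist stopped",
    "VM Windows 11 exist running",
    "",
]
for line in samples:
    print(repr(line), "->", recovery_command(line, "Windows 11"))
```

An empty status (for example when `prlctl status` itself fails) maps to no action, which matches the `|| true` fallthrough in the bash helper.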