fix(infra): four forward-only bootstrap fixes from the runners-fleet incident

GitHub issue · Closed

Metadata

Source: tuist/tuist #10875

Updated: Jun 24, 2026

Domains: Compute

Details

Summary

Four forward-only follow-ups from the runners-fleet incident in #10858. No in-code migration of legacy hosts.

1. `/etc/pf.conf` anchor block could end up duplicated

The prior idempotency check appended `anchor "tuist.runners"` to `/etc/pf.conf` unless a grep match was found. On k6kwt the block still ended up appended twice, and pfctl’s validation then died with `cannot define table vm_sources: Resource busy`, because loading the same anchor twice tries to redeclare its tables.

Fix: marker-delimited atomic block management (`# BEGIN tuist.runners` / `# END tuist.runners`). Each run strips every existing marked block and appends exactly one fresh block, so the result is idempotent and converges even from an already-duplicated file.
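The strip-and-append approach can be sketched roughly as follows. This is a minimal illustration, not the code in `bootstrap.go`: the function name `upsertManagedBlock` is hypothetical, the block body is reduced to the single `anchor` line, and the file is handled as an in-memory string rather than rewritten atomically on disk.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	beginMarker = "# BEGIN tuist.runners"
	endMarker   = "# END tuist.runners"
)

// upsertManagedBlock (hypothetical name) drops every marker-delimited block
// from conf and appends exactly one fresh block containing body.
// Caveat of this sketch: an unterminated BEGIN marker swallows the rest of
// the input, so a real implementation would want to reject that case.
func upsertManagedBlock(conf, body string) string {
	var kept []string
	inBlock := false
	for _, line := range strings.Split(conf, "\n") {
		switch trimmed := strings.TrimSpace(line); {
		case trimmed == beginMarker:
			inBlock = true
		case trimmed == endMarker && inBlock:
			inBlock = false
		case !inBlock:
			kept = append(kept, line)
		}
	}
	base := strings.TrimRight(strings.Join(kept, "\n"), "\n")
	block := beginMarker + "\n" + strings.TrimRight(body, "\n") + "\n" + endMarker + "\n"
	if base == "" {
		return block
	}
	return base + "\n" + block
}

func main() {
	body := "anchor \"tuist.runners\""
	once := upsertManagedBlock("set skip on lo0\n", body)
	twice := upsertManagedBlock(once, body)
	fmt.Print(twice)
	fmt.Println("idempotent:", once == twice)
}
```

Because every marked block is removed before one is appended, running the function on a file that already contains two marked copies yields the same output as running it on a clean file. Lines outside the markers are left untouched.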

2. `/etc/kcpassword` silently replaced with `<sealed>` on password mismatch

When the password the bootstrap module XOR-encodes into `/etc/kcpassword` doesn’t match m1’s actual password, macOS Tahoe’s loginwindow rejects the auto-login attempt and overwrites `/etc/kcpassword` with the XOR-encoded literal string `<sealed>`. The failure then cascades silently: auto-login never fires, no Aqua session starts, Tart’s `ensure gui session` step times out at 30s, and every runner pod hits `TartCreateFailed` indefinitely. Four production hosts degraded this way today.

Fix: after writing kcpassword and completing the SIGHUP wait, decode the file as root and fail bootstrap (via `set -e`) if the first 8 bytes are `<sealed>`.
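A sketch of what such a check might look like in Go is shown below. It assumes the widely documented 11-byte XOR key used for `/etc/kcpassword`; that key does not appear in this PR and should be verified against the actual bootstrap code. The names `xorKC`, `checkKCPassword`, and `errSealed` are hypothetical, and reading the file as root is left out so the example runs anywhere.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// kcKey is the commonly documented kcpassword XOR key (assumption, not taken
// from this PR).
var kcKey = []byte{0x7D, 0x89, 0x52, 0x23, 0xD2, 0xBC, 0xDD, 0xEA, 0xA3, 0xB9, 0x1F}

// xorKC applies the repeating key. XOR is symmetric, so the same function
// both encodes and decodes. Padding of the real file format is omitted here.
func xorKC(data []byte) []byte {
	out := make([]byte, len(data))
	for i, b := range data {
		out[i] = b ^ kcKey[i%len(kcKey)]
	}
	return out
}

var errSealed = errors.New("kcpassword was rewritten to <sealed>: auto-login password rejected")

// checkKCPassword reports an error when the decoded content starts with the
// 8-byte literal "<sealed>".
func checkKCPassword(raw []byte) error {
	if bytes.HasPrefix(xorKC(raw), []byte("<sealed>")) {
		return errSealed
	}
	return nil
}

func main() {
	fmt.Println(checkKCPassword(xorKC([]byte("<sealed>"))))
	fmt.Println(checkKCPassword(xorKC([]byte("example-password"))))
}
```

Returning an error instead of logging makes the bad state fail the bootstrap run loudly, which is the behaviour the fix is after.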

3. `launchctl kickstart` instead of `killall -HUP loginwindow`

On k6kwt today we found that loginwindow had received SIGHUP from an earlier reconcile attempt and exited; launchd’s no-auto-respawn-after-SIGHUP policy for console-bound daemons left it dead. With no process left to signal, `killall -HUP loginwindow` exits 1 with `No matching processes were found`. The pod-side error surfaced as `tart run: ensure gui session: kick loginwindow: exit status 1`.

Fix: launchctl kickstart -k system/com.apple.loginwindow talks to launchd’s service registry directly and handles both “process running, respawn it” and “process gone, spawn fresh” cases uniformly. Applied to both bootstrap.go’s enableAutoLogin and prepare-fleet-host.sh’s kcpassword step.

4. tart-kubelet reconciler left 504 Succeeded pods stuck in Terminating

infra/tart-kubelet/internal/podagent/reconciler.go had the wrong branch ordering. The terminal-phase early-return (PodSucceeded/PodFailed) sat above the DeletionTimestamp check. Once the runners-controller observed a Pod’s natural turnover (VM exited → Phase=Succeeded) and issued a Delete on it, both conditions held simultaneously — and the reconciler short-circuited at the terminal-phase check, never reaching the deletion branch that removes the tart-kubelet.tuist.dev/vm-cleanup finalizer.

Result: every successfully completed runner Pod since 2026-05-15 sat in Terminating with the finalizer holding it open. By 2026-05-19 there were 504 stuck pods in the tuist-runners namespace. The runners-controller’s reap path correctly skipped them (DeletionTimestamp already set), so the controller had no signal that anything was wrong: its log showed `reaped=0` on every reconcile.

Fix: swap the order. DeletionTimestamp first (drop the finalizer, force-complete the API object deletion); terminal-phase early-return after. Added a regression test (TestReconcileTerminalPodWithDeletionTimestampRemovesFinalizer) that fails when the order is swapped back — verified by temporarily reverting the fix and confirming the test reproduces the bug.
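The corrected ordering can be illustrated with a simplified model. The `Pod` struct, the `markDeleted` helper, and the string results below are stand-ins invented for this sketch; the real reconciler works on Kubernetes API objects and performs the finalizer removal through the API server.

```go
package main

import (
	"fmt"
	"time"
)

type Phase string

const (
	PodRunning   Phase = "Running"
	PodSucceeded Phase = "Succeeded"
	PodFailed    Phase = "Failed"
)

const vmCleanupFinalizer = "tart-kubelet.tuist.dev/vm-cleanup"

// Pod is a simplified stand-in for the Kubernetes Pod object.
type Pod struct {
	Phase             Phase
	DeletionTimestamp *time.Time
	Finalizers        []string
}

// markDeleted simulates the API server accepting a Delete request.
func markDeleted(p *Pod) {
	now := time.Now()
	p.DeletionTimestamp = &now
}

// reconcile checks deletion before the terminal-phase early return, so a
// Succeeded pod that is also being deleted still loses its finalizer.
func reconcile(p *Pod) string {
	if p.DeletionTimestamp != nil {
		kept := make([]string, 0, len(p.Finalizers))
		for _, f := range p.Finalizers {
			if f != vmCleanupFinalizer {
				kept = append(kept, f)
			}
		}
		p.Finalizers = kept
		return "finalizer-removed"
	}
	if p.Phase == PodSucceeded || p.Phase == PodFailed {
		return "terminal-noop"
	}
	return "sync"
}

func main() {
	pod := &Pod{Phase: PodSucceeded, Finalizers: []string{vmCleanupFinalizer}}
	markDeleted(pod)
	fmt.Println(reconcile(pod), len(pod.Finalizers)) // prints "finalizer-removed 0"
}
```

With the two `if` blocks in the opposite order, the same pod would return `terminal-noop` and keep its finalizer, which is the stuck-in-Terminating state described above.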

Cleanup of existing stuck pods

This fix prevents new pods from getting stuck. The 504 already-stuck pods need a one-shot manual cleanup (separate from this PR):

kubectl -n tuist-runners get pods --field-selector status.phase=Succeeded \
  -o json | jq -r '.items[] | select(.metadata.finalizers != null) | .metadata.name' \
  | xargs -I{} kubectl -n tuist-runners patch pod {} \
    --type=json -p='[{"op":"remove","path":"/metadata/finalizers"}]'

Once this PR’s tart-kubelet binary ships, future pods drain through cleanly without operator action.

Test plan

go test ./... passes across both infra/macos-host-bootstrap and infra/tart-kubelet/internal/podagent.
go vet clean. gofmt clean. bash -n clean on the script.
Regression test for fix #4 verified by temporarily reverting reconciler.go and confirming the test fails with the bug present, passes with the fix.
(blocking draft) Trigger a forced re-bootstrap on a healthy production host (e.g. by kubectl delete node) and confirm fixes #1-#3 run without regression.
(blocking draft) After deploy, confirm fix #4: a Pod that hits Succeeded TartRunExited reaches deletion within a reconcile cycle instead of getting stuck in Terminating with the finalizer.

Context

See the #10858 description for the full incident write-up.

Comments

pepicrft May 20, 2026

I found two follow-ups worth addressing:

infra/macos-host-bootstrap/bootstrap.go: the new marker-based pf.conf rewrite only removes marker-delimited blocks. Legacy hosts already have the old unmarked anchor "tuist.runners" / load anchor ... stanza, so the first re-bootstrap of an existing host will keep that stanza and append the new marked block. That recreates the double-load state and can still trip pfctl -nf /etc/pf.conf. I think this needs to strip or replace the pre-marker form too.
infra/mise/tasks/k8s/prepare-fleet-host.sh: the prep script got the launchctl kickstart change, but it still does not verify whether macOS rewrote /etc/kcpassword to <sealed>. Once m1-nopasswd already exists, a mistyped password can still rewrite kcpassword and the script exits green. The controller later says it may skip enableAutoLogin, so the manual recovery path still keeps the silent bad-kcpassword failure mode.

Approving to keep this moving, but I would like to close these gaps in a follow-up.