Summary
Four forward-only follow-ups from the runners-fleet incident in #10858. No in-code migration of legacy hosts.
1. /etc/pf.conf anchor block could end up duplicated
The prior idempotency check appended anchor "tuist.runners" to /etc/pf.conf unless a grep match was found. On k6kwt the block ended up appended twice; pfctl’s validate then died on cannot define table vm_sources: Resource busy (loading the same anchor twice tries to redeclare its tables).
Fix: marker-delimited atomic block management (# BEGIN tuist.runners / # END tuist.runners). Strip-and-append is idempotent and convergent.
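The strip-and-append step can be sketched as a pure function over the file contents. This is a minimal illustration, not the code in this PR: the function name upsertManagedBlock is hypothetical, and the atomic write (temp file plus rename) and the pfctl validation are assumed to happen around it.

```go
package main

import "strings"

const (
	beginMarker = "# BEGIN tuist.runners"
	endMarker   = "# END tuist.runners"
)

// upsertManagedBlock (hypothetical name) drops every existing
// marker-delimited block from conf, then appends exactly one fresh block
// containing body. Running it on its own output changes nothing.
func upsertManagedBlock(conf, body string) string {
	var kept []string
	inBlock := false
	for _, line := range strings.Split(conf, "\n") {
		switch strings.TrimSpace(line) {
		case beginMarker:
			inBlock = true
			continue // skip the marker line itself
		case endMarker:
			inBlock = false
			continue
		}
		if !inBlock {
			kept = append(kept, line)
		}
	}
	// Normalize trailing newlines so repeated runs produce identical bytes.
	base := strings.TrimRight(strings.Join(kept, "\n"), "\n")
	block := beginMarker + "\n" + strings.TrimRight(body, "\n") + "\n" + endMarker + "\n"
	if base == "" {
		return block
	}
	return base + "\n" + block
}
```

Because all existing blocks are stripped before the append, a file that already carries two copies converges to one on the next run, which a grep-then-append check cannot do.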
2. /etc/kcpassword silently replaced with <sealed> on password mismatch
When the password the bootstrap module XOR-encodes into /etc/kcpassword doesn’t match m1’s actual password, macOS Tahoe’s loginwindow rejects the auto-login attempt and overwrites /etc/kcpassword with the XOR-encoded literal string <sealed>. Auto-login never fires, no Aqua session starts, Tart’s ensure gui session times out at 30s, and every runner pod hits TartCreateFailed forever. The result was silent degradation across four production hosts today.
Fix: after writing kcpassword and the SIGHUP wait, decode the file as root and fail bootstrap if the first 8 bytes are <sealed> (via set -e).
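The sealed check amounts to XOR-decoding the file and comparing the first 8 bytes. Below is a rough Go sketch; the PR itself does this in shell under set -e. The key is the widely documented 11-byte repeating kcpassword key, treated here as an assumption, and the names xorKcpassword and isSealed are hypothetical.

```go
package main

import "bytes"

// kcpasswordKey is the commonly documented repeating XOR key for
// /etc/kcpassword. Assumption: macOS Tahoe still uses this key.
var kcpasswordKey = []byte{0x7D, 0x89, 0x52, 0x23, 0xD2, 0xBC, 0xDD, 0xEA, 0xA3, 0xB9, 0x1F}

// xorKcpassword applies the repeating key. XOR is its own inverse, so the
// same function both encodes and decodes.
func xorKcpassword(data []byte) []byte {
	out := make([]byte, len(data))
	for i, b := range data {
		out[i] = b ^ kcpasswordKey[i%len(kcpasswordKey)]
	}
	return out
}

// isSealed reports whether the decoded file content starts with the
// 8-byte literal "<sealed>" that loginwindow writes after a mismatch.
func isSealed(encoded []byte) bool {
	return bytes.HasPrefix(xorKcpassword(encoded), []byte("<sealed>"))
}
```

A caller would read /etc/kcpassword as root after the loginwindow restart has settled and fail bootstrap when isSealed returns true.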
3. launchctl kickstart instead of killall -HUP loginwindow
On k6kwt today we found loginwindow in a state where it had received SIGHUP from an earlier reconcile attempt and exited; launchd’s no-auto-respawn-after-SIGHUP policy for console-bound daemons left it dead. killall -HUP loginwindow then exits 1 with No matching processes were found — no process to kick. The pod-side error surfaced as tart run: ensure gui session: kick loginwindow: exit status 1.
Fix: launchctl kickstart -k system/com.apple.loginwindow talks to launchd’s service registry directly and handles both “process running, respawn it” and “process gone, spawn fresh” cases uniformly. Applied to both bootstrap.go’s enableAutoLogin and prepare-fleet-host.sh’s kcpassword step.
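On the Go side, the call could look roughly like the sketch below. The function names are hypothetical and the real enableAutoLogin may structure this differently; only the launchctl arguments come from the description above.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// kickstartLoginwindowArgs returns the launchctl arguments from the fix.
// -k kills a running instance first; kickstart also starts the service
// when no process exists.
func kickstartLoginwindowArgs() []string {
	return []string{"kickstart", "-k", "system/com.apple.loginwindow"}
}

// kickstartLoginwindow (hypothetical helper) runs launchctl and wraps any
// failure with the command output so the pod-side error is actionable.
func kickstartLoginwindow(ctx context.Context) error {
	out, err := exec.CommandContext(ctx, "launchctl", kickstartLoginwindowArgs()...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("kick loginwindow: %w: %s", err, out)
	}
	return nil
}
```

Unlike killall -HUP, this path does not depend on a loginwindow process already existing, so the "No matching processes were found" failure mode should not arise.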
4. tart-kubelet reconciler stuck 504 Succeeded pods in Terminating
infra/tart-kubelet/internal/podagent/reconciler.go had the wrong branch ordering. The terminal-phase early-return (PodSucceeded/PodFailed) sat above the DeletionTimestamp check. Once the runners-controller observed a Pod’s natural turnover (VM exited → Phase=Succeeded) and issued a Delete on it, both conditions held simultaneously — and the reconciler short-circuited at the terminal-phase check, never reaching the deletion branch that removes the tart-kubelet.tuist.dev/vm-cleanup finalizer.
Result: every successfully-completed runner Pod since 2026-05-15 sat in Terminating with the finalizer holding it open. By 2026-05-19 there were 504 stuck pods in the tuist-runners namespace. The runners-controller’s reap path correctly skipped them (DeletionTimestamp already set) and the controller didn’t know there was a problem — its log showed reaped=0 every reconcile.
Fix: swap the order. DeletionTimestamp first (drop the finalizer, force-complete the API object deletion); terminal-phase early-return after. Added a regression test (TestReconcileTerminalPodWithDeletionTimestampRemovesFinalizer) that fails when the order is swapped back — verified by temporarily reverting the fix and confirming the test reproduces the bug.
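The ordering bug is easiest to see in a reduced model. The sketch below uses simplified stand-in types rather than the real Kubernetes API objects, and every name in it (pod, decide, the action constants) is hypothetical; only the branch order mirrors the fix.

```go
package main

type podPhase string

const (
	phaseRunning   podPhase = "Running"
	phaseSucceeded podPhase = "Succeeded"
	phaseFailed    podPhase = "Failed"
)

// pod is a stand-in for the real Pod object. Deleting models
// "DeletionTimestamp is set".
type pod struct {
	Phase    podPhase
	Deleting bool
}

type action string

const (
	actionRemoveFinalizer action = "remove-finalizer"
	actionSkipTerminal    action = "skip-terminal"
	actionReconcile       action = "reconcile"
)

// decide mirrors the fixed branch order: deletion is checked first, so a
// pod that is both terminal and being deleted still has its finalizer
// removed. With the two checks swapped, such a pod would return
// actionSkipTerminal and stay in Terminating.
func decide(p pod) action {
	if p.Deleting {
		return actionRemoveFinalizer
	}
	if p.Phase == phaseSucceeded || p.Phase == phaseFailed {
		return actionSkipTerminal
	}
	return actionReconcile
}
```

The regression test in this PR exercises the same case as the first branch here: a terminal pod with a deletion timestamp must reach finalizer removal.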
Cleanup of existing stuck pods
This fix prevents new pods from getting stuck. The 504 already-stuck pods need a one-shot manual cleanup (separate from this PR):
kubectl -n tuist-runners get pods --field-selector status.phase=Succeeded \
-o json | jq -r '.items[] | select(.metadata.finalizers != null) | .metadata.name' \
| xargs -I{} kubectl -n tuist-runners patch pod {} \
--type=json -p='[{"op":"remove","path":"/metadata/finalizers"}]'
Once this PR’s tart-kubelet binary ships, future pods drain through cleanly without operator action.
Test plan
- go test ./... passes across both infra/macos-host-bootstrap and infra/tart-kubelet/internal/podagent.
- go vet clean. gofmt clean. bash -n clean on the script.
- Regression test for fix #4 verified by temporarily reverting reconciler.go and confirming the test fails with the bug present, passes with the fix.
- (blocking draft) Trigger a forced re-bootstrap on a healthy production host (e.g. by kubectl delete node) and confirm fixes #1-#3 run without regression.
- (blocking draft) After deploy, confirm fix #4: a Pod that hits Succeeded TartRunExited reaches deletion within a reconcile cycle instead of getting stuck in Terminating with the finalizer.
Context
See the #10858 description for the full incident write-up.