TL;DR
Fixes the macOS runner queue clog. Empirical investigation (a spare Mac mini + live prod telemetry) showed the bottleneck is not VM boot time — it’s the image-pull wave on a digest roll (every node tart pulls the new ~tens-of-GB image at once, collapsing the warm pool), plus a secondary tart-kubelet stall failure mode. This PR addresses both, reverts the disproven boot-time hardening, and adds the observability to see it.
How we know
- On a spare Mac mini running the real
tuist-runner:macos-26-5 image: clone→IP ~9s, boot→runner-polling ~32s. Boot is fast, and Spotlight is already disabled in the base, so the original mdutil hardening was a no-op (reverted).
- Live prod: pods sit
Pending for minutes on Ready, free, no-pressure nodes with zero events between Scheduled and Running — and a 0.6.0 → 0.7.0 digest roll was in flight. The tart_kubelet_vm_boot_duration metric only covers tart run→IP, so pull/clone time was invisible. The “stalls for minutes, then recovers” signature is a finite pull completing, not boot.
What’s in this PR
1. Cap concurrent image rolls (the primary fix) — server + runners-controller + CRD
The old server drain was time-staggered (8 slots × 30s) but open-loop; 30s ≪ a multi-minute pull, so the whole fleet pulled at once. Replaced with a feedback-driven, controller-owned cap:
- CRD
spec.rollout.maxConcurrentPercent (default 5, min 1, max 100); cap = max(1, floor(pct/100 × replicas)).
- The controller marks stale Pods
tuist.dev/drain-eligible only up to the cap (counting in-flight = drain-eligible stale + current-image not-Ready), marking more only as rollers reach Ready. Controller-owned to avoid a server-side thundering-herd race.
- The server 410s a stale Pod only when it carries the label.
- New metrics:
tuist_runners_pool_rolling_pods, tuist_runners_pool_stale_pods, tuist_runners_pool_roll_concurrency_cap — so roll progress and a stuck roll (stale flat > 0 with rolling pinned) are visible.
2. tart-kubelet self-heal + observability (secondary stall failure mode)
- Per-op timeouts on
tart pull/clone/set/stop/delete/get/list — a hung op is killed and retried instead of wedging the node’s reconcile.
- Explicit requeue after the finalizer-add so a missed watch event can’t strand a Pod
Pending.
tart_kubelet_pod_provision_delay_seconds (pod-created → provisioning-start) + a CreatingVM event so the Scheduled→Running gap is no longer invisible.
3. Reverted + kept
- Reverted the first-boot image hardening (disproven).
- Kept the Runners dashboard changes (macOS ready/cold-booting split, utilisation-denominator fix, VM-boot panel).
Validation
go build/vet/test green incl. new unit tests (cap/readiness, watchdog); server compiles --warnings-as-errors and formats clean; the drain tests are rewritten to the label protocol; packer fmt clean. controller-gen was unavailable, so the CRD deepcopy + schema were hand-edited (mirroring autoscaling) — a maintainer should run make generate to confirm they match. Cannot integration-test the controller/node-agent off-cluster — needs canary validation through a real digest roll (watch tuist_runners_pool_rolling_pods stay ≤ ..._roll_concurrency_cap).
Follow-ups
- A dashboard panel + alert on the new roll metrics (PromQL:
rolling_pods > roll_concurrency_cap = cap bug; stale_pods > 0 flat = stuck roll). Metrics are exported now; the panel is cosmetic.
- Pre-pulling the new image fleet-wide before rolling pods (eliminate the pull from the critical path entirely) is the deeper optimization beyond capping concurrency.
🤖 Generated with Claude Code