fix(infra): fix the macOS runner queue bottleneck (image-roll pacing + tart-kubelet self-heal/observability)

Metadata

Source

tuist/tuist #11362

Updated

Jun 24, 2026

Domains

Compute

Details

TL;DR

Fixes the macOS runner queue clog. Empirical investigation (a spare Mac mini plus live prod telemetry) showed that the bottleneck is not VM boot time. It is the image-pull wave on a digest roll: every node runs tart pull on the new image (tens of GB) at once, collapsing the warm pool. A secondary failure mode is a tart-kubelet stall. This PR addresses both, reverts the disproven boot-time hardening, and adds the observability needed to see the problem.

How we know

On a spare Mac mini running the real tuist-runner:macos-26-5 image: clone→IP ~9s, boot→runner-polling ~32s. Boot is fast, and Spotlight is already disabled in the base, so the original mdutil hardening was a no-op (reverted).
Live prod: pods sit Pending for minutes on Ready, free, no-pressure nodes with zero events between Scheduled and Running — and a 0.6.0 → 0.7.0 digest roll was in flight. The tart_kubelet_vm_boot_duration metric only covers tart run→IP, so pull/clone time was invisible. The “stalls for minutes, then recovers” signature is a finite pull completing, not boot.

What’s in this PR

1. Cap concurrent image rolls (the primary fix) — `server` + `runners-controller` + CRD

The old server drain was time-staggered (8 slots × 30s) but open-loop; 30s ≪ a multi-minute pull, so the whole fleet pulled at once. Replaced with a feedback-driven, controller-owned cap:

CRD spec.rollout.maxConcurrentPercent (default 5, min 1, max 100); cap = max(1, floor(pct/100 × replicas)).
The controller marks stale Pods tuist.dev/drain-eligible only up to the cap, and marks more only as rollers reach Ready. In-flight rollers are counted as drain-eligible stale Pods plus current-image Pods that are not yet Ready. The cap is controller-owned to avoid a server-side thundering-herd race.
The server 410s a stale Pod only when it carries the label.
New metrics: tuist_runners_pool_rolling_pods, tuist_runners_pool_stale_pods, tuist_runners_pool_roll_concurrency_cap. These make roll progress visible, and also expose a stuck roll (stale_pods flat above 0 while rolling_pods stays pinned).
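The cap formula and the in-flight accounting above can be sketched in Go roughly as follows. This is a minimal illustration, not the controller's actual code: the Pod struct, rollCap, and nextToMark are hypothetical names, and it assumes the cap is computed against the pool's replica count.

```go
package main

import "fmt"

// rollCap follows the formula from the description:
// cap = max(1, floor(pct/100 x replicas)).
// Integer division floors for non-negative inputs.
func rollCap(maxConcurrentPercent, replicas int) int {
	c := maxConcurrentPercent * replicas / 100
	if c < 1 {
		return 1
	}
	return c
}

// Pod is a hypothetical, simplified view of a runner Pod.
type Pod struct {
	Name          string
	Stale         bool // still on the old image digest
	DrainEligible bool // carries the tuist.dev/drain-eligible label
	Ready         bool
}

// nextToMark returns the stale Pods that may be newly marked
// drain-eligible in this reconcile. In-flight rollers are
// drain-eligible stale Pods plus current-image Pods not yet Ready.
func nextToMark(pods []Pod, maxConcurrentPercent, replicas int) []string {
	inFlight := 0
	for _, p := range pods {
		if (p.Stale && p.DrainEligible) || (!p.Stale && !p.Ready) {
			inFlight++
		}
	}
	budget := rollCap(maxConcurrentPercent, replicas) - inFlight
	var marked []string
	for _, p := range pods {
		if budget <= 0 {
			break
		}
		if p.Stale && !p.DrainEligible {
			marked = append(marked, p.Name)
			budget--
		}
	}
	return marked
}

func main() {
	fmt.Println(rollCap(5, 10))  // 1 (floor gives 0, clamped to 1)
	fmt.Println(rollCap(5, 100)) // 5
	fmt.Println(rollCap(40, 5))  // 2

	pods := []Pod{
		{Name: "old-draining", Stale: true, DrainEligible: true, Ready: true},
		{Name: "new-booting", Stale: false, Ready: false},
		{Name: "old-a", Stale: true, Ready: true},
		{Name: "old-b", Stale: true, Ready: true},
		{Name: "old-c", Stale: true, Ready: true},
	}
	// Cap is 4 (20% of 20), 2 are in flight, so 2 more may be marked.
	fmt.Println(nextToMark(pods, 20, 20)) // [old-a old-b]
}
```

Because the budget is recomputed from observed Pod state on every reconcile, new Pods are marked only as earlier rollers reach Ready, which is the closed-loop behaviour the time-staggered drain lacked.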

2. tart-kubelet self-heal + observability (secondary stall failure mode)

Per-op timeouts on tart pull/clone/set/stop/delete/get/list — a hung op is killed and retried instead of wedging the node’s reconcile.
Explicit requeue after the finalizer-add so a missed watch event can’t strand a Pod Pending.
tart_kubelet_pod_provision_delay_seconds (pod-created → provisioning-start) + a CreatingVM event so the Scheduled→Running gap is no longer invisible.

3. Reverted + kept

Reverted the first-boot image hardening (disproven).
Kept the Runners dashboard changes (macOS ready/cold-booting split, utilisation-denominator fix, VM-boot panel).

Validation

go build/vet/test are green, including new unit tests (cap/readiness, watchdog); the server compiles with --warnings-as-errors and formats clean; the drain tests are rewritten to the label protocol; packer fmt is clean. controller-gen was unavailable, so the CRD deepcopy and schema were hand-edited (mirroring autoscaling); a maintainer should run make generate to confirm they match. The controller and node agent cannot be integration-tested off-cluster, so this needs canary validation through a real digest roll (watch that tuist_runners_pool_rolling_pods stays ≤ ..._roll_concurrency_cap).

Follow-ups

A dashboard panel + alert on the new roll metrics (PromQL: rolling_pods > roll_concurrency_cap = cap bug; stale_pods > 0 flat = stuck roll). Metrics are exported now; the panel is cosmetic.
Pre-pulling the new image fleet-wide before rolling pods (eliminate the pull from the critical path entirely) is the deeper optimization beyond capping concurrency.

🤖 Generated with Claude Code

Comments

fortmarek Jun 19, 2026

Thanks for the focused repros — all four addressed in d301ce6:

Pending Pods bypassed the cap. Confirmed: stale Pending Pods were reaped en masse before the cap logic ran. They now retire under the same roll budget as the macOS Running→410 path (idle Pending Pods are reaped directly, idle Running Pods are marked drain-eligible). The throttle moved ahead of the gap-fill so that a reaped Pod's current-image replacement is created in the same reconcile. Regression test added: 5 stale Pending + 40% cap → exactly 2 retired/tick, while 3 keep serving the old image.
Roll gauges leaked for deleted static pools. metrics.Clear now also runs on the controller’s NotFound (pool-gone) path, not just the autoscaler path.
rollout knob not rendered. runner-pool.yaml now renders spec.rollout from each pool/override/shape across all four RunnerPool blocks. Verified with helm template: a legacy pool with rollout.maxConcurrentPercent=20 and a 26.5 xcodeOverride with 7 both render; pools without an override render none (controller default 5 applies).
provision_delay double-counted on retries. Moved the observe to the success path (right before tart run), recorded once per Pod that reaches Run. It now spans pull+clone — the segment the boot histogram (which starts at tart run) can’t see — and the Help text reflects that.

go build/vet/test green on both modules incl. the new throttle test; helm template renders the knob.

tuist-atlas[bot] Jun 20, 2026

The fix for the macOS runner queue bottleneck (image-roll pacing plus tart-kubelet self-heal/observability) is now available in the xcresult-processor-image@0.26.4 release. Update your Tart image to ghcr.io/tuist/tuist-xcresult-processor:0.26.4 to pick up these changes.