feat(server, infra): make runner pool allocation and cold-boot risk observable
GitHub issue · Closed
What changed
Three coherent changes that make the Tuist-managed runner fleet’s pool allocation observable and fix two metrics that actively mislead.
1. Dispatch telemetry (server)
`Tuist.Runners.dispatch_for_sa/2` collapsed three distinct failure reasons (`:empty`, where the queue genuinely had nothing, plus `:lost_race` and `:pod_in_use`, both claim contention) into a single `:no_work_yet` outcome, and the dispatch metric carried no fleet tag. That combination made a real per-shape dispatch stall indistinguishable from an idle warm pool that was just polling.
- `do_dispatch_for_sa` now returns `{result, fleet_name}` with the granular reason preserved for telemetry. A new `to_caller_result/1` collapses the no-work family back to `:no_work_yet` only at the boundary, so the HTTP contract (204 to the polling Pod) and all existing tests are untouched.
- The dispatch count + latency metrics gain a `:fleet` tag (`prom_ex_plugin.ex`).

Outcomes are now served / drain / empty / lost_race / pod_in_use / …, sliceable per fleet.
2. New allocation metrics (runners-controller)
alive/desired replicas is a weak signal: it only shows the pool converging to its target, never that the target itself was squeezed below the configured warm floor by the fleet allocator. That squeeze is the leading indicator for cold boots.
New internal/metrics package, emitted per pool from desiredForPool on the controller’s existing metrics endpoint:
| metric | meaning |
|---|---|
| `tuist_runners_autoscaler_target_replicas` | the pool’s full ask, pre-allocation (`DesiredReplicas`) |
| `tuist_runners_autoscaler_allocated_replicas` | what the allocator granted (= `spec.replicas`) |
| `tuist_runners_autoscaler_min_warm_floor_replicas` | configured `minWarmPoolFloor` |
| `tuist_runners_autoscaler_warm_deficit_replicas` | warm floor the allocator could not fund under contention; the cold-boot leading indicator |

warm_deficit = max(0, min(load+floor, target) − allocated), clamped so load starvation (a separate signal surfaced by Pending/unschedulable Pods) is not miscounted as warm-pool reaping. Series are cleared on pool delete / autoscaling opt-out.
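
To make the formula above concrete, here is a minimal Go sketch of the calculation. It is not the controller's actual code: the function signature, parameter names, and the helpers `minInt`/`maxInt` are hypothetical, and the clamp is written as "raise allocated to at least load", which is one reading of the clamping described in this PR.

```go
package main

// minInt and maxInt are small local helpers (hypothetical names) so the
// sketch does not rely on the min/max builtins introduced in Go 1.21.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// warmDeficit returns how many warm-floor replicas were left unfunded.
// Assumed inputs: load is the replicas needed for real work, floor is the
// configured warm floor, target is the pool's full ask, and allocated is
// what the allocator granted.
func warmDeficit(load, floor, target, allocated int) int {
	// The warm guarantee is load plus floor, capped at the full ask, so any
	// headroom above the floor never counts toward the deficit.
	guarantee := minInt(load+floor, target)

	// Assumed clamp: an allocation below load is treated as covering load,
	// so a load shortfall is not reported as warm deficit.
	funded := maxInt(allocated, load)

	return maxInt(0, guarantee-funded)
}
```

Under these assumptions, a pool with load 4, floor 2, and target 8 reports a deficit of 1 when granted 5 replicas, and a deficit of 2 (not 4) when granted only 2, because the shortfall below load is a separate signal.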
3. Dashboard (infra/grafana-dashboards/runners.json)
- Added an `env` template variable (default production) and scoped the headline + capacity selectors with `env="$env"`. This kills the cross-env `max by (fleet)` aliasing: the Helm release name is identical across canary/staging/production, so the fleet label collides and `max by (fleet)` silently merged environments. That was the root reason a “queued vs running” snapshot looked impossible. Dispatch panels are intentionally left unscoped because that metric’s `env` is collapsed to `<aggregated>` by Grafana Cloud Adaptive Metrics.
- New “Allocation (cold-boot risk)” row charting warm deficit and target-vs-allocated by platform.
Why
This came out of investigating a transient production backlog on the linux-2vcpu-8gb shape. The fleet allocator (allocate.go) deliberately reaps a pool’s speculative warm floor to fund other pools’ real queued work under shared-capacity contention. That tradeoff is sound, but it was invisible — no metric showed the warm floor being squeezed, and the dashboard’s env aliasing made the symptom look like a capacity or dispatch bug when it was neither. These changes make the allocation behavior legible and stop the headline tiles from lying.
Reviewer notes
- The `:no_work_yet` HTTP behavior is unchanged on purpose; only the telemetry outcome is now granular. `to_caller_result/1` is the single collapse point.
- `warm_deficit` excludes headroom (the speculative p95 buffer above the floor); only the warm guarantee counts toward cold-boot risk. See the new `warmDeficit` table test for the edge cases.
- Adding `fleet` to the dispatch metric raises its series count to fleet × outcome (~O(10) × ~6), still bounded.
- The lagging cold-start rate (fraction of claims that waited > ~10s) is derivable from the existing `tuist_runners_job_queue_time_milliseconds` histogram; a panel for it was deliberately not added pending confirmation that `env` survives on that histogram (it may be Adaptive-Metrics-aggregated like the dispatch counter).
Validation
- runners-controller: `go build`, `go vet`, full `go test`, `gofmt` all clean. Added a `warmDeficit` table test (which caught a real bug: it was counting load starvation as warm deficit; fixed by clamping allocated up to load).
- Dashboard JSON validated (parses; `env` var present; selectors env-scoped).
- Server: both changed files parse clean. A full `mix compile` was not run because this git worktree has no `deps/` fetched (separate from the main checkout); CI will compile-verify.
🤖 Generated with Claude Code