feat(server, infra): make runner pool allocation and cold-boot risk observable

GitHub issue · Closed

Metadata
Source: tuist/tuist #11315
Updated: Jun 24, 2026
Domains: Compute

Details

What changed

Three coherent changes that make the Tuist-managed runner fleet’s pool allocation observable and fix two metrics that actively mislead.

1. Dispatch telemetry (server)

Tuist.Runners.dispatch_for_sa/2 collapsed three distinct failure reasons into a single :no_work_yet outcome: :empty (queue genuinely had nothing), :lost_race, and :pod_in_use (claim contention). The dispatch metric also carried no fleet tag. Together, these made a real per-shape dispatch stall indistinguishable from an idle warm pool just polling.

  • do_dispatch_for_sa now returns {result, fleet_name} with the granular reason preserved for telemetry. A new to_caller_result/1 collapses the no-work family back to :no_work_yet only at the boundary, so the HTTP contract (204 to the polling Pod) and all existing tests are untouched.
  • The dispatch count + latency metrics gain a :fleet tag (prom_ex_plugin.ex).

Outcomes are now served / drain / empty / lost_race / pod_in_use / …, sliceable per fleet.
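The collapse-at-the-boundary pattern can be sketched as follows. The server code is Elixir; this Go version is only an illustration of the idea. The type, the function names, and the fleet name are hypothetical, and the outcome list covers only the values named above.

```go
package main

import "fmt"

// Outcome is a granular dispatch result, as recorded in telemetry.
type Outcome string

const (
	Served    Outcome = "served"
	Drain     Outcome = "drain"
	Empty     Outcome = "empty"
	LostRace  Outcome = "lost_race"
	PodInUse  Outcome = "pod_in_use"
	NoWorkYet Outcome = "no_work_yet"
)

// toCallerResult collapses the no-work family into NoWorkYet and passes
// every other outcome through unchanged.
func toCallerResult(o Outcome) Outcome {
	switch o {
	case Empty, LostRace, PodInUse:
		return NoWorkYet
	default:
		return o
	}
}

// seriesKey is a hypothetical stand-in for the (fleet, outcome) label pair.
type seriesKey struct {
	fleet   string
	outcome Outcome
}

// record counts the granular outcome per fleet, then returns the collapsed
// value that the caller (and so the HTTP layer) would see.
func record(counts map[seriesKey]int, fleet string, o Outcome) Outcome {
	counts[seriesKey{fleet, o}]++
	return toCallerResult(o)
}

func main() {
	counts := map[seriesKey]int{}
	for _, o := range []Outcome{Empty, LostRace, PodInUse, Served} {
		fmt.Println(o, "->", record(counts, "fleet-a", o))
	}
	fmt.Println("distinct series:", len(counts))
}
```

Because the collapse happens after the counter is incremented, telemetry keeps one series per granular reason while the caller still sees a single no-work value.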

2. New allocation metrics (runners-controller)

alive/desired replicas is a weak signal: it only shows the pool converging to its target, never that the target itself was squeezed below the configured warm floor by the fleet allocator. That squeeze is the leading indicator for cold boots.

New internal/metrics package, emitted per pool from desiredForPool on the controller’s existing metrics endpoint:

| Metric | Meaning |
| --- | --- |
| tuist_runners_autoscaler_target_replicas | the pool’s full ask, pre-allocation (DesiredReplicas) |
| tuist_runners_autoscaler_allocated_replicas | what the allocator granted (= spec.replicas) |
| tuist_runners_autoscaler_min_warm_floor_replicas | configured minWarmPoolFloor |
| tuist_runners_autoscaler_warm_deficit_replicas | warm floor the allocator could not fund under contention; the cold-boot leading indicator |

warm_deficit = max(0, min(load+floor, target) − max(allocated, load)). Clamping allocated up to load keeps load starvation (a separate signal surfaced by Pending/unschedulable Pods) from being miscounted as warm-pool reaping. Series are cleared on pool delete / autoscaling opt-out.
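A minimal sketch of the formula, assuming integer replica counts and reading the clamp as "raise allocated to at least load" (per the Validation note). The signature and helper names are illustrative; the controller's actual warmDeficit may differ.

```go
package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// warmDeficit returns how many warm-floor replicas the allocator did not fund.
// Hypothetical signature: all four inputs are replica counts for one pool.
func warmDeficit(load, floor, target, allocated int) int {
	// The warm guarantee is load plus floor, capped at the pool's full ask,
	// so headroom above the floor never counts toward the deficit.
	guarantee := minInt(load+floor, target)
	// A grant below load is load starvation, not warm-pool reaping, so
	// allocated is clamped up to load before subtracting.
	funded := maxInt(allocated, load)
	return maxInt(0, guarantee-funded)
}

func main() {
	// load=3, floor=2, target=6 (one replica of headroom above the floor).
	fmt.Println(warmDeficit(3, 2, 6, 6)) // fully funded: 0
	fmt.Println(warmDeficit(3, 2, 6, 4)) // one floor replica reaped: 1
	fmt.Println(warmDeficit(3, 2, 6, 3)) // whole floor reaped: 2
	fmt.Println(warmDeficit(3, 2, 6, 1)) // load-starved too, still capped at the floor: 2
}
```

The last case is the one the clamp exists for: without it the result would be 4, mixing two replicas of load starvation into the warm-deficit signal.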

3. Dashboard (infra/grafana-dashboards/runners.json)

  • Added an env template variable (default production) and scoped the headline + capacity selectors with env="$env". This kills the cross-env max by (fleet) aliasing: the Helm release name is identical across canary/staging/production, so the fleet label collides and max by (fleet) silently merged environments — the root reason a “queued vs running” snapshot looked impossible. Dispatch panels are intentionally left unscoped because that metric’s env is collapsed to <aggregated> by Grafana Cloud Adaptive Metrics.
  • New “Allocation (cold-boot risk)” row charting warm deficit and target-vs-allocated by platform.
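The aliasing described in the first bullet can be reproduced with a toy model. The sketch below mimics a `max by (fleet)` aggregation in plain Go; it is not Grafana or PromQL code, and the fleet name and sample values are invented for illustration.

```go
package main

import "fmt"

// sample is one series value carrying env and fleet labels.
type sample struct {
	env   string
	fleet string
	value float64
}

// maxByFleet mimics `max by (fleet)`. When env is non-empty, only samples
// from that environment are considered, like adding env="$env" to a selector.
func maxByFleet(samples []sample, env string) map[string]float64 {
	out := map[string]float64{}
	for _, s := range samples {
		if env != "" && s.env != env {
			continue
		}
		if cur, ok := out[s.fleet]; !ok || s.value > cur {
			out[s.fleet] = s.value
		}
	}
	return out
}

func main() {
	// Same fleet label in every environment, because the release name is identical.
	queued := []sample{
		{env: "production", fleet: "runners", value: 0},
		{env: "staging", fleet: "runners", value: 7},
		{env: "canary", fleet: "runners", value: 2},
	}
	fmt.Println("unscoped:  ", maxByFleet(queued, "")["runners"])
	fmt.Println("production:", maxByFleet(queued, "production")["runners"])
}
```

Unscoped, the tile reports staging's value as if it belonged to production; scoping by env returns production's own value.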

Why

This came out of investigating a transient production backlog on the linux-2vcpu-8gb shape. The fleet allocator (allocate.go) deliberately reaps a pool’s speculative warm floor to fund other pools’ real queued work under shared-capacity contention. That tradeoff is sound, but it was invisible — no metric showed the warm floor being squeezed, and the dashboard’s env aliasing made the symptom look like a capacity or dispatch bug when it was neither. These changes make the allocation behavior legible and stop the headline tiles from lying.

Reviewer notes

  • The :no_work_yet HTTP behavior is unchanged on purpose; only the telemetry outcome is now granular. to_caller_result/1 is the single collapse point.
  • warm_deficit excludes headroom (the speculative p95 buffer above the floor) — only the warm guarantee counts toward cold-boot risk. See the new warmDeficit table test for the edge cases.
  • Adding fleet to the dispatch metric raises its series count to fleet × outcome (~O(10) × ~6), still bounded.
  • The lagging cold-start rate (fraction of claims that waited > ~10s) is derivable from the existing tuist_runners_job_queue_time_milliseconds histogram; a panel for it was deliberately not added pending confirmation that env survives on that histogram (it may be Adaptive-Metrics-aggregated like the dispatch counter).
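The derivation mentioned in the last note can be sketched as below, assuming the histogram exposes cumulative bucket counts and has a bucket boundary at or near 10000 ms (neither the bucket layout nor the function name comes from the PR).

```go
package main

import "fmt"

// coldStartRate returns the fraction of observations above a threshold,
// given the cumulative count of observations at or below that threshold
// (countLE) and the total observation count.
func coldStartRate(countLE, total uint64) float64 {
	if total == 0 || countLE >= total {
		return 0
	}
	return float64(total-countLE) / float64(total)
}

func main() {
	// Invented numbers: 940 of 1000 claims were queued for 10s or less.
	fmt.Printf("cold-start rate: %.2f\n", coldStartRate(940, 1000))
}
```

On a dashboard the same ratio would be computed over a rate window rather than on raw counters, so the value reflects recent claims only.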

Validation

  • runners-controller: go build, go vet, full go test, gofmt all clean. Added a warmDeficit table test (which caught a real bug — it was counting load starvation as warm deficit; fixed by clamping allocated up to load).
  • Dashboard JSON validated (parses; env var present; selectors env-scoped).
  • Server: both changed files parse clean. A full mix compile was not run — this git worktree has no deps/ fetched (separate from the main checkout); CI will compile-verify.

🤖 Generated with Claude Code

Comments
tuist-atlas[bot] Jun 17, 2026

The changes from this pull request are now available in xcresult-processor-image@0.24.1. Update to this version to use the improved runner pool allocation and cold-boot risk observability features.

tuist-atlas[bot] Jun 17, 2026

The changes from this pull request are now available in server@1.213.3. You can update to this version to start using the new runner pool allocation observability and cold-boot risk metrics.

Docker image: ghcr.io/tuist/tuist:1.213.3

tuist-atlas[bot] Jun 17, 2026

The changes to make runner pool allocation and cold-boot risk observable are now available. Update to runners-controller@0.12.1 to use this version.