Hive
feat(infra): rework runners dashboard for clarity and correctness
GitHub issue · Closed
Source: tuist/tuist #11220
Updated: Jun 24, 2026
Domains: Compute
No linked issue: self-initiated cleanup of the Tuist Runners observability dashboard, prompted by a walk-through of every panel against what its metrics actually measure.
This reworks the Tuist Runners Grafana dashboard so that every panel is a real, working signal, plus a small supporting server metric. The dashboard goes from ~36 panels across 8 sections to 20 panels across 5.
Why
Panel by panel, a lot of the dashboard was empty, misleading, or redundant:
- Several panels were built on `status.observedReplicas`, which the runners-controller derives from the target (`spec.replicas - scaledDown`), so it shadows `desired` and can never show the gap it was meant to. The Overview “Unscheduled” tile and the “Unscheduled replicas” panel both read ~0 by construction, not because the fleet was healthy.
- Polling gauges were being double-counted: the server runs 2 replicas, both run the PromEx pollers, and Grafana Cloud scrapes per-pod, so `queue_length` / `claims_count` / `pool_replicas_*` each had two identical series per fleet and `sum(...)` doubled them. Headline tiles read ~2x the truth (e.g. “15 macOS running” on ~9 minis was really ~7-8).
- Pool utilisation divided running claims by `observedReplicas` (floored at 1), so it spiked to 900%.
- Whole sections were low-signal at the fleet’s low job volume (VM boot, per-second throughput rates) or measured the wrong thing (Pending pods are mostly the idle Linux warm pool, since the poll loop runs in an init container).
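The double-counting and the utilisation spike can be reproduced with a little arithmetic. The sketch below is illustrative only: the sample values, pod names, and helper functions are hypothetical, and it mimics `sum(...)` versus `sum(max by (fleet) (...))` in plain Python rather than running real PromQL.

```python
from collections import defaultdict

# Hypothetical scrape: two server pods each export the same polled gauge per fleet.
samples = [
    {"fleet": "macos", "pod": "server-0", "value": 8},
    {"fleet": "macos", "pod": "server-1", "value": 8},
    {"fleet": "linux", "pod": "server-0", "value": 3},
    {"fleet": "linux", "pod": "server-1", "value": 3},
]


def naive_sum(rows):
    """In the spirit of sum(metric): every per-pod series is added."""
    return sum(row["value"] for row in rows)


def deduped_sum(rows):
    """In the spirit of sum(max by (fleet) (metric)): one value per fleet."""
    per_fleet = defaultdict(int)  # gauge values are non-negative counts
    for row in rows:
        per_fleet[row["fleet"]] = max(per_fleet[row["fleet"]], row["value"])
    return sum(per_fleet.values())


def old_utilisation(running, observed_replicas):
    """Old panel maths: running claims over observedReplicas floored at 1."""
    return running / max(observed_replicas, 1)


print(naive_sum(samples))     # 22, twice the real total
print(deduped_sum(samples))   # 11
print(old_utilisation(9, 0))  # 9.0, i.e. 900%
```

With two identical series per fleet, the naive sum is exactly double the deduplicated one, and a denominator floored at 1 turns nine running claims into 900%.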
What changed
Server (feat(server))
- New `tuist_runners_pool_replicas_max` PromEx gauge per fleet = `max(spec.autoscaling.maxReplicas, spec.replicas)` (static pools fall back to their fixed replica count). Gives the dashboard a stable capacity ceiling so utilisation = `running / ceiling` reads as true saturation. Wired into the existing pool-replicas poll loop; test added for the autoscaling-ceiling path.
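As a rough illustration of the ceiling formula (the actual gauge lives in the Elixir PromEx plugin, not in Python), the sketch below computes the ceiling from a RunnerPool-like spec and derives utilisation from it. The dict shape and both function names are assumptions made for the example.

```python
from typing import Optional


def pool_replicas_max(spec: dict) -> int:
    """Capacity ceiling: max(spec.autoscaling.maxReplicas, spec.replicas).

    Static pools have no autoscaling block, so they fall back to replicas.
    """
    autoscaling = spec.get("autoscaling") or {}
    max_replicas = autoscaling.get("maxReplicas") or 0
    replicas = spec.get("replicas") or 0
    return max(max_replicas, replicas)


def utilisation(running: int, ceiling: int) -> Optional[float]:
    """running / ceiling, or None when there is no ceiling (panel shows no data)."""
    if ceiling <= 0:
        return None
    return running / ceiling


autoscaled = {"replicas": 2, "autoscaling": {"maxReplicas": 10}}  # hypothetical
static = {"replicas": 9}                                           # hypothetical

print(pool_replicas_max(autoscaled))  # 10
print(pool_replicas_max(static))      # 9
print(utilisation(9, pool_replicas_max(autoscaled)))  # 0.9
```

Because the denominator is the configured ceiling rather than a live replica count, the ratio stays at or below 1 as long as running claims do not exceed capacity.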
Dashboard (feat(infra))
- Queue-first layout: Capacity (queue length, running builds, RunnerPool replicas, pool utilisation) sits directly under the Overview tiles.
- `Platform` filter (linux / macos / All) wired into every per-fleet/per-pod query via a regex match on the fleet/pod name. Values are plain tokens with the wildcards in the query so Grafana’s `includeAll` regex-escaping does not break the match; `allValue` `.*` makes “All” a no-op. Removed the unused Server job / Fleet / Mac mini pool variables and their matchers.
- Multi-replica dedupe: all gauge panels and Overview tiles now `max by (fleet)` before aggregating; counter rates stay summed (correctly partitioned across pods).
- Semantics fixes: RunnerPool replicas plots desired vs alive pods (`kube_pod_status_phase` Pending+Running) instead of `observedReplicas`; Overview “Unscheduled” uses `kube_pod_status_unschedulable` with `or vector(0)` (reads 0 when healthy); corrected the Pending / claimed descriptions.
- Trimmed: VM boot section, Capacity-fit section, memory panel (linux allocatable is uncomputable because KSM does not export the custom `node.cluster.x-k8s.io/pool` label), Run/Total latency, and the Recovery chart (a permanently-empty “should be zero” box, better as an alert). Throughput collapsed to a single “Jobs accepted vs completed / sec” panel; Latency trimmed to Queue time + Claim to running with a wider `[30m]` window so sparse samples read as lines.
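The `Platform` filter behaviour can be sketched outside Grafana. The example below assumes the query wraps the variable as `.*<value>.*` and that fleet names contain the platform token; the fleet names are hypothetical, and Python's `re` stands in for Prometheus's RE2 engine, which behaves the same for patterns this simple.

```python
import re


def matcher(platform_value: str) -> str:
    """The variable holds a plain token; the wildcards live in the query."""
    return f".*{platform_value}.*"


def select(fleets, platform_value):
    """Return the fleets a regex label matcher would keep."""
    pattern = re.compile(matcher(platform_value))
    # Prometheus regex matchers are fully anchored, so fullmatch is the analogue.
    return [fleet for fleet in fleets if pattern.fullmatch(fleet)]


fleets = ["linux-warm-pool", "macos-minis"]  # hypothetical fleet names

print(select(fleets, "linux"))  # ['linux-warm-pool']
print(select(fleets, "macos"))  # ['macos-minis']
print(select(fleets, ".*"))     # both fleets: the "All" value is a no-op
```

Keeping the wildcards in the query means the variable values never contain regex metacharacters that escaping could mangle, while the explicit “All” value still expands to a pattern that matches everything.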
Known follow-ups (server-side, not in this PR)
- Pool utilisation shows No data until the new `tuist_runners_pool_replicas_max` metric is deployed to the environment feeding this Prometheus (server change ships first).
- `completed/completed` reads ~0 while runners are busy, which points at Tuist-accepted job completions not being attributed via the webhook path. Root cause is unconfirmed (it is not orphan-recovery; the Recovery counter shows zero). The authoritative pass/fail data lives in ClickHouse (`runner_jobs.conclusion`).
How to test locally
- Dashboard: `python3 -c "import json; json.load(open('infra/grafana-dashboards/runners.json'))"` validates. Load it in Grafana against `grafanacloud-tuist-prom` and confirm: the `Platform` filter narrows Capacity/Throughput/Latency to linux or macos (All is a no-op); Overview tiles read sensibly (Queued/Running are real counts, Pre-mint and Unscheduled read 0 green when healthy); gauge panels are no longer ~2x inflated.
- Server: `cd server && mix format --check-formatted lib/tuist/runners/prom_ex_plugin.ex test/tuist/runners/prom_ex_plugin_test.exs` and `mix test test/tuist/runners/prom_ex_plugin_test.exs`. Note: these were not run in the authoring worktree (server deps not bootstrapped there), so CI is the first run; both files parse and the test uses partial-map asserts that tolerate the added measurement.
- After deploy: Pool utilisation populates on the next 30s poll once the server emits `tuist_runners_pool_replicas_max`.
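For a slightly stronger local check than a bare `json.load`, a small script can also count panels. This is a sketch under assumptions: it presumes the dashboard JSON keeps panels in a top-level `panels` list with collapsed rows nesting their children under their own `panels` key, and the helper names are hypothetical.

```python
import json


def load_dashboard(path: str) -> dict:
    """Parse the dashboard file; raises json.JSONDecodeError if it is invalid."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_panels(dashboard: dict) -> int:
    """Count non-row panels, including children nested inside collapsed rows."""
    count = 0
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            count += len(panel.get("panels", []))
        else:
            count += 1
    return count


# In-memory stand-in for a dashboard, so the sketch runs without the repo checkout.
example = {
    "panels": [
        {"type": "row", "panels": [{"type": "stat"}, {"type": "timeseries"}]},
        {"type": "row"},
        {"type": "timeseries"},
    ]
}
print(count_panels(example))  # 3
```

Against the real file, `count_panels(load_dashboard("infra/grafana-dashboards/runners.json"))` gives a quick number to compare with the panel count stated in this description.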