feat(infra): rework runners dashboard for clarity and correctness

Metadata

Source

tuist/tuist #11220

Updated

Jun 24, 2026

Domains

Compute

Details

No linked issue: self-initiated cleanup of the Tuist Runners observability dashboard, prompted by a walk-through of every panel against what its metrics actually measure.

This reworks the Tuist Runners Grafana dashboard so that every panel is a real, working signal, plus a small supporting server metric. The dashboard goes from ~36 panels across 8 sections to 20 panels across 5.

Why

Panel by panel, a lot of the dashboard was empty, misleading, or redundant:

Several panels were built on status.observedReplicas, which the runners-controller derives from the target (spec.replicas - scaledDown), so it shadows desired and can never show the gap it was meant to. The Overview “Unscheduled” tile and the “Unscheduled replicas” panel both read ~0 by construction, not because the fleet was healthy.
Polling gauges were double-counted. The server runs 2 replicas, both run the PromEx pollers, and Grafana Cloud scrapes per pod, so queue_length / claims_count / pool_replicas_* each had two identical series per fleet and sum(...) doubled them. Headline tiles read ~2x the truth (e.g. “15 macOS running” on ~9 minis was really ~7-8).
Pool utilisation divided running claims by observedReplicas (floored at 1), so it spiked to 900%.
Whole sections were low-signal at the fleet’s low job volume (VM boot, per-second throughput rates) or measured the wrong thing (Pending pods are mostly the idle Linux warm pool, since the poll loop runs in an init container).
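The double-counting can be illustrated with a minimal Python sketch. The pod names, fleet names, and values below are made up for illustration, and the functions only mimic the spirit of PromQL's sum(...) and max by (fleet) aggregations; they are not the dashboard's actual queries.

```python
# Hypothetical scrape result: both server replicas run the poller, so each
# fleet's gauge appears once per server pod with the same value.
samples = [
    {"pod": "server-0", "fleet": "macos", "value": 8},
    {"pod": "server-1", "fleet": "macos", "value": 8},
    {"pod": "server-0", "fleet": "linux", "value": 3},
    {"pod": "server-1", "fleet": "linux", "value": 3},
]


def naive_sum(samples):
    """Roughly sum(metric): every per-pod series is added, so duplicates double."""
    return sum(s["value"] for s in samples)


def max_by_fleet(samples):
    """Roughly max by (fleet) (metric): collapse duplicate series to one per fleet."""
    per_fleet = {}
    for s in samples:
        fleet, value = s["fleet"], s["value"]
        per_fleet[fleet] = max(per_fleet.get(fleet, value), value)
    return per_fleet


def deduped_sum(samples):
    """Dedupe per fleet first, then aggregate across fleets."""
    return sum(max_by_fleet(samples).values())


print(naive_sum(samples))    # 22, twice the real total
print(deduped_sum(samples))  # 11
```

Taking the max rather than the average keeps the result stable if one pod's poller briefly lags or reports nothing. Counters are a different case: each pod counts only the events it handled, so summing their rates remains correct.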

What changed

Server (`feat(server)`)

Adds a tuist_runners_pool_replicas_max PromEx gauge, reported per fleet as max(spec.autoscaling.maxReplicas, spec.replicas); static pools fall back to their fixed replica count. This gives the dashboard a stable capacity ceiling, so utilisation = running / ceiling reads as true saturation. The gauge is wired into the existing pool-replicas poll loop, and a test covers the autoscaling-ceiling path.
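The ceiling rule and its effect on utilisation can be sketched in Python. This is an illustration of the arithmetic only, not the server's Elixir implementation; the function names and the sample numbers are assumptions.

```python
from typing import Optional


def pool_replicas_max(spec_replicas: int, autoscaling_max: Optional[int]) -> int:
    """Capacity ceiling: max(maxReplicas, replicas), or replicas for static pools."""
    if autoscaling_max is None:
        # Static pool: no autoscaling block, so the fixed replica count is the ceiling.
        return spec_replicas
    return max(autoscaling_max, spec_replicas)


def utilisation(running: int, ceiling: int) -> Optional[float]:
    """New panel: running claims over the stable ceiling (None if no capacity)."""
    if ceiling <= 0:
        return None
    return running / ceiling


def old_utilisation(running: int, observed_replicas: int) -> float:
    """Old panel: denominator was observedReplicas floored at 1."""
    return running / max(observed_replicas, 1)


print(pool_replicas_max(4, 10))    # 10 (autoscaled pool)
print(pool_replicas_max(6, None))  # 6 (static pool)
print(old_utilisation(9, 0))       # 9.0, i.e. 900%
print(utilisation(9, 10))          # 0.9, i.e. 90%
```

With a denominator that cannot collapse toward the floor, the ratio stays at or below 100% as long as running claims do not exceed the ceiling.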

Dashboard (`feat(infra)`)

Queue-first layout: Capacity (queue length, running builds, RunnerPool replicas, pool utilisation) sits directly under the Overview tiles.
Platform filter (linux / macos / All) wired into every per-fleet/per-pod query via a regex match on the fleet/pod name. The variable values are plain tokens and the wildcards live in the query, so Grafana’s includeAll regex-escaping does not break the match; an allValue of .* makes “All” a no-op. Removed the unused Server job / Fleet / Mac mini pool variables and their matchers.
Multi-replica dedupe: all gauge panels and Overview tiles now max by (fleet) before aggregating; counter rates stay summed (correctly partitioned across pods).
Semantics fixes: RunnerPool replicas plots desired vs alive pods (kube_pod_status_phase Pending+Running) instead of observedReplicas; Overview “Unscheduled” uses kube_pod_status_unschedulable with or vector(0) (reads 0 when healthy); corrected the Pending / claimed descriptions.
Trimmed: VM boot section, Capacity-fit section, memory panel (linux allocatable is uncomputable because KSM does not export the custom node.cluster.x-k8s.io/pool label), Run/Total latency, and the Recovery chart (a permanently-empty “should be zero” box, better as an alert). Throughput collapsed to a single “Jobs accepted vs completed / sec” panel; Latency trimmed to Queue time + Claim to running with a wider [30m] window so sparse samples read as lines.
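The Platform filter behaviour described above can be sketched in Python. The fleet names are hypothetical, the matcher shape (wildcards wrapped around the variable value) is an assumption about how the queries are written, and re.escape is only a stand-in for Grafana's own escaping. re.fullmatch is used because Prometheus regex matchers are fully anchored.

```python
import re

ALL_VALUE = ".*"  # the variable's allValue, per the description above


def fleet_matcher(platform_value: str) -> str:
    """Wildcards live in the query; the variable contributes only a plain token."""
    return f".*{platform_value}.*"


def matches(platform_value: str, name: str) -> bool:
    """Anchored match, mirroring how a =~ label matcher is evaluated."""
    return re.fullmatch(fleet_matcher(platform_value), name) is not None


fleets = ["linux-default", "macos-minis"]  # hypothetical fleet names

print([f for f in fleets if matches("linux", f)])    # ['linux-default']
print([f for f in fleets if matches("macos", f)])    # ['macos-minis']
print([f for f in fleets if matches(ALL_VALUE, f)])  # both fleets: "All" is a no-op

# If the wildcards were part of the variable value instead, escaping would
# turn them into literals and the matcher would stop matching anything useful.
escaped = re.escape(".*linux.*")
print(re.fullmatch(escaped, "linux-default") is None)  # True
```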

Known follow-ups (server-side, not in this PR)

Pool utilisation shows No data until the new tuist_runners_pool_replicas_max metric is deployed to the environment feeding this Prometheus (server change ships first).
completed/completed reads ~0 while runners are busy, which points at Tuist-accepted job completions not being attributed via the webhook path. Root cause is unconfirmed (it is not orphan-recovery; the Recovery counter shows zero). The authoritative pass/fail data lives in ClickHouse (runner_jobs.conclusion).

How to test locally

Dashboard: python3 -c "import json; json.load(open('infra/grafana-dashboards/runners.json'))" confirms the file parses as JSON. Load it in Grafana against grafanacloud-tuist-prom and confirm: the Platform filter narrows Capacity/Throughput/Latency to linux or macos (All is a no-op); Overview tiles read sensibly (Queued/Running are real counts, Pre-mint and Unscheduled read 0 green when healthy); gauge panels are no longer ~2x inflated.
Server: cd server && mix format --check-formatted lib/tuist/runners/prom_ex_plugin.ex test/tuist/runners/prom_ex_plugin_test.exs and mix test test/tuist/runners/prom_ex_plugin_test.exs. Note: these were not run in the authoring worktree (server deps not bootstrapped there), so CI is the first run; both files parse and the test uses partial-map asserts that tolerate the added measurement.
After deploy: Pool utilisation populates on the next 30s poll once the server emits tuist_runners_pool_replicas_max.
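Beyond checking that the file parses, a slightly richer local check could count panels and sections. This is a sketch that assumes sections are modelled as Grafana row panels (type "row", with nested "panels" when the row is collapsed); the example dashboard below is made up, and the helper names are not part of the PR.

```python
def count_panels(dashboard: dict) -> int:
    """Count non-row panels, including those nested inside collapsed rows."""
    total = 0
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            # Collapsed rows carry their children in a nested "panels" list;
            # expanded rows have none and their children sit at the top level.
            total += len(panel.get("panels", []))
        else:
            total += 1
    return total


def section_titles(dashboard: dict) -> list:
    """Titles of the row panels, in dashboard order."""
    return [
        p.get("title", "")
        for p in dashboard.get("panels", [])
        if p.get("type") == "row"
    ]


# Made-up dashboard fragment, for illustration only.
example = {
    "panels": [
        {"type": "stat", "title": "Queued"},
        {"type": "row", "title": "Capacity", "panels": [
            {"type": "timeseries", "title": "Queue length"},
            {"type": "timeseries", "title": "Running builds"},
        ]},
        {"type": "row", "title": "Latency"},
        {"type": "timeseries", "title": "Queue time"},
    ]
}

print(count_panels(example))    # 4
print(section_titles(example))  # ['Capacity', 'Latency']
```

To run it against the real file, pass the result of json.load(open('infra/grafana-dashboards/runners.json')) to both helpers and compare with the panel and section counts stated in this description. The totals may differ if the Overview tiles are not grouped under a row.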

Comments

tuist-atlas[bot] Jun 12, 2026

This change is now available in version xcresult-processor-image@0.18.0. Update to this version to get these improvements.