feat(server): instrument runners dispatch path and extend Grafana dashboard

GitHub issue · Closed

Metadata

Source: tuist/tuist #10850
Updated: Jun 24, 2026
Domains: Compute

Details

Summary

Adds health telemetry for the Tuist Runners (managed GitHub Actions runners) feature and extends the existing Grafana dashboard from a VM-boot-only view into a full operational dashboard.

Builds on the dashboard scaffolding added in #10761 — that PR shipped the tailnet-scraped tart_kubelet_vm_boot_duration_seconds panels (host-side); this PR adds the server-side dispatch / lifecycle signals on top of it.

What’s emitted

Event metrics (via :telemetry.execute in the runners code, projected into Prometheus by Tuist.Runners.PromExPlugin):

tuist_runners_job_enqueued_count{fleet} — webhook → CH enqueue.
tuist_runners_job_claim_count{fleet, outcome} and tuist_runners_job_queue_time_milliseconds_bucket{fleet} — claim and queue wait.
tuist_runners_job_running_count{fleet} and tuist_runners_job_queue_to_running_time_milliseconds_bucket{fleet} — JIT mint transition.
tuist_runners_job_completed_count{fleet, conclusion} and tuist_runners_job_run_time_milliseconds_bucket{fleet} / ..._total_time_milliseconds_bucket{fleet} — terminal state with timing breakdown.
tuist_runners_job_requeued_count{fleet} — release / stale recovery.
tuist_runners_dispatch_request_count{outcome} and tuist_runners_dispatch_request_duration_milliseconds_bucket{outcome} — wraps Tuist.Runners.dispatch_for_sa/2 to time the dispatch endpoint from the polling Pod’s perspective.
tuist_runners_webhook_count{action, outcome} — workflow_job deliveries bucketed by action and outcome (handles ignored, no_account, runners_disabled, etc.).
tuist_runners_recovery_count{kind} — StaleClaimsWorker and OrphanedRunnersWorker output by kind (stale_claim, orphan_requeued, orphan_completed).

Polled gauges (30s cadence, each guarded against missing infra):

tuist_runners_queue_length{fleet} — runner_jobs in status='queued' per fleet (CH).
tuist_runners_claims_count{fleet, lifecycle_state} — active PG runner_claims per fleet × state.
tuist_runners_pool_replicas_desired{fleet, dispatch_label} and tuist_runners_pool_replicas_observed{...} — RunnerPool spec.replicas vs status.observedReplicas.
tuist_runners_accounts_enabled — accounts with runner_max_concurrent > 0.

Cardinality discipline: fleet is the only fan-out tag on event metrics (bounded by RunnerPool CR count, currently 1). Per-account fan-out is deliberately not tagged on event metrics; account-level views are exposed as aggregate gauges only.
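The cardinality argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not the PR's code: the bucket count and the number of accounts are hypothetical figures chosen only for illustration, and `series_count` is a made-up helper.

```python
from math import prod


def series_count(label_values: dict[str, int], buckets: int = 1) -> int:
    """Upper bound on the time series one metric can produce.

    label_values maps each label name to its number of distinct values;
    buckets is the number of bucket series (1 for a counter or gauge).
    """
    return prod(label_values.values()) * buckets


# Hypothetical figures: 1 fleet (as stated in the PR), 12 histogram buckets,
# and an assumed 500 accounts if events were tagged per account.
fleet_only = series_count({"fleet": 1}, buckets=12)
with_account = series_count({"fleet": 1, "account": 500}, buckets=12)
print(fleet_only, with_account)  # prints "12 6000"
```

Under these assumed numbers a single histogram grows from 12 bucket series to 6000 once an account label is added, which is the growth the aggregate-gauge approach avoids.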

Dashboard

infra/grafana-dashboards/runners.json is extended from 4 panels to 32, organised into 7 sections:

Overview — 5 stat tiles (queued, pre-mint, running, desired runners, accounts enabled) plus the existing Mac-minis-reporting tile.
Throughput — enqueued/sec, completed/sec by conclusion (stacked, conclusion-coloured), claim+run rate, webhook deliveries.
Latency — p50/p95/p99 timeseries for queue time, enqueue→running, run time, total time, plus a queue-time heatmap.
Capacity — RunnerPool desired vs observed replicas, pool utilisation (running / observed), queue length per fleet, inflight claims split by lifecycle state.
Dispatch endpoint — dispatch RPS by outcome (colour-coded served green / no_work_yet blue / errors red) and p50/p95/p99 latency.
Recovery — recovery rate by kind to surface JIT-mint stalls and missed-completion webhooks.
VM boot time — the three panels from #10761 preserved verbatim, gated by the existing $pool variable.

A new $fleet variable (RunnerPool name) drives the server-side panels; the existing $pool variable (Mac mini fleet) still drives the tart-kubelet panels.
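The p50/p95/p99 panels in the Latency section are derived from the ..._milliseconds_bucket series listed earlier. As a rough illustration of how a percentile is estimated from cumulative bucket counts, here is a sketch of the linear-interpolation approach that Prometheus-style histogram quantiles use. The bucket bounds and counts are hypothetical, and this is not the dashboard's actual query.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound_ms, cumulative_count) pairs sorted by
    bound, and its last entry must be the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # No finite upper bound to interpolate towards.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical queue-time buckets in milliseconds.
queue_time = [(100.0, 50), (500.0, 90), (1000.0, 99), (math.inf, 100)]
for q in (0.5, 0.95, 0.99):
    print(q, round(histogram_quantile(q, queue_time), 1))
```

Because the estimate interpolates within a bucket, its precision depends on how the bucket boundaries are chosen relative to the latencies of interest.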

How to test locally

mix test test/tuist/runners/ — 46/46 passing, including the new Tuist.Runners.PromExPluginTest (5 tests covering each polling function plus the K8s-unavailable path).
mix credo --strict clean for the new files.
mix format --check-formatted clean.
python3 -m json.tool infra/grafana-dashboards/runners.json to confirm the dashboard JSON is valid.
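Beyond the json.tool validity check, a reviewer may also want to see how many panels of each type the dashboard contains. The sketch below assumes the usual Grafana dashboard layout, with a top-level "panels" array in which collapsed rows carry their own nested "panels" list. `load_dashboard` and `count_panels` are hypothetical helper names, not part of the repository.

```python
import json
from pathlib import Path


def load_dashboard(path: str) -> dict:
    """Parse the dashboard file; raises json.JSONDecodeError if it is invalid."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def count_panels(dashboard: dict) -> dict[str, int]:
    """Count panels by type, descending into panels nested under rows."""
    counts: dict[str, int] = {}
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        panel_type = panel.get("type", "unknown")
        counts[panel_type] = counts.get(panel_type, 0) + 1
        stack.extend(panel.get("panels", []))
    return counts
```

For example, `count_panels(load_dashboard("infra/grafana-dashboards/runners.json"))` returns a mapping from panel type to count. Whether row headers count towards the 32 panels mentioned above is not stated in the PR, so the totals may need interpreting.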

What this is NOT

Doesn’t add alert rules. Alerts are managed in the Grafana Cloud UI, not in this repo. The dashboard panels carry threshold colours where useful so an operator can eyeball saturation.
Doesn’t tag event metrics by account. Per-account utilisation is a separate UX surface; tagging events here would explode cardinality without serving the operational health view.
Doesn’t change the runners data path. Every emission is a one-line :telemetry.execute after the existing side-effect; the dispatch endpoint timing wraps dispatch_for_sa/2 with System.monotonic_time and leaves the happy-path behaviour unchanged.
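The wrap-and-time pattern described above can be sketched in a language-neutral way. The following Python analogue is not the PR's Elixir code: `timed_dispatch`, the `emit` callback, and the rule that maps a result to an outcome are all assumptions made for illustration. Only the outcome names served and no_work_yet come from the dashboard description.

```python
import time
from typing import Any, Callable


def timed_dispatch(
    dispatch: Callable[..., Any],
    emit: Callable[[str, dict, dict], None],
    *args: Any,
) -> Any:
    """Call dispatch(*args), return its result untouched, and emit one timing event."""
    start = time.monotonic_ns()  # monotonic clock, immune to wall-clock jumps
    outcome = "error"
    try:
        result = dispatch(*args)
        # Assumed classification: None means no job was available.
        outcome = "no_work_yet" if result is None else "served"
        return result
    finally:
        duration_ms = (time.monotonic_ns() - start) / 1_000_000
        emit("dispatch_request", {"duration_ms": duration_ms}, {"outcome": outcome})


events: list[tuple[str, dict, dict]] = []
job = timed_dispatch(lambda sa: {"job_id": 1}, lambda *e: events.append(e), "sa-1")
print(job, events[-1][2]["outcome"])
```

The event is emitted in the finally block, so the wrapped call's return value or exception reaches the caller exactly as it would without the wrapper.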
