feat(server): instrument runners dispatch path and extend Grafana dashboard

GitHub issue · Closed

Metadata
Source
tuist/tuist #10850
Updated
Jun 24, 2026
Domains
Compute
Details

Summary

Adds health telemetry for the Tuist Runners (managed GitHub Actions runners) feature and extends the existing Grafana dashboard from a VM-boot-only view into a full operational dashboard.

Builds on the dashboard scaffolding added in #10761 — that PR shipped the tailnet-scraped tart_kubelet_vm_boot_duration_seconds panels (host-side); this PR adds the server-side dispatch / lifecycle signals on top of it.

What’s emitted

Event metrics (via :telemetry.execute in the runners code, projected into Prometheus by Tuist.Runners.PromExPlugin):

  • tuist_runners_job_enqueued_count{fleet} — webhook → CH enqueue.
  • tuist_runners_job_claim_count{fleet, outcome} and tuist_runners_job_queue_time_milliseconds_bucket{fleet} — claim and queue wait.
  • tuist_runners_job_running_count{fleet} and tuist_runners_job_queue_to_running_time_milliseconds_bucket{fleet} — JIT mint transition.
  • tuist_runners_job_completed_count{fleet, conclusion} and tuist_runners_job_run_time_milliseconds_bucket{fleet} / ..._total_time_milliseconds_bucket{fleet} — terminal state with timing breakdown.
  • tuist_runners_job_requeued_count{fleet} — release / stale recovery.
  • tuist_runners_dispatch_request_count{outcome} and tuist_runners_dispatch_request_duration_milliseconds_bucket{outcome} — wraps Tuist.Runners.dispatch_for_sa/2 to time the dispatch endpoint from the polling Pod’s perspective.
  • tuist_runners_webhook_count{action, outcome} — workflow_job deliveries bucketed by action and outcome (handles ignored, no_account, runners_disabled, etc.).
  • tuist_runners_recovery_count{kind} — StaleClaimsWorker and OrphanedRunnersWorker output by kind (stale_claim, orphan_requeued, orphan_completed).

Polled gauges (30s cadence, each guarded against missing infra):

  • tuist_runners_queue_length{fleet} — runner_jobs in status='queued' per fleet (CH).
  • tuist_runners_claims_count{fleet, lifecycle_state} — active PG runner_claims per fleet × state.
  • tuist_runners_pool_replicas_desired{fleet, dispatch_label} and tuist_runners_pool_replicas_observed{...} — RunnerPool spec.replicas vs status.observedReplicas.
  • tuist_runners_accounts_enabled — accounts with runner_max_concurrent > 0.

Cardinality discipline: fleet is the only fan-out tag on event metrics (bounded by RunnerPool CR count, currently 1). Per-account fan-out is deliberately not tagged on event metrics; account-level views are exposed as aggregate gauges only.
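The cardinality argument can be illustrated with a rough upper bound: the number of series a metric can produce is at most the product of the distinct values of each label. The sketch below is illustrative only. The fleet count of 1 comes from the text above; the `conclusion` and `account` counts are hypothetical placeholders, not figures from this PR.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label.

    A metric with no labels yields exactly one series (prod of nothing is 1).
    """
    return prod(label_value_counts.values())


# fleet=1 is stated in the description; the other counts are hypothetical.
bounded = max_series({"fleet": 1, "conclusion": 4})
with_accounts = max_series({"fleet": 1, "conclusion": 4, "account": 5000})

print(bounded, with_accounts)
```

For histogram metrics the bound multiplies again by the number of buckets, which is why an unbounded label such as an account identifier is the costly one.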

Dashboard

infra/grafana-dashboards/runners.json extended from 4 panels to 32, organised into 7 sections:

  1. Overview — 5 stat tiles (queued, pre-mint, running, desired runners, accounts enabled) plus the existing Mac-minis-reporting tile.
  2. Throughput — enqueued/sec, completed/sec by conclusion (stacked, conclusion-coloured), claim+run rate, webhook deliveries.
  3. Latency — p50/p95/p99 timeseries for queue time, enqueue→running, run time, total time, plus a queue-time heatmap.
  4. Capacity — RunnerPool desired vs observed replicas, pool utilisation (running / observed), queue length per fleet, inflight claims split by lifecycle state.
  5. Dispatch endpoint — dispatch RPS by outcome (colour-coded served green / no_work_yet blue / errors red) and p50/p95/p99 latency.
  6. Recovery — recovery rate by kind to surface JIT-mint stalls and missed-completion webhooks.
  7. VM boot time — the three panels from #10761 preserved verbatim, gated by the existing $pool variable.

A new $fleet variable (RunnerPool name) drives the server-side panels; the existing $pool variable (Mac mini fleet) still drives the tart-kubelet panels.
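The panel queries themselves are not shown in this description. Assuming the p50/p95/p99 latency panels are derived from the `..._milliseconds_bucket` series with PromQL's `histogram_quantile`, the following simplified sketch shows the linear interpolation that function applies to cumulative buckets. The bucket bounds and counts are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("last bucket must have an upper bound of +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable for well-formed input


# Hypothetical queue-time buckets in milliseconds (cumulative counts).
sample = [(100.0, 10), (250.0, 60), (500.0, 90), (1000.0, 100), (math.inf, 100)]
for q in (0.5, 0.95, 0.99):
    print(q, histogram_quantile(q, sample))
```

With these sample buckets the estimates land at roughly 220 ms, 750 ms and 950 ms. The result is only as precise as the bucket boundaries, which is worth keeping in mind when reading the p99 line.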

How to test locally

  • mix test test/tuist/runners/ — 46/46 passing, including the new Tuist.Runners.PromExPluginTest (5 tests covering each polling function plus the K8s-unavailable path).
  • mix credo --strict clean for the new files.
  • mix format --check-formatted clean.
  • python3 -m json.tool infra/grafana-dashboards/runners.json to confirm the dashboard JSON is valid.
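Beyond syntactic validity, a reviewer may want to sanity-check the panel count. The sketch below is one possible way to do that, not part of the PR. It assumes the usual Grafana dashboard layout: a top-level `panels` array in which panels of type `row` may nest their children under their own `panels` key when collapsed. Whether the "32 panels" figure counts row headers is not stated, so this version excludes them.

```python
import json


def count_panels(dashboard: dict) -> int:
    """Count visualisation panels, skipping row headers but including
    panels nested inside collapsed rows."""
    count = 0
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            count += len(panel.get("panels", []))
        else:
            count += 1
    return count


def count_panels_in_file(path: str) -> int:
    """Load a dashboard JSON file and count its panels."""
    with open(path, encoding="utf-8") as fh:
        return count_panels(json.load(fh))
```

For example, `count_panels_in_file("infra/grafana-dashboards/runners.json")` would be run from the repository root.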

What this is NOT

  • Doesn’t add alert rules. Alerts are managed in the Grafana Cloud UI, not in this repo. The dashboard panels carry threshold colours where useful so an operator can eyeball saturation.
  • Doesn’t tag event metrics by account. Per-account utilisation is a separate UX surface; tagging events here would explode cardinality without serving the operational health view.
  • Doesn’t change the runners data path. Every emission is a one-line :telemetry.execute after the existing side-effect; the dispatch endpoint timing wraps dispatch_for_sa/2 with System.monotonic_time and is a no-op on the happy path.
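The timing wrapper described in the last bullet follows a common pattern: read a monotonic clock, call the wrapped function, then emit the elapsed time tagged with the outcome. The PR does this in Elixir with System.monotonic_time and :telemetry.execute. The Python sketch below is only an analogy of that pattern; the names `timed_dispatch`, `emit` and `classify` are hypothetical and do not appear in the codebase.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed_dispatch(
    dispatch: Callable[[], T],
    classify: Callable[[T], str],
    emit: Callable[[str, float], None],
) -> T:
    """Run `dispatch`, then emit (outcome, duration in ms) without altering the result."""
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock adjustments
    outcome = "error"  # reported if dispatch or classify raises
    try:
        result = dispatch()
        outcome = classify(result)
        return result
    finally:
        emit(outcome, (time.monotonic() - start) * 1000.0)


# Usage with a stand-in dispatch function and an in-memory event sink.
events: list[tuple[str, float]] = []
job = timed_dispatch(
    dispatch=lambda: {"job_id": 1},
    classify=lambda r: "served" if r else "no_work_yet",
    emit=lambda outcome, ms: events.append((outcome, ms)),
)
print(job, events[0][0])
```

The wrapped call's return value and exceptions pass through unchanged, so the instrumentation only observes the data path.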