feat(server): instrument runners dispatch path and extend Grafana dashboard
GitHub issue · Closed
Summary
Adds health telemetry for the Tuist Runners (managed GitHub Actions runners) feature and extends the existing Grafana dashboard from a VM-boot-only view into a full operational dashboard.
Builds on the dashboard scaffolding added in #10761 — that PR shipped the tailnet-scraped tart_kubelet_vm_boot_duration_seconds panels (host-side); this PR adds the server-side dispatch / lifecycle signals on top of it.
What’s emitted
Event metrics (via :telemetry.execute in the runners code, projected into Prometheus by Tuist.Runners.PromExPlugin):
- tuist_runners_job_enqueued_count{fleet} — webhook → CH enqueue.
- tuist_runners_job_claim_count{fleet, outcome} and tuist_runners_job_queue_time_milliseconds_bucket{fleet} — claim and queue wait.
- tuist_runners_job_running_count{fleet} and tuist_runners_job_queue_to_running_time_milliseconds_bucket{fleet} — JIT mint transition.
- tuist_runners_job_completed_count{fleet, conclusion} and tuist_runners_job_run_time_milliseconds_bucket{fleet} / ..._total_time_milliseconds_bucket{fleet} — terminal state with timing breakdown.
- tuist_runners_job_requeued_count{fleet} — release / stale recovery.
- tuist_runners_dispatch_request_count{outcome} and tuist_runners_dispatch_request_duration_milliseconds_bucket{outcome} — wraps Tuist.Runners.dispatch_for_sa/2 to time the dispatch endpoint from the polling Pod’s perspective.
- tuist_runners_webhook_count{action, outcome} — workflow_job deliveries bucketed by action and outcome (handles ignored, no_account, runners_disabled, etc.).
- tuist_runners_recovery_count{kind} — StaleClaimsWorker and OrphanedRunnersWorker output by kind (stale_claim, orphan_requeued, orphan_completed).
Polled gauges (30s cadence, each guarded against missing infra):
- tuist_runners_queue_length{fleet} — runner_jobs in status='queued' per fleet (CH).
- tuist_runners_claims_count{fleet, lifecycle_state} — active PG runner_claims per fleet × state.
- tuist_runners_pool_replicas_desired{fleet, dispatch_label} and tuist_runners_pool_replicas_observed{...} — RunnerPool spec.replicas vs status.observedReplicas.
- tuist_runners_accounts_enabled — accounts with runner_max_concurrent > 0.
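A minimal sketch of what "guarded against missing infra" could look like for one polled gauge. The module name, the event name, the `count` measurement key, and the `fetch`/`emit` callbacks are all hypothetical; in the server, `emit` would presumably be `:telemetry.execute/3` and `fetch` a ClickHouse query. It is an illustration under those assumptions, not the PR's actual polling code.

```elixir
defmodule GuardedPollSketch do
  @moduledoc "Hypothetical sketch; not the actual Tuist.Runners.PromExPlugin code."

  # `fetch` is expected to return {:ok, %{fleet => count}} or {:error, reason};
  # it may also raise when the backing store is unreachable.
  # `emit` stands in for :telemetry.execute/3 (event, measurements, metadata).
  def poll_queue_length(fetch, emit) do
    case safe(fetch) do
      {:ok, counts} ->
        Enum.each(counts, fn {fleet, count} ->
          emit.([:tuist, :runners, :queue_length], %{count: count}, %{fleet: fleet})
        end)

      {:error, _reason} ->
        # Missing infra: skip this tick instead of crashing the poller.
        :ok
    end
  end

  # Turns an exception raised by `fetch` into an {:error, exception} tuple.
  defp safe(fetch) do
    fetch.()
  rescue
    error -> {:error, error}
  end
end
```

With this shape, a tick where the store is down emits nothing, so the gauge goes stale rather than reporting a misleading zero.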
Cardinality discipline: fleet is the only fan-out tag on event metrics (bounded by RunnerPool CR count, currently 1). Per-account fan-out is deliberately not tagged on event metrics; account-level views are exposed as aggregate gauges only.
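To make the cardinality argument concrete: the number of series for one counter is bounded by the product of distinct values per label. The sketch below uses hypothetical counts (4 conclusions, 500 accounts) purely for illustration; only the single fleet comes from the description above.

```elixir
defmodule CardinalitySketch do
  # Upper bound on series for one metric: the product of the number of
  # distinct values each label can take.
  def series_upper_bound(label_value_counts) do
    label_value_counts
    |> Map.values()
    |> Enum.reduce(1, fn count, acc -> count * acc end)
  end
end

# Hypothetical label sizes: 1 fleet, 4 conclusions.
CardinalitySketch.series_upper_bound(%{fleet: 1, conclusion: 4})
# => 4

# Same metric if events were also tagged with a hypothetical 500 accounts.
CardinalitySketch.series_upper_bound(%{fleet: 1, conclusion: 4, account: 500})
# => 2000
```

Histogram metrics multiply this bound again by the number of buckets, which is why the per-account tag is kept off event metrics.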
Dashboard
infra/grafana-dashboards/runners.json extended from 4 panels to 32, organised into 7 sections:
- Overview — 5 stat tiles (queued, pre-mint, running, desired runners, accounts enabled) plus the existing Mac-minis-reporting tile.
- Throughput — enqueued/sec, completed/sec by conclusion (stacked, conclusion-coloured), claim+run rate, webhook deliveries.
- Latency — p50/p95/p99 timeseries for queue time, enqueue→running, run time, total time, plus a queue-time heatmap.
- Capacity — RunnerPool desired vs observed replicas, pool utilisation (running / observed), queue length per fleet, inflight claims split by lifecycle state.
- Dispatch endpoint — dispatch RPS by outcome (colour-coded served green / no_work_yet blue / errors red) and p50/p95/p99 latency.
- Recovery — recovery rate by kind to surface JIT-mint stalls and missed-completion webhooks.
- VM boot time — the three panels from #10761 preserved verbatim, gated by the existing $pool variable.
A new $fleet variable (RunnerPool name) drives the server-side panels; the existing $pool variable (Mac mini fleet) still drives the tart-kubelet panels.
How to test locally
- mix test test/tuist/runners/ — 46/46 passing, including the new Tuist.Runners.PromExPluginTest (5 tests covering each polling function plus the K8s-unavailable path).
- mix credo --strict clean for the new files.
- mix format --check-formatted clean.
- python3 -m json.tool infra/grafana-dashboards/runners.json to confirm the dashboard JSON is valid.
What this is NOT
- Doesn’t add alert rules. Alerts are managed in the Grafana Cloud UI, not in this repo. The dashboard panels carry threshold colours where useful so an operator can eyeball saturation.
- Doesn’t tag event metrics by account. Per-account utilisation is a separate UX surface; tagging events here would explode cardinality without serving the operational health view.
- Doesn’t change the runners data path. Every emission is a one-line :telemetry.execute after the existing side-effect; the dispatch endpoint timing wraps dispatch_for_sa/2 with System.monotonic_time and is a no-op on the happy path.
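The timing wrapper described in the last bullet could look roughly like the sketch below. The module name, the event name, the `duration` measurement key, and the result shapes mapped to outcomes are assumptions for illustration; `dispatch` stands in for a call to `Tuist.Runners.dispatch_for_sa/2` and `emit` for `:telemetry.execute/3`.

```elixir
defmodule DispatchTimingSketch do
  @moduledoc "Hypothetical sketch; not the actual Tuist.Runners code."

  # `dispatch` is a zero-arity function wrapping the real dispatch call.
  # `emit` is a 3-arity function (event, measurements, metadata).
  def timed_dispatch(dispatch, emit) do
    started_at = System.monotonic_time()
    result = dispatch.()
    elapsed = System.monotonic_time() - started_at

    emit.(
      [:tuist, :runners, :dispatch, :request],
      %{duration: System.convert_time_unit(elapsed, :native, :millisecond)},
      %{outcome: outcome(result)}
    )

    # The caller sees exactly what the wrapped function returned.
    result
  end

  # Assumed result shapes, chosen only to illustrate the outcome tag.
  defp outcome({:ok, nil}), do: :no_work_yet
  defp outcome({:ok, _job}), do: :served
  defp outcome(_other), do: :error
end
```

Because the wrapper returns the wrapped result unchanged, the caller's behaviour stays the same. In this sketch an exception raised by `dispatch` propagates without emitting an event.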