feat(server, infra): add machine metrics to Tuist Runners CI jobs
GitHub issue · Closed
What
Adds machine metrics — CPU, Memory, Network, CPU I/O Wait, and Storage — to the Tuist Runners CI job detail page, end to end: the runner-side collector that produces the samples, the ingestion endpoint and storage, and the dashboard that charts them.
- A new Metrics tab with a five-chart grid, and a CPU/Memory/Network row rendered inside the CI Details card on the Overview for an at-a-glance read.
- Step ↔ graph correlation: hovering a step on the Overview shades that step’s time window across every chart, so resource usage reads against the step that produced it.
- Charts use uniform percent axes (0/25/50/75/100) and straight point-to-point segments, so real spikes and dips are visible rather than smoothed away.
- An empty state for jobs that have no samples yet.
Why
Runner machine metrics have no GitHub-side source the way logs do. Logs are pulled from the Actions Logs API after a job completes; machine metrics describe the runner Pod/VM, which only our own infrastructure can observe. This change therefore reuses two existing pieces, the Noora ECharts rendering that the build machine-metrics charts already use and the runner_job_logs per-job time-series model, and adds the producer that only our infra can provide.
How
Producer — collector in the runners-controller (internal/podmetrics)
- A leader-only manager.Runnable on a --metrics-sample-interval cadence (default 15s; 0 disables). Each pass lists busy runner Pods (those carrying the tuist.dev/runner-pool-owner label), groups them by node, and reads each node's kubelet /stats/summary once through the apiserver node proxy (new nodes/proxy ClusterRole permission).
- Per Pod it builds a sample (CPU as a percentage of node-allocatable cores, memory working-set, ephemeral-storage used/total, and network rx/tx differenced into per-interval throughput) and POSTs it keyed by Pod name, reusing the --sessions-url base and the SA-token auth.
- It deliberately never learns job ids; the server resolves them (below). Sampling activates wherever --sessions-url is already configured, so the tab is functional on deploy rather than empty.
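The per-Pod arithmetic described above can be sketched roughly as follows. This is a minimal illustration, not the code in internal/podmetrics: the podCounters struct, the sampler type, and every function name are hypothetical, and it assumes CPU usage arrives in nanocores and network counters are cumulative byte totals.

```go
package main

import "fmt"

// podCounters is a hypothetical snapshot of what one pass reads for a Pod.
// Field names are illustrative, not the collector's actual types.
type podCounters struct {
	UnixMillis   int64  // when the snapshot was taken
	CPUNanoCores uint64 // CPU usage in nanocores (1e9 = one full core)
	RxBytes      uint64 // cumulative bytes received
	TxBytes      uint64 // cumulative bytes sent
}

// cpuPercent expresses CPU usage as a percentage of the node's allocatable cores.
func cpuPercent(usageNanoCores uint64, allocatableCores float64) float64 {
	if allocatableCores <= 0 {
		return 0
	}
	return float64(usageNanoCores) / (allocatableCores * 1e9) * 100
}

// throughput turns two cumulative counter readings into bytes per second.
// It reports ok=false when the interval is not positive or the counter went
// backwards, so the caller can skip the value instead of emitting garbage.
func throughput(prev, cur uint64, elapsedMillis int64) (bytesPerSec float64, ok bool) {
	if elapsedMillis <= 0 || cur < prev {
		return 0, false
	}
	return float64(cur-prev) / (float64(elapsedMillis) / 1000), true
}

// sampler remembers the previous counters per Pod so that consecutive passes
// can be differenced.
type sampler struct {
	prev map[string]podCounters
}

func newSampler() *sampler {
	return &sampler{prev: map[string]podCounters{}}
}

// observe records the current counters and returns rx/tx throughput relative
// to the previous pass. ok is false the first time a Pod is seen.
func (s *sampler) observe(pod string, cur podCounters) (rxPerSec, txPerSec float64, ok bool) {
	last, seen := s.prev[pod]
	s.prev[pod] = cur
	if !seen {
		return 0, 0, false
	}
	elapsed := cur.UnixMillis - last.UnixMillis
	rx, rxOK := throughput(last.RxBytes, cur.RxBytes, elapsed)
	tx, txOK := throughput(last.TxBytes, cur.TxBytes, elapsed)
	if !rxOK || !txOK {
		return 0, 0, false
	}
	return rx, tx, true
}

// prune forgets Pods that were not listed in the current pass.
func (s *sampler) prune(live map[string]bool) {
	for pod := range s.prev {
		if !live[pod] {
			delete(s.prev, pod)
		}
	}
}

func main() {
	s := newSampler()
	s.observe("runner-a", podCounters{UnixMillis: 0, RxBytes: 1000, TxBytes: 500})
	rx, tx, ok := s.observe("runner-a", podCounters{UnixMillis: 15000, RxBytes: 16000, TxBytes: 2000})
	fmt.Println(cpuPercent(500_000_000, 2), rx, tx, ok)
}
```

Keeping the previous counters in memory means the first pass for a Pod yields no network value, which is why the sketch returns ok=false rather than a zero that would chart as a real reading.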
Ingestion — Pod-name-keyed, server resolves the job
- POST /api/internal/runners/pods/:pod_name/metrics, authenticated with the runners-controller's in-cluster ServiceAccount token (Kubernetes TokenReview + principal check). This is the same boundary as pods/stopped, extracted into TuistWeb.RunnerControllerAuth (and pods/stopped migrated onto it).
- The endpoint is keyed by Pod name because the collector knows Pods, not jobs; that mapping lives in runner_claims. The server resolves the Pod's live claim to workflow_job_id + account_id (Claims.by_pod_name/1), the same shape as pods/stopped. An unclaimed Pod (idle/warm, or its job already released the claim) is a 204 no-op, since the collector samples every busy Pod.
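The resolution step can be modelled in a few lines. The sketch below is written in Go purely for illustration (the real endpoint lives in the Elixir server), and the server, claim, sample, and outcome types are hypothetical stand-ins; the integer id types and the in-memory maps are assumptions, not the actual schema.

```go
package main

import "fmt"

// claim holds the two identifiers a live claim resolves to.
// The int64 types are an assumption made for this sketch.
type claim struct {
	WorkflowJobID int64
	AccountID     int64
}

// sample is a hypothetical, trimmed-down metrics sample.
type sample struct {
	UnixMillis int64
	CPUPercent float64
}

type outcome int

const (
	noLiveClaim outcome = iota // corresponds to the 204 no-op for unclaimed Pods
	recorded                   // samples were stored under the claimed job
)

// server is an in-memory stand-in for runner_claims and the metrics table.
type server struct {
	claimsByPod  map[string]claim
	samplesByJob map[int64][]sample
}

// ingest resolves the Pod's live claim and stores the batch under its job.
// A Pod without a live claim is not an error: the batch is simply dropped.
// A real implementation would also persist c.AccountID with each row.
func (s *server) ingest(podName string, batch []sample) outcome {
	c, ok := s.claimsByPod[podName]
	if !ok {
		return noLiveClaim
	}
	s.samplesByJob[c.WorkflowJobID] = append(s.samplesByJob[c.WorkflowJobID], batch...)
	return recorded
}

func main() {
	srv := &server{
		claimsByPod:  map[string]claim{"runner-a": {WorkflowJobID: 42, AccountID: 7}},
		samplesByJob: map[int64][]sample{},
	}
	batch := []sample{{UnixMillis: 1000, CPUPercent: 12.5}}
	fmt.Println(srv.ingest("runner-a", batch) == recorded)
	fmt.Println(srv.ingest("runner-idle", batch) == noLiveClaim)
}
```

Treating the unclaimed case as a successful no-op keeps the collector simple: it can post for every busy Pod without first checking whether a job is attached.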
Storage
- runner_job_machine_metrics ClickHouse table: ReplacingMergeTree on (workflow_job_id, timestamp) (re-delivered batches collapse on merge), PARTITION BY toYYYYMM(inserted_at) so the 90-day TTL drops whole monthly parts instead of mutating one ever-growing partition (the partition column matches the TTL column), matching runner_job_logs.
- A Tuist.Runners.JobMetrics context (record/3, list_for_job/1 using the argMax(..., inserted_at) dedup pattern JobSteps uses), exported from the Tuist boundary.
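To make the dedup semantics concrete, here is a small Go sketch of what a read through the argMax(..., inserted_at) pattern is meant to return: for each (workflow_job_id, timestamp) key, only the most recently inserted row. The row type and latestPerKey function are hypothetical and operate on an in-memory slice; they illustrate the intended result, not the actual query.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical, trimmed-down view of one stored sample.
type row struct {
	WorkflowJobID    int64
	TimestampMillis  int64
	InsertedAtMillis int64
	CPUPercent       float64
}

// latestPerKey returns the rows for one job, keeping only the row with the
// greatest inserted_at for each sample timestamp, ordered by timestamp.
func latestPerKey(rows []row, jobID int64) []row {
	best := map[int64]row{}
	for _, r := range rows {
		if r.WorkflowJobID != jobID {
			continue
		}
		cur, seen := best[r.TimestampMillis]
		if !seen || r.InsertedAtMillis > cur.InsertedAtMillis {
			best[r.TimestampMillis] = r
		}
	}
	out := make([]row, 0, len(best))
	for _, r := range best {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool {
		return out[i].TimestampMillis < out[j].TimestampMillis
	})
	return out
}

func main() {
	rows := []row{
		{WorkflowJobID: 1, TimestampMillis: 1000, InsertedAtMillis: 10, CPUPercent: 20},
		{WorkflowJobID: 1, TimestampMillis: 1000, InsertedAtMillis: 20, CPUPercent: 25}, // re-delivered
		{WorkflowJobID: 1, TimestampMillis: 2000, InsertedAtMillis: 10, CPUPercent: 30},
	}
	for _, r := range latestPerKey(rows, 1) {
		fmt.Println(r.TimestampMillis, r.CPUPercent)
	}
}
```

Deduplicating at read time matters because ReplacingMergeTree only collapses duplicates when parts merge, so a query issued before a merge can still see both copies.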
Dashboard
- Charts render through the Noora ECharts component on a time x-axis, which lets the RunnerMetricsHighlight hook place a markArea band by the step's [start, end] epoch window precisely (instead of snapping to a sample index). The band colour comes from a Noora overlay token so it tracks the theme (incl. light/dark), and the hook degrades gracefully if a chart isn't present. The time axis is pinned to its first/last timestamps so labels don't crowd in the narrow cards.
- Dev seed generates realistic per-job traces (CPU spikes, memory ramp, storage creep, jagged network; iowait is 0 on macOS fleets, which have no iowait accounting) so the UI is fully exercisable locally.
Scope / follow-up
This lands the feature end to end (collector + ingestion + storage + UI), gated in production behind the existing :runners beta flag. The remaining follow-up is validating the metric modeling against real telemetry: CPU% is taken relative to node-allocatable cores, cpu_iowait_percent is 0 (not exposed by the kubelet summary), and disk comes from pod ephemeral-storage. These choices are reasoned but unverified against a live fleet and may want tuning once data flows — the schema tolerates 0s and missing dimensions.
Validation
- Server: mix compile --warnings-as-errors clean; mix credo clean; Elixir/JS/CSS formatted; gettext.extract run (no .po files touched); data-export.md updated for the new table. Tests pass: the JobMetrics context, the Pod-name endpoint (auth / validation / unclaimed-no-op / records-under-claimed-job), Claims.by_pod_name/1, the LiveView Metrics tab + empty state, and the refactored pods/stopped suite (no regression).
- runners-controller (Go): go test ./..., go vet ./..., and gofmt all clean, covering the kubelet summary parsing, the POST client, and the sampler (busy-only Pod selection, CPU%/memory/disk mapping, network deltas across passes, and stale-Pod pruning).
- Verified end-to-end in the local dashboard: opened the Metrics tab and the in-card Overview row for a Linux job. All five charts render with uniform percent axes and sharp point-to-point lines; the empty state shows for jobs without samples; hovering a step shades the matching [start, end] window across the CPU/Memory/Network charts in sync, confirming the correlation hook.
🤖 Generated with Claude Code