feat(server, infra): add machine metrics to Tuist Runners CI jobs

GitHub issue · Closed

Metadata

Source

tuist/tuist #11458

Updated

Jun 25, 2026

Domains

Compute

Details

What

Adds machine metrics — CPU, Memory, Network, CPU I/O Wait, and Storage — to the Tuist Runners CI job detail page, end to end: the runner-side collector that produces the samples, the ingestion endpoint and storage, and the dashboard that charts them.

The dashboard gains a new Metrics tab with a five-chart grid, plus a CPU/Memory/Network row rendered inside the CI Details card on the Overview for an at-a-glance read.
Step ↔ graph correlation: hovering a step on the Overview shades that step’s time window across every chart, so resource usage reads against the step that produced it.
Charts use uniform percent axes (0/25/50/75/100) and straight point-to-point segments, so real spikes and dips are visible rather than smoothed away.
Jobs that have no samples yet show an empty state.

Why

Runner machine metrics have no GitHub-side source the way logs do. Logs are pulled from the Actions Logs API after a job completes; machine metrics describe the runner Pod/VM, which only our own infrastructure can observe. This change therefore reuses the Noora ECharts rendering that the build machine-metrics charts already use and the runner_job_logs per-job time-series model, and adds the one piece only our infrastructure can provide: the producer.

How

Producer — collector in the runners-controller (internal/podmetrics)

The collector is a leader-only manager.Runnable that runs on a --metrics-sample-interval cadence (default 15s; 0 disables). Each pass lists busy runner Pods (those carrying the tuist.dev/runner-pool-owner label), groups them by node, and reads each node’s kubelet /stats/summary once through the apiserver node proxy (which needs a new ClusterRole rule granting access to the nodes/proxy resource).
Per Pod it builds a sample — CPU as a percentage of node-allocatable cores, memory working-set, ephemeral-storage used/total, and network rx/tx differenced into per-interval throughput — and POSTs it keyed by Pod name, reusing the --sessions-url base and the SA-token auth.
It deliberately never learns job ids; the server resolves them (below). Sampling activates wherever --sessions-url is already configured, so the tab is functional on deploy rather than empty.
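The per-Pod sample math described above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the collector's actual code: the names Counters, Throughput, Sampler, Delta, Prune, and CPUPercent are hypothetical, and the sketch assumes the kubelet summary supplies cumulative rx/tx byte counters and CPU usage in nanocores. Clamping CPU to 100 is also an assumption, made here only to fit a 0 to 100 percent axis.

```go
package main

// Counters is a hypothetical snapshot of a Pod's cumulative network counters.
type Counters struct {
	RxBytes   uint64
	TxBytes   uint64
	UnixMilli int64 // when the counters were read
}

// Throughput is the per-interval rate derived from two snapshots.
type Throughput struct {
	RxBytesPerSec float64
	TxBytesPerSec float64
}

// Sampler keeps the previous snapshot per Pod so consecutive passes can be
// differenced into throughput.
type Sampler struct {
	prev map[string]Counters
}

func NewSampler() *Sampler {
	return &Sampler{prev: make(map[string]Counters)}
}

// Delta returns the throughput since the previous pass for this Pod.
// The bool is false on the first observation, on a non-positive interval,
// or when a counter went backwards (for example after a restart).
func (s *Sampler) Delta(pod string, cur Counters) (Throughput, bool) {
	last, seen := s.prev[pod]
	s.prev[pod] = cur
	if !seen {
		return Throughput{}, false
	}
	secs := float64(cur.UnixMilli-last.UnixMilli) / 1000
	if secs <= 0 || cur.RxBytes < last.RxBytes || cur.TxBytes < last.TxBytes {
		return Throughput{}, false
	}
	return Throughput{
		RxBytesPerSec: float64(cur.RxBytes-last.RxBytes) / secs,
		TxBytesPerSec: float64(cur.TxBytes-last.TxBytes) / secs,
	}, true
}

// Prune forgets Pods that were not seen in the current pass, so stale
// entries do not accumulate.
func (s *Sampler) Prune(live map[string]bool) {
	for pod := range s.prev {
		if !live[pod] {
			delete(s.prev, pod)
		}
	}
}

// CPUPercent expresses usage (in nanocores) as a percentage of the node's
// allocatable cores, clamped to 100 (the clamp is an assumption).
func CPUPercent(usageNanoCores uint64, allocatableCores float64) float64 {
	if allocatableCores <= 0 {
		return 0
	}
	pct := float64(usageNanoCores) / 1e9 / allocatableCores * 100
	if pct > 100 {
		return 100
	}
	return pct
}
```

With a 15s interval, a Pod whose rx counter grows by 30,000 bytes between passes reports 2,000 bytes/s; the first pass for any Pod yields no network figure because there is nothing to difference against.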

Ingestion — Pod-name-keyed, server resolves the job

The endpoint is POST /api/internal/runners/pods/:pod_name/metrics, authenticated with the runners-controller’s in-cluster ServiceAccount token (Kubernetes TokenReview + principal check). This is the same boundary as pods/stopped, now extracted into TuistWeb.RunnerControllerAuth, with pods/stopped migrated onto it.
The endpoint is keyed by Pod name because the collector knows Pods, not jobs — that mapping lives in runner_claims. The server resolves the Pod’s live claim to workflow_job_id + account_id (Claims.by_pod_name/1), the same shape as pods/stopped. An unclaimed Pod (idle/warm, or its job already released the claim) is a 204 no-op, since the collector samples every busy Pod.
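From the collector's side, the Pod-name-keyed call might look like the Go sketch below. Only the route comes from this PR. The Sample field names, the MetricsURL and PostSample helpers, sending the ServiceAccount token as a Bearer header, and treating the --sessions-url value as a scheme-and-host base are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// Sample is a hypothetical payload shape; the real field names are not
// given in this PR.
type Sample struct {
	TimestampUnix   int64   `json:"timestamp"`
	CPUPercent      float64 `json:"cpu_percent"`
	MemoryBytes     uint64  `json:"memory_bytes"`
	RxBytesPerSec   float64 `json:"network_rx_bytes_per_sec"`
	TxBytesPerSec   float64 `json:"network_tx_bytes_per_sec"`
	StorageUsedByte uint64  `json:"storage_used_bytes"`
}

// MetricsURL builds the Pod-name-keyed endpoint under a base URL.
func MetricsURL(base, podName string) string {
	return strings.TrimRight(base, "/") +
		"/api/internal/runners/pods/" + url.PathEscape(podName) + "/metrics"
}

// PostSample sends one sample for a Pod. Any 2xx is treated as success,
// including the 204 the server returns for a Pod with no live claim, so an
// unclaimed Pod never surfaces as a collector error.
func PostSample(ctx context.Context, c *http.Client, base, token, podName string, s Sample) error {
	body, err := json.Marshal(s)
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, MetricsURL(base, podName), bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token) // assumed header format
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("metrics POST for pod %s: unexpected status %d", podName, resp.StatusCode)
	}
	return nil
}
```

Keeping the job id out of the request means the collector needs no knowledge of runner_claims; the server alone decides whether a sample belongs to a job.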

Storage

runner_job_machine_metrics is a ClickHouse table using ReplacingMergeTree on (workflow_job_id, timestamp), so re-delivered batches collapse on merge. It uses PARTITION BY toYYYYMM(inserted_at): because the partition column matches the TTL column, the 90-day TTL drops whole monthly parts instead of mutating one ever-growing partition. This mirrors runner_job_logs.
A Tuist.Runners.JobMetrics context provides record/3 and list_for_job/1 (the latter using the argMax(..., inserted_at) dedup pattern that JobSteps uses) and is exported from the Tuist boundary.
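The read-side dedup can be pictured with a small Go sketch. It is not the server's code (that is Elixir over ClickHouse); it only illustrates, in memory, the rule that for each (workflow_job_id, timestamp) pair the row with the greatest inserted_at wins, which is what an argMax keyed on inserted_at achieves before background merges have collapsed duplicates. The Row type and LatestPerTimestamp function are hypothetical.

```go
package main

import "sort"

// Row is a hypothetical, trimmed-down metrics row.
type Row struct {
	WorkflowJobID string
	TimestampMs   int64 // sample time
	InsertedAtMs  int64 // ingestion time
	CPUPercent    float64
}

// LatestPerTimestamp returns one row per sample timestamp for the given
// job, keeping the most recently inserted row when a timestamp was
// delivered more than once. Results are ordered by sample time.
func LatestPerTimestamp(rows []Row, jobID string) []Row {
	latest := make(map[int64]Row)
	for _, r := range rows {
		if r.WorkflowJobID != jobID {
			continue
		}
		cur, ok := latest[r.TimestampMs]
		if !ok || r.InsertedAtMs > cur.InsertedAtMs {
			latest[r.TimestampMs] = r
		}
	}
	out := make([]Row, 0, len(latest))
	for _, r := range latest {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool {
		return out[i].TimestampMs < out[j].TimestampMs
	})
	return out
}
```

Deduplicating at read time matters because ReplacingMergeTree only collapses duplicates when parts merge, so a query issued shortly after a re-delivery can still see both copies.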

Dashboard

Charts render through the Noora ECharts component on a time x-axis, which lets the RunnerMetricsHighlight hook place a markArea band by the step’s [start, end] epoch window precisely (instead of snapping to a sample index). The band colour comes from a Noora overlay token so it tracks the theme (incl. light/dark), and the hook degrades gracefully if a chart isn’t present. The time axis is pinned to its first/last timestamps so labels don’t crowd in the narrow cards.
The dev seed generates realistic per-job traces (CPU spikes, memory ramp, storage creep, jagged network; iowait is 0 on macOS fleets, which have no iowait accounting) so the UI is fully exercisable locally.

Scope / follow-up

This lands the feature end to end (collector + ingestion + storage + UI), gated in production behind the existing :runners beta flag. The remaining follow-up is validating the metric modeling against real telemetry: CPU% is taken relative to node-allocatable cores, cpu_iowait_percent is 0 (not exposed by the kubelet summary), and disk comes from pod ephemeral-storage. These choices are reasoned but unverified against a live fleet and may want tuning once data flows — the schema tolerates 0s and missing dimensions.

Validation

Server: mix compile --warnings-as-errors clean; mix credo clean; Elixir/JS/CSS formatted; gettext.extract run (no .po files touched); data-export.md updated for the new table. Tests pass — the JobMetrics context, the Pod-name endpoint (auth / validation / unclaimed-no-op / records-under-claimed-job), Claims.by_pod_name/1, the LiveView Metrics tab + empty state, and the refactored pods/stopped suite (no regression).
runners-controller (Go): go test ./..., go vet ./..., and gofmt all clean — covering the kubelet summary parsing, the POST client, and the sampler (busy-only Pod selection, CPU%/memory/disk mapping, network deltas across passes, and stale-Pod pruning).
Verified end-to-end in the local dashboard: opened the Metrics tab and the in-card Overview row for a Linux job — all five charts render with uniform percent axes and sharp point-to-point lines; the empty state shows for jobs without samples; hovering a step shades the matching [start, end] window across the CPU/Memory/Network charts in sync, confirming the correlation hook.

🤖 Generated with Claude Code
