feat(linux-runner-image): emit resource vitals so mid-job runner deaths are diagnosable

GitHub issue · Closed

Metadata

Source

tuist/tuist #11094

Updated

Jun 24, 2026

Domains

Compute

Details

What this does

Makes a Linux runner that dies mid-job (“self-hosted runner lost communication with the server”) diagnosable, end to end: a resource-vitals probe in the runner image, plus the log-collection change that actually gets its output to Grafana.

Why

When a Linux runner dies mid-job, there is currently no trail. The kata microVM is single-shot and reaped on exit, and every external vantage point is clean by construction: no OOMKilled (a guest-internal OOM never trips the host cgroup), no node SystemOOM, no eviction, no controller reap, and no GitHub incident. The failure happens inside the guest VM, which has zero telemetry. Observed across many deaths: they consistently land on the job’s heaviest step (docker build, dependency install, cache restore, tuist cache), never on light steps, which points at the workload starving or OOM-ing the runner agent inside its VM until its heartbeat to GitHub dies. This PR makes that provable instead of inferred.

The change (three parts)

vitals.sh probe (infra/linux-runner-image/). run-job.sh backgrounds it just before exec’ing the runner (the dispatch-poll.sh rollout-bridge path does the same), so it samples for the job’s lifetime and its last line before a death lands in the Pod logs. Every line is tagged RUNNER_VITALS and is logfmt-shaped, and the fields distinguish the two leading death modes:
- Guest OOM: memory.events oom_kill increments, and/or a /dev/kmsg Out of memory: Killed process line appears.
- CPU/memory starvation: /proc/pressure/{cpu,memory} avg10 climbs, and mem.current approaches mem.max.
Fields emitted: cgroup memory.current/peak/max/swap, memory.events (oom, oom_kill), guest-wide vm.mem.total/avail from /proc/meminfo, CPU/mem/io PSI avg10, loadavg, and a best-effort /dev/kmsg OOM watcher.
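As a rough illustration of how one such sample could be assembled, here is a minimal sketch. It is not the actual vitals.sh: the function names, the directory parameters, and the mem_psi_some_avg10 field name are assumptions made for the example (mem_current, oom_kill, and cpu_psi_some_avg10 are the names mentioned in this description). It assumes cgroup v2 files and the standard PSI file layout.

```shell
# Hypothetical sketch, not the real vitals.sh. Paths are parameters so the
# functions can be pointed at fixture directories.

# Print the value for key $2 from a "key value" file such as memory.events.
kv_field() {
  [ -r "$1" ] || return 0
  awk -v k="$2" '$1 == k { print $2; exit }' "$1" 2>/dev/null
}

# Print the avg10 value of the "some" line from a PSI file such as /proc/pressure/cpu.
psi_some_avg10() {
  [ -r "$1" ] || return 0
  awk '$1 == "some" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i; exit }
  }' "$1" 2>/dev/null
}

# Emit one logfmt sample. $1 = cgroup directory, $2 = PSI directory.
emit_vitals() {
  local cg="${1:-/sys/fs/cgroup}" psi="${2:-/proc/pressure}"
  local cur=""
  if [ -r "$cg/memory.current" ]; then
    cur="$(cat "$cg/memory.current" 2>/dev/null)"
  fi
  printf 'RUNNER_VITALS mem_current=%s oom_kill=%s cpu_psi_some_avg10=%s mem_psi_some_avg10=%s\n' \
    "$cur" \
    "$(kv_field "$cg/memory.events" oom_kill)" \
    "$(psi_some_avg10 "$psi/cpu")" \
    "$(psi_some_avg10 "$psi/memory")"
}
```

Calling emit_vitals with no arguments reads the default paths. A file that is missing or unreadable leaves its field empty rather than failing the sample.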
Log collection (infra/helm/k8s-monitoring). The probe writes to stdout, but the alloy-logs DaemonSet that ships Pod logs to Loki did not tolerate the bare-metal runner taint (tuist.dev/runner-tier=bare-metal:NoSchedule), so it never scheduled on the runner nodes and their logs never reached Loki. Without this, the probe would be a no-op in production. Added the toleration to the alloy-logs collector so the DaemonSet lands on the runner nodes too.
Review-feedback hardening on the probe:
- dind scope: the cgroup mem.*/oom_kill fields are scoped to the runner container, but heavy steps run in the dind sidecar. Added guest-wide vm.mem.total/avail (/proc/meminfo, the whole microVM), which together with the already-guest-wide PSI fields and the /dev/kmsg watcher cover a Docker/buildkit OOM or pressure living in the sidecar.
- interval clamp: TUIST_RUNNER_VITALS_INTERVAL is clamped to a positive integer so a 0/empty/non-numeric value cannot spin the loop and flood logs.
- CI: vitals.sh added to the linux-runner-image workflow’s bash -n syntax check.
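Two of the hardening items above, the guest-wide /proc/meminfo read and the interval clamp, might look roughly like the sketch below. The function names, the fallback of 10 seconds, and the 9-digit length cap are assumptions for illustration; the description does not state the real default or the exact clamping rule.

```shell
# Hypothetical sketch; names and the fallback value are illustrative.

# Print the kB value for a /proc/meminfo key such as MemTotal or MemAvailable.
# $1 = key, $2 = file (defaults to /proc/meminfo). Prints nothing if unreadable.
meminfo_kb() {
  local file="${2:-/proc/meminfo}"
  [ -r "$file" ] || return 0
  awk -v k="$1:" '$1 == k { print $2; exit }' "$file" 2>/dev/null
}

# Print a positive integer interval. Empty, zero, negative, or non-numeric
# input yields the fallback, so the sampling loop can never spin.
# $1 = raw value, $2 = fallback (assumed to be 10 here).
clamp_interval() {
  local raw="${1:-}" fallback="${2:-10}"
  case "$raw" in
    '' | *[!0-9]*) printf '%s\n' "$fallback"; return 0 ;;
  esac
  # Strip leading zeros so "08" is not treated as octal; "0" becomes empty.
  while [ "${raw#0}" != "$raw" ]; do raw="${raw#0}"; done
  # Reject zero and values long enough to risk integer overflow downstream.
  if [ -z "$raw" ] || [ "${#raw}" -gt 9 ]; then
    printf '%s\n' "$fallback"
  else
    printf '%s\n' "$raw"
  fi
}
```

A loop would then use something like sleep "$(clamp_interval "${TUIST_RUNNER_VITALS_INTERVAL:-}")" between samples.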

How vitals surface

Grafana Explore, Loki datasource:

{namespace="tuist-runners"} |= "RUNNER_VITALS" | logfmt

The dead runner’s pod name equals the GitHub runner_name, so you can filter to the exact victim and read its last samples. | logfmt exposes mem_current, oom_kill, cpu_psi_some_avg10, vm_mem_avail_kb, etc. as numeric fields, so they can be graphed and alerted on with no metrics-pipeline change.

Design notes

Fail-open: every read in vitals.sh is guarded; an unreadable file (PSI disabled in the guest kernel, cgroup v1, restricted /dev/kmsg) degrades to an empty field, never an error. It can never block or fail the runner, and only runs while a job executes, so idle warm Pods stay quiet.
Toleration placement: the toleration sits under the collector’s controller: block deliberately. A top-level tolerations key on the collector renders to spec.tolerations on the Alloy CR, which the alloy-operator ignores; controller.tolerations renders to spec.controller.tolerations, which the operator applies to the DaemonSet. Verified by helm template.
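The fail-open rule in the first note can be pictured as a read helper that always succeeds. This is a minimal sketch under that assumption; read_first_line is a hypothetical name, not necessarily what vitals.sh uses.

```shell
# Hypothetical sketch of a fail-open read: print the first line of a file,
# or nothing at all if it is missing or unreadable. Always returns 0.
read_first_line() {
  local line=""
  if [ -r "$1" ]; then
    # Group the read so a failed open is silenced too; "|| true" covers
    # files without a trailing newline, where read returns non-zero.
    { IFS= read -r line < "$1"; } 2>/dev/null || true
  fi
  printf '%s' "$line"
  return 0
}
```

Because the helper never returns non-zero, a caller running under set -e is unaffected by a missing file, and the corresponding logfmt field simply comes out empty, e.g. mem_current= with no value.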

Scope and follow-ups

Linux only. macOS Tart VMs have no log egress (their logs die with the VM, and the alloy-logs Linux DaemonSet cannot run on macOS nodes), so a macOS probe needs a network sink (an SA-authed /api/internal/runners/vitals server endpoint). Tracked separately.
Diagnostic, not a fix. This deliberately changes no runner behavior; it exists so the next disconnect resolves to OOM vs starvation vs a genuinely external cause, after which the real fix (shape sizing, dind memory caps, CPU oversubscription) can be chosen with evidence.
A Grafana panel/alert (e.g. oom_kill > 0, per-pool memory/PSI) is intentionally deferred until this deploys and the field shape is confirmed against real Loki data.

Validation

bash -n and shellcheck -S warning are clean on vitals.sh, run-job.sh, and dispatch-poll.sh; the interval clamp and the /proc/meminfo parser were unit-checked.
helm template confirms the toleration renders at spec.controller.tolerations on the alloy-logs Alloy CR.
The image’s own CI (linux-runner-image.yml on pull_request) builds the image.
