feat(linux-runner-image): emit resource vitals so mid-job runner deaths are diagnosable
GitHub issue · Closed
What this does
Makes a Linux runner that dies mid-job (“self-hosted runner lost communication with the server”) diagnosable, end to end: a resource-vitals probe in the runner image, plus the log-collection change that actually gets its output to Grafana.
Why
When a Linux runner dies mid-job, there is currently no trail. The kata microVM is single-shot and reaped on exit, and every external vantage point is clean by construction: no OOMKilled (a guest-internal OOM never trips the host cgroup), no node SystemOOM, no eviction, no controller reap, and no GitHub incident. The failure happens inside the guest VM, which has zero telemetry. Observed across many deaths: they consistently land on the job’s heaviest step (docker build, dependency install, cache restore, tuist cache), never on light steps, which points at the workload starving or OOM-ing the runner agent inside its VM until its heartbeat to GitHub dies. This PR makes that provable instead of inferred.
The change (three parts)
- `vitals.sh` probe (`infra/linux-runner-image/`). `run-job.sh` backgrounds it just before exec’ing the runner (and the `dispatch-poll.sh` rollout-bridge path does too), so it samples for the job’s lifetime and its last line before a death lands in the Pod logs. Every line is tagged `RUNNER_VITALS` and is logfmt-shaped, distinguishing the two leading death modes:
  - Guest OOM: `memory.events` `oom_kill` increments, and/or a `/dev/kmsg` `Out of memory: Killed process` line appears.
  - CPU/memory starvation: `/proc/pressure/{cpu,memory}` `avg10` climbs, `mem.current` approaches `mem.max`.
  - Fields: cgroup `memory.current/peak/max/swap`, `memory.events` (`oom`, `oom_kill`), guest-wide `vm.mem.total/avail` from `/proc/meminfo`, CPU/mem/io PSI `avg10`, loadavg, and a best-effort `/dev/kmsg` OOM watcher.
- Log collection (`infra/helm/k8s-monitoring`). The probe writes to stdout, but the `alloy-logs` DaemonSet that ships Pod logs to Loki did not tolerate the bare-metal runner taint (`tuist.dev/runner-tier=bare-metal:NoSchedule`), so it never scheduled on the runner nodes and their logs never reached Loki. Without this fix, the probe would be a no-op in production. Added the toleration to the `alloy-logs` collector so the DaemonSet lands on the runner nodes too.
- Review-feedback hardening on the probe:
  - dind scope: the cgroup `mem.*`/`oom_kill` fields are scoped to the runner container, but heavy steps run in the dind sidecar. Added guest-wide `vm.mem.total/avail` (`/proc/meminfo`, the whole microVM), which together with the already-guest-wide PSI fields and the `/dev/kmsg` watcher covers a Docker/buildkit OOM or pressure living in the sidecar.
  - interval clamp: `TUIST_RUNNER_VITALS_INTERVAL` is clamped to a positive integer so a 0/empty/non-numeric value cannot spin the loop and flood logs.
  - CI: `vitals.sh` added to the `linux-runner-image` workflow’s `bash -n` syntax check.
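The interval clamp and the `/proc/meminfo` read described above can be sketched roughly as follows. This is an illustrative sketch, not the actual `vitals.sh`: the helper names `clamp_interval` and `meminfo_kb`, the fallback of 10 seconds, and the `vm_mem_total_kb` field name (chosen by analogy with `vm_mem_avail_kb`) are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of two probe helpers; names and defaults are assumptions.

# Clamp a user-supplied interval to a positive integer.
# Empty, non-numeric, negative, or zero input falls back to a default
# (10 here is an assumed value) so the sampling loop can never spin.
clamp_interval() {
  local raw="${1:-}" fallback="${2:-10}"
  case "$raw" in
    ''|*[!0-9]*) raw="$fallback" ;;   # empty, or contains a non-digit
  esac
  # All-digit but zero ("0", "000") would still spin the loop, so reject it.
  if [ "$raw" -gt 0 ] 2>/dev/null; then
    printf '%s\n' "$raw"
  else
    printf '%s\n' "$fallback"
  fi
}

# Print the kB value for one /proc/meminfo key (e.g. MemAvailable).
# Prints nothing, and still succeeds, if the file or key is missing.
meminfo_kb() {
  local key="$1" file="${2:-/proc/meminfo}"
  [ -r "$file" ] || return 0
  awk -v k="$key:" '$1 == k { print $2; exit }' "$file" 2>/dev/null || true
}

# Emit a single guest-wide sample in logfmt shape.
interval="$(clamp_interval "${TUIST_RUNNER_VITALS_INTERVAL:-}")"
printf 'RUNNER_VITALS interval=%s vm_mem_total_kb=%s vm_mem_avail_kb=%s\n' \
  "$interval" "$(meminfo_kb MemTotal)" "$(meminfo_kb MemAvailable)"
```

In a real probe the final `printf` would sit inside a `while :; do ...; sleep "$interval"; done` loop; clamping before the loop is what keeps a bad environment value from turning it into a busy loop.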
How vitals surface
Grafana Explore, Loki datasource:
{namespace="tuist-runners"} |= "RUNNER_VITALS" | logfmt
The dead runner’s pod name equals the GitHub runner_name, so you can filter to the exact victim and read its last samples. The | logfmt stage exposes mem_current, oom_kill, cpu_psi_some_avg10, vm_mem_avail_kb, etc. as fields with numeric values, so they can be graphed and alerted on with no metrics-pipeline change.
Design notes
- Fail-open: every read in `vitals.sh` is guarded; an unreadable file (PSI disabled in the guest kernel, cgroup v1, restricted `/dev/kmsg`) degrades to an empty field, never an error. It can never block or fail the runner, and only runs while a job executes, so idle warm Pods stay quiet.
- Toleration placement: it sits under the collector’s `controller:` block deliberately. A top-level `tolerations` on the collector renders to `spec.tolerations` on the Alloy CR, which the alloy-operator ignores; `controller.tolerations` renders to `spec.controller.tolerations`, which it applies to the DaemonSet. Verified by `helm template`.
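A minimal sketch of the fail-open read pattern, assuming cgroup v2 files under `/sys/fs/cgroup` and PSI files under `/proc/pressure`. The helper names (`read_or_empty`, `psi_some_avg10`, `memory_event`, `emit_sample`) are hypothetical and the real `vitals.sh` may be structured differently; only the field names `mem_current`, `oom_kill`, and `cpu_psi_some_avg10` come from the query output described earlier.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: every read degrades to an empty field, never an error.

# Print a file's contents, or nothing if it is missing or unreadable.
# Always returns 0 so a caller can never fail because of a read.
read_or_empty() {
  [ -r "$1" ] && cat "$1" 2>/dev/null
  return 0
}

# Extract avg10 from the "some" line of a PSI file, which looks like:
#   some avg10=1.23 avg60=0.50 avg300=0.10 total=12345
psi_some_avg10() {
  read_or_empty "$1" | awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2; exit }'
  return 0
}

# Extract one counter (e.g. oom_kill) from a cgroup v2 memory.events file,
# whose lines are "<name> <count>".
memory_event() {
  read_or_empty "$2" | awk -v k="$1" '$1 == k { print $2; exit }'
  return 0
}

# Emit one logfmt sample. The directory arguments exist only to make the
# sketch testable; the defaults are assumed mount points.
emit_sample() {
  local cg="${1:-/sys/fs/cgroup}" psi="${2:-/proc/pressure}"
  printf 'RUNNER_VITALS mem_current=%s oom_kill=%s cpu_psi_some_avg10=%s\n' \
    "$(read_or_empty "$cg/memory.current")" \
    "$(memory_event oom_kill "$cg/memory.events")" \
    "$(psi_some_avg10 "$psi/cpu")"
}

emit_sample
```

On a host without PSI or cgroup v2 the same call still prints a well-formed line with empty values, which is the behaviour the fail-open note describes.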
Scope and follow-ups
- Linux only. macOS Tart VMs have no log egress (their logs die with the VM, and the `alloy-logs` Linux DaemonSet cannot run on macOS nodes), so a macOS probe needs a network sink (a SA-authed `/api/internal/runners/vitals` server endpoint). Tracked separately.
- Diagnostic, not a fix. This deliberately changes no runner behavior; it exists so the next disconnect resolves to OOM vs starvation vs genuine external, after which the real fix (shape sizing, dind memory caps, CPU oversubscription) can be chosen with evidence.
- A Grafana panel/alert (e.g. `oom_kill > 0`, per-pool memory/PSI) is intentionally deferred until this deploys and the field shape is confirmed against real Loki data.
Validation
- `bash -n` and `shellcheck -S warning` clean on `vitals.sh`, `run-job.sh`, `dispatch-poll.sh`; interval clamp and `/proc/meminfo` parser unit-checked.
- `helm template` confirms the toleration renders at `spec.controller.tolerations` on the `alloy-logs` Alloy CR.
- The image’s own CI (`linux-runner-image.yml` on pull_request) builds the image.