
feat(infra): surface Linux runner death cause from runner and host sides

GitHub issue · Closed

Source: tuist/tuist #11114
Updated: Jun 24, 2026
Domains: Compute

What

Two complementary observability instruments so the next mid-job Linux runner death is self-diagnosing, instead of leaving us to infer from the outside:

  1. runners-controller (podtemplate.go): set ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT=1 on the runner container. This mirrors the actions/runner _diag log to stdout, which the existing pod-log pipeline ships to Loki, so the runner’s own account of why it exited survives the Pod being reaped.
  2. k8s-monitoring (values.yaml): enable the chart’s first-class feature-node-logs on the alloy-logs DaemonSet, shipping node journald (containerd, kubelet, and kernel) to Loki.
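As a rough illustration of the first change, the sketch below sets the variable on a runner container and leaves a poller container untouched. The `EnvVar` and `Container` types are simplified stand-ins for the Kubernetes `corev1` types, and `withDiagToStdout`, `hasEnv`, and the container names are hypothetical; the actual code in podtemplate.go may be shaped differently.

```go
package main

import "fmt"

// EnvVar and Container are simplified stand-ins for corev1.EnvVar and
// corev1.Container, so this sketch has no external dependencies.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

const printLogEnv = "ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT"

// withDiagToStdout returns a copy of c with printLogEnv set to "1".
// An existing entry with the same name is replaced, not duplicated.
func withDiagToStdout(c Container) Container {
	env := make([]EnvVar, 0, len(c.Env)+1)
	for _, e := range c.Env {
		if e.Name != printLogEnv {
			env = append(env, e)
		}
	}
	c.Env = append(env, EnvVar{Name: printLogEnv, Value: "1"})
	return c
}

// hasEnv reports whether c carries the variable name with the given value.
func hasEnv(c Container, name, value string) bool {
	for _, e := range c.Env {
		if e.Name == name && e.Value == value {
			return true
		}
	}
	return false
}

func main() {
	runner := withDiagToStdout(Container{Name: "runner"}) // hypothetical name
	poller := Container{Name: "poller"}                   // hypothetical name, left as-is
	fmt.Println("runner has it:", hasEnv(runner, printLogEnv, "1"))
	fmt.Println("poller has it:", hasEnv(poller, printLogEnv, "1"))
}
```

The same two checks (present on the runner, absent on the poller) are what the new unit test described under Validation asserts.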

Why

Busy Linux runners die mid-job and GitHub reports “the self-hosted runner lost communication.” Forensics narrowed the termination to a clean exitCode=9, reason=Error, no signal, empty message, and crucially the same signature across both shapes (2vcpu-8gb and 4vcpu-16gb) and across the full load spectrum (one died idle at load 0.6, another saturated at load 7.59).

That rules out, with data, every cause we can see from inside the cluster:

  • Not memory. Guest oom=0; the two bare-metal nodes sit at 24–48% memory used with MemoryPressure=False; zero SystemOOM events.
  • Not a signal kill / OOM-killer. A SIGKILL surfaces as 137 (kata follows the standard 128+signo convention), not a literal 9.
  • Not the controller. Reconciles show target >= observed; busy pods are never drained, only already-terminal pods are reaped.
  • Not the runner software. The v2.334.0 ReturnCode enum spans 0–7, and run-helper.sh folds any unknown code to exit 0, so no software path in the image can emit 9. Our wrapper scripts only exit 0/1.

The only consistent reading is that the runner is killed from outside the guest, by a host-side microVM teardown at the kata-shim, qemu, or containerd level. That is exactly the layer our in-guest vitals.sh probe is structurally blind to, and it explains why the runner vanishes with no final stdout.
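The exit-code reasoning above can be sketched as a small classifier. It assumes only the 128+signo convention and the 0–7 ReturnCode range quoted in this description; the function names are hypothetical and nothing here is taken from the runner or kata source.

```go
package main

import "fmt"

// classifyExit interprets a container exit code under the standard
// 128+signo convention: codes in 129..255 mean "terminated by signal".
func classifyExit(code int) string {
	switch {
	case code == 0:
		return "clean exit"
	case code > 128 && code < 256:
		return fmt.Sprintf("killed by signal %d", code-128)
	default:
		return fmt.Sprintf("process exited with status %d", code)
	}
}

// inRunnerReturnCodeRange reports whether code falls inside the 0..7 range
// that this description attributes to the v2.334.0 ReturnCode enum.
func inRunnerReturnCodeRange(code int) bool {
	return code >= 0 && code <= 7
}

func main() {
	for _, code := range []int{137, 9} {
		fmt.Printf("%d: %s (runner enum range: %v)\n",
			code, classifyExit(code), inRunnerReturnCodeRange(code))
	}
}
```

Under these assumptions a SIGKILL maps to 137, while the observed literal 9 is neither a signal-style code nor a value the runner enum covers, which is what points the investigation outside the guest.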

How this closes the gap

  • The _diag stream gives the runner-side reason: a logged clean exit versus an abrupt mid-job cutoff in itself distinguishes “agent quit” from “VM killed”.
  • Node journald gives the host-side actor. The keep filter intentionally includes ^$ (empty systemd unit) because kernel messages carry no unit. That is where OOM/segfault/qemu kill lines live, and dropping them would discard the exact evidence we need. The feature is assigned to alloy-logs because that DaemonSet already tolerates the bare-metal runner taint and selects linux nodes, so it runs where the microVMs live. The existing varlog mount already exposes /var/log/journal, so no extra mount is needed.
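To show why the empty-unit alternative matters, here is a minimal sketch of a keep filter applied to journal entries. The regex, the unit names, and the sample messages are assumptions for illustration only; the real keep expression lives in the chart values and may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepUnit is an illustrative keep filter. The leading ^$ alternative matches
// entries with an empty systemd unit, which is how kernel messages appear.
// The unit names are assumptions, not copied from values.yaml.
var keepUnit = regexp.MustCompile(`^$|^(containerd|kubelet)\.service$`)

// entry is a hypothetical, minimal view of a journald record.
type entry struct {
	Unit    string
	Message string
}

// keep returns only the entries whose unit matches the keep filter.
func keep(entries []entry) []entry {
	var kept []entry
	for _, e := range entries {
		if keepUnit.MatchString(e.Unit) {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	// Sample messages are invented for the sketch.
	sample := []entry{
		{Unit: "", Message: "kernel: example kill line"},
		{Unit: "kubelet.service", Message: "example kubelet line"},
		{Unit: "cron.service", Message: "example unrelated line"},
	}
	for _, e := range keep(sample) {
		fmt.Printf("kept unit=%q msg=%q\n", e.Unit, e.Message)
	}
}
```

Without the `^$` alternative the first sample entry would be dropped, which is the failure mode the filter is designed to avoid.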

Validation

  • go build ./... and go test ./internal/podtemplate/ pass; gofmt and go vet are clean. A new test asserts the env var is present on the runner container (both macOS and Linux) and absent on the poller.
  • helm lint + helm template succeed across production/staging/canary overlays. Verified the rendered Alloy CR has the journal source (path=/var/log/journal, max_age=4h, kernel-inclusive keep regex), the kubernetes.io/os: linux nodeSelector, and the runner-tier toleration.

Notes

  • Deploy-time check: the journal source reads /var/log/journal (persistent journald, the k8s convention). First post-deploy verification is that {job="integrations/kubernetes/journal"} returns lines; if empty, the nodes are volatile and we switch the path to /run/log/journal.
  • Cost: both changes raise Loki ingest (_diag is verbose; node journald on the churning runner nodes is chatty). They are diagnostic instruments: once the teardown cause is named, ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT can return to 0 and node-logs can be tightened or disabled.
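The deploy-time check in the first note could be scripted along the following lines. This is a sketch under assumptions: the Loki base URL is a placeholder, `journalQueryURL` and `journalPath` are hypothetical helpers, and the code only builds the request URL for Loki's `/loki/api/v1/query_range` endpoint without sending it.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// journalSelector is the LogQL selector named in the deploy-time check.
const journalSelector = `{job="integrations/kubernetes/journal"}`

// journalQueryURL builds a query_range URL asking Loki for at most one line
// from the journal job. The base URL is supplied by the caller.
func journalQueryURL(base string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/loki/api/v1/query_range"
	q := url.Values{}
	q.Set("query", journalSelector)
	q.Set("limit", "1")
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// journalPath encodes the fallback rule from the note: keep the persistent
// path if the query returned lines, otherwise switch to the volatile path.
func journalPath(linesFound bool) string {
	if linesFound {
		return "/var/log/journal"
	}
	return "/run/log/journal"
}

func main() {
	// "http://loki.example:3100" is a placeholder, not a real endpoint.
	u, err := journalQueryURL("http://loki.example:3100")
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	fmt.Println("query:", u)
	fmt.Println("path if empty:", journalPath(false))
}
```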

🤖 Generated with Claude Code

Comments
tuist-atlas[bot] Jun 6, 2026

The changes from this pull request are now available in runners-controller@0.11.0. Update to this version to surface Linux runner death cause from runner and host sides.