Hive
feat(infra): surface Linux runner death cause from runner and host sides
GitHub issue · Closed
What
Two complementary observability instruments so the next mid-job Linux runner death is self-diagnosing, instead of leaving us to infer from the outside:
- runners-controller (podtemplate.go): set ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT=1 on the runner container. This mirrors the actions/runner _diag log to stdout, which the existing pod-log pipeline ships to Loki, so the runner’s own account of why it exited survives the Pod being reaped.
- k8s-monitoring (values.yaml): enable the chart’s first-class feature-node-logs on the alloy-logs DaemonSet, shipping node journald (containerd, kubelet, and kernel) to Loki.
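The runner-side change can be sketched roughly as follows. This is a minimal illustration, not the actual podtemplate.go code: `EnvVar` and `Container` are simplified stand-ins for the Kubernetes `corev1` types, and the helper names and the container names `runner` and `poller` are assumptions made for the example.

```go
package main

import "fmt"

// EnvVar and Container are simplified stand-ins for corev1.EnvVar and
// corev1.Container so the sketch runs with the standard library only.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

const printLogEnv = "ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT"

// withDiagToStdout (hypothetical helper) returns a copy of the containers in
// which the container named runnerName carries
// ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT=1. Other containers are left unchanged.
func withDiagToStdout(containers []Container, runnerName string) []Container {
	out := make([]Container, len(containers))
	for i, c := range containers {
		// Copy the env slice so the caller's containers are not mutated.
		env := append([]EnvVar(nil), c.Env...)
		if c.Name == runnerName {
			env = setEnv(env, printLogEnv, "1")
		}
		out[i] = Container{Name: c.Name, Env: env}
	}
	return out
}

// setEnv overwrites an existing variable or appends a new one.
func setEnv(env []EnvVar, name, value string) []EnvVar {
	for i := range env {
		if env[i].Name == name {
			env[i].Value = value
			return env
		}
	}
	return append(env, EnvVar{Name: name, Value: value})
}

func main() {
	// Container names here are assumed for illustration.
	pod := []Container{{Name: "runner"}, {Name: "poller"}}
	for _, c := range withDiagToStdout(pod, "runner") {
		fmt.Printf("%s: %v\n", c.Name, c.Env)
	}
}
```

Scoping the variable to the runner container only keeps the poller's stdout unchanged, which matches what the new test is described as asserting.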
Why
Busy Linux runners die mid-job and GitHub reports “the self-hosted runner lost communication.” Forensics narrowed the termination to a clean exitCode=9, reason=Error, no signal, empty message, and crucially the same signature across both shapes (2vcpu-8gb and 4vcpu-16gb) and across the full load spectrum (one died idle at load 0.6, another saturated at load 7.59).
That rules out, with data, every cause we can see from inside the cluster:
- Not memory. Guest oom=0; the two bare-metal nodes sit at 24–48% memory used with MemoryPressure=False; zero SystemOOM events.
- Not a signal kill / OOM-killer. A SIGKILL surfaces as 137 (kata follows the standard 128+signo convention), not a literal 9.
- Not the controller. Reconciles show target >= observed; busy pods are never drained, only already-terminal pods are reaped.
- Not the runner software. The v2.334.0 ReturnCode enum spans 0–7, and run-helper.sh folds any unknown code to exit 0, so no software path in the image can emit 9. Our wrapper scripts only exit 0/1.
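The exit-code reasoning in the list above can be expressed as a small check. This is an illustrative sketch only: the function names are hypothetical, and the 0–7 range is taken from this issue's reading of the v2.334.0 ReturnCode enum rather than verified here.

```go
package main

import "fmt"

// signalFromExit decodes the 128+signo convention: an exit code in 129..255
// means the process was terminated by signal (code - 128).
func signalFromExit(code int) (signo int, ok bool) {
	if code > 128 && code < 256 {
		return code - 128, true
	}
	return 0, false
}

// inRunnerRange reports whether the code falls in the 0-7 range that the
// issue attributes to the runner's ReturnCode enum (an assumption here).
func inRunnerRange(code int) bool {
	return code >= 0 && code <= 7
}

// describe summarizes what a given exit code can and cannot be.
func describe(code int) string {
	if signo, ok := signalFromExit(code); ok {
		return fmt.Sprintf("exit %d: signal-encoded, signal %d", code, signo)
	}
	if inRunnerRange(code) {
		return fmt.Sprintf("exit %d: within the runner's own return codes", code)
	}
	return fmt.Sprintf("exit %d: neither signal-encoded nor a runner return code", code)
}

func main() {
	for _, code := range []int{9, 137} {
		fmt.Println(describe(code))
	}
}
```

Under these assumptions a literal 9 falls into the last branch, which is the gap the two new log sources are meant to explain.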
The only consistent reading: the runner is killed from outside the guest (a host-side kata-shim/qemu/containerd-level microVM teardown), which is exactly the layer our in-guest vitals.sh probe is structurally blind to, and why the runner vanishes with no final stdout.
How this closes the gap
- The _diag stream gives the runner-side reason (a clean exit reason vs. an abrupt mid-job cutoff itself distinguishes “agent quit” from “VM killed”).
- Node journald gives the host-side actor. The keep filter intentionally includes ^$ (empty systemd unit) because kernel messages carry no unit; that is where OOM/segfault/qemu kill lines live, and dropping them would discard the exact evidence we need. The feature is assigned to alloy-logs because that DaemonSet already tolerates the bare-metal runner taint and selects linux nodes, so it runs where the microVMs live; the existing varlog mount already exposes /var/log/journal, so no extra mount is needed.
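To illustrate why the empty-unit alternative matters, here is a sketch of a keep filter applied to the systemd unit field. The exact regex in values.yaml is not shown in this issue, so the pattern and the unit names `containerd.service` and `kubelet.service` are assumptions for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepUnit is a hypothetical keep regex: it accepts an empty unit (kernel
// messages carry no systemd unit) or one of the two node services.
var keepUnit = regexp.MustCompile(`^$|^(containerd|kubelet)\.service$`)

// keep reports whether a journal entry with the given unit is shipped.
func keep(unit string) bool {
	return keepUnit.MatchString(unit)
}

func main() {
	for _, unit := range []string{"", "containerd.service", "kubelet.service", "sshd.service"} {
		fmt.Printf("unit=%q keep=%v\n", unit, keep(unit))
	}
}
```

Without the `^$` alternative, entries with an empty unit would be rejected along with unrelated services, which is the case the filter is written to avoid.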
Validation
- go build ./... + go test ./internal/podtemplate/ pass; gofmt / go vet clean. New test asserts the env var is on the runner container (both macOS and Linux) and absent on the poller.
- helm lint + helm template succeed across production/staging/canary overlays. Verified the rendered Alloy CR has the journal source (path=/var/log/journal, max_age=4h, kernel-inclusive keep regex), the kubernetes.io/os: linux nodeSelector, and the runner-tier toleration.
Notes
- Deploy-time check: the journal source reads /var/log/journal (persistent journald, the k8s convention). First post-deploy verification is that {job="integrations/kubernetes/journal"} returns lines; if it returns nothing, journald on the nodes is using volatile storage and we switch the path to /run/log/journal.
- Cost: both raise Loki ingest (_diag is verbose; node journald on the churning runner nodes is chatty). They are diagnostic instruments: once the teardown cause is named, ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT can return to 0 and node-logs can be tightened or disabled.
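As a complement to the Loki query, the persistent-versus-volatile question could also be checked directly on a node. The following is a rough sketch under that assumption; the helper names are hypothetical and this is not part of the change itself.

```go
package main

import (
	"fmt"
	"os"
)

const (
	persistentJournal = "/var/log/journal" // persistent journald storage
	volatileJournal   = "/run/log/journal" // volatile (tmpfs) journald storage
)

// pickJournalPath returns the first journal directory for which isDir
// reports true, preferring the persistent location.
func pickJournalPath(isDir func(string) bool) (string, bool) {
	for _, p := range []string{persistentJournal, volatileJournal} {
		if isDir(p) {
			return p, true
		}
	}
	return "", false
}

// dirExists reports whether path exists and is a directory.
func dirExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.IsDir()
}

func main() {
	if p, ok := pickJournalPath(dirExists); ok {
		fmt.Println("journal directory:", p)
	} else {
		fmt.Println("no journal directory found")
	}
}
```

Passing the directory check in as a function keeps the selection logic testable without touching the real filesystem.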
🤖 Generated with Claude Code