feat(infra): re-emit abnormal runner death logs before the reap
GitHub issue · Closed
Make mid-job Linux runner deaths diagnosable.
What changed
When a runner Pod ends abnormally, the pod-lifecycle reconciler now reads the runner container’s log tail via pods/log and re-emits it to the controller’s own stdout ("runner death log captured", tagged with pod/pool/endedAt), so the trail survives the reap that deletes the Pod.
- infra/runners-controller/controllers/pod_lifecycle_controller.go: capture logic (captureDeathLog/fetchRunnerLog/abnormalEnd), bounded to 200 lines / 64 KB, deduped via a sync.Map, best-effort (never fails or requeues the reconcile).
- infra/runners-controller/cmd/manager/main.go: wire a typed clientset (the cached controller-runtime client cannot serve the logs subresource).
- infra/helm/tuist/templates/runners-controller-deployment.yaml: grant pods/log: get on the controller Role (namespaced to tuist-runners).
- Tests + AGENTS.md.
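The bounding and dedup steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in pod_lifecycle_controller.go: the names boundTail, deathLogDeduper, and firstCapture are hypothetical, and keying the dedup on the Pod UID is an assumption. The pods/log API also accepts tailLines and limitBytes, so the real fetch may bound server-side; the sketch bounds a string that has already been fetched.

```go
package main

import (
	"strings"
	"sync"
)

// Limits taken from the description above: 200 lines / 64 KB.
const (
	deathLogMaxLines = 200
	deathLogMaxBytes = 64 * 1024
)

// boundTail keeps at most maxLines trailing lines of log, then trims from the
// front so the result is at most maxBytes bytes. If the byte cut lands in the
// middle of a line, the partial line is dropped when a later newline exists.
func boundTail(log string, maxLines, maxBytes int) string {
	lines := strings.Split(strings.TrimRight(log, "\n"), "\n")
	if len(lines) > maxLines {
		lines = lines[len(lines)-maxLines:]
	}
	out := strings.Join(lines, "\n")
	if len(out) > maxBytes {
		cut := len(out) - maxBytes
		tail := out[cut:]
		if out[cut-1] != '\n' {
			if i := strings.IndexByte(tail, '\n'); i >= 0 {
				tail = tail[i+1:]
			}
		}
		out = tail
	}
	return out
}

// deathLogDeduper remembers which Pods have already had their log re-emitted.
// Assumption: the key is the Pod UID; the real controller may key differently.
type deathLogDeduper struct {
	seen sync.Map // key: Pod UID (string), value: struct{}
}

// firstCapture reports true only the first time it is called for a given UID,
// so repeated reconciles of the same dead Pod emit the log once.
func (d *deathLogDeduper) firstCapture(podUID string) bool {
	_, loaded := d.seen.LoadOrStore(podUID, struct{}{})
	return !loaded
}
```

A reconcile would call firstCapture first and return early on false, then pass the fetched log through boundTail(log, deathLogMaxLines, deathLogMaxBytes) before logging it. LoadOrStore makes the check-and-mark step atomic, which matters if reconciles for the same Pod can run concurrently.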
Why
A tuist-linux-large runner died mid-job and surfaced to GitHub as “self-hosted runner lost communication”. Investigation showed the in-guest RUNNER_VITALS probe (the thing built to make these deaths diagnosable) captured nothing: across 17M+ vitals lines in 24h, not one showed distress, and that dead Pod shipped zero vitals despite running for 12 minutes.
Root cause: the trail lives only on the runner container’s stdout, which flows to the kubelet container log and is GC’d the instant the reap deletes the Pod. alloy does not reliably win that race on a churning node, and the final abrupt-kill sample never flushes. So the trail dies with the Pod.
This is distinct from the earlier host-OOM/SIGBUS incident. Host-side telemetry (node-exporter + journald, shipped in #11324 / #11114) confirms this death was not host memory exhaustion: the node peaked at ~59% RAM with node_vmstat_oom_kill at 0 and no QEMU/kata fault in the journal. The likely cause was transient in-guest CPU/IO contention dropping the GitHub heartbeat — exactly the kind of thing the vitals trail should have recorded.
Why this approach over the obvious alternative
The obvious fix is to write vitals to a host-backed path so they outlive the Pod. But the only place to write from is the runner container, which runs untrusted workflow code and whose kata-qemu microVM is the isolation boundary. A writable hostPath there would let customer code fill the shared node’s disk and inject log content. Rejected.
Capturing the trail from the controller keeps it entirely off the untrusted boundary: the controller has a durable, long-lived stdout, and PodLifecycleReconciler already fires on the terminal transition ~60s before the RunnerPoolReconciler reap, so the kubelet log still exists to pull.
Capture is gated on the runner’s exit code, so a workflow that fails its own tests (the runner still exits 0) is skipped — this captures runner infrastructure deaths, not job outcomes, keeping the volume low.
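A minimal sketch of that gate, under stated assumptions: runnerTermination is a hypothetical, simplified stand-in for the terminated state Kubernetes reports for a container, abnormalExit is not the PR's abnormalEnd, and treating "no terminated state" as "nothing to capture" is a choice made here for illustration only.

```go
package main

// runnerTermination is a hypothetical stand-in for the runner container's
// terminated state (exit code plus reason).
type runnerTermination struct {
	ExitCode int32
	Reason   string // for example "Completed", "Error", "OOMKilled"
}

// abnormalExit reports whether the runner's exit warrants a death-log capture.
// A nil state means the container has not reported termination; this sketch
// assumes there is nothing to capture in that case.
// Exit code 0 covers both passing jobs and jobs whose own tests failed,
// because the runner process itself still exits cleanly.
func abnormalExit(t *runnerTermination) bool {
	if t == nil {
		return false
	}
	return t.ExitCode != 0
}
```

Gating on the runner process rather than on the job result is what keeps the volume low: a red CI run produces no capture, while a killed or crashed runner does.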
Limitation
If the guest console never wrote to the kubelet log at all (a kata-console failure), the fetch comes back empty and this change records nothing for that death. The hard guarantee against that is host-side cgroup sampling (a DaemonSet sampling each Pod’s QEMU/cgroup from the node, independent of the guest), deferred until we observe blind deaths persisting after this ships.
User/developer impact
No runtime behavior change for jobs. Operators gain a durable forensic record for infrastructure deaths. Needs a runners-controller image release to land; the RBAC rides the same chart; only active where the reconciler is already enabled (sessionsURL set, already true in prod).
How to test locally
- cd infra/runners-controller && go test ./... && go vet ./... (new cases: TestAbnormalEnd, TestPodLifecycle_CapturesDeathLogOnAbnormalExit, TestPodLifecycle_NoCaptureOnCleanExit, TestPodLifecycle_DeduplicatesDeathLogCapture).
- After deploy, the next abnormal death is queryable in Loki (grafanacloud-logs): {pod=~".*runners-controller.*"} |= "runner death log captured".