Hive
feat(infra): log terminal runner exit code/reason for lost-communication post-mortems
GitHub issue · Closed
What
When the RunnerPoolReconciler reaps a terminal (Succeeded/Failed) runner Pod, it now logs the runner container’s exitCode + signal + reason first, before deleting the Pod.
Why
We’ve chased “self-hosted runner lost communication with the server” deaths through three different hypotheses (guest OOM, network, controller recycling) and hard evidence refuted each. The recurring problem was always the same: the one datum that classifies the death — how the runner process actually ended — was never captured. The Pod is reaped within seconds (so kubectl logs is gone), and older runner images don’t carry the in-VM vitals probe.
The runner container’s terminated state is exactly that datum, and the controller already holds the terminal Pod in the reconcile that reaps it. Logging it there is durable (controller logs ship to Loki) and image-independent (works for any Pod, including ones predating the vitals probe — which is precisely the case that left a recent investigation with no data).
The fingerprint
exitCode |
reason |
Cause |
|---|---|---|
0 |
Completed | Clean exit — a lost-comms here is the GitHub completion handshake failing, not a death |
137 |
OOMKilled | Host cgroup OOM (VM hit its memory limit) |
137 |
Error | Guest-internal OOM or in-VM SIGKILL (host cgroup fine) |
signal 15 |
— | SIGTERM = pod deletion / drain |
| other | Error | Crash (segfault/abort) |
Debugging flow: lost-comms job → runner_name (= pod name) → grep controller logs in Loki for runner pod terminated pod=… exitCode=… reason=… → classify → cross-reference vitals only if it’s a 137-class. Turns a forensics expedition into a lookup.
Scope and safety
- Read-only: it logs a field on a Pod the reconcile already observes. No behavior change, no new RBAC.
- Captured in the terminal-reap branch (Succeeded/Failed Pods), so it covers the normal single-shot completion path. A Pod hard-deleted out from under the controller before it’s reaped won’t be captured — but that absence is itself diagnostic (VM/node hard-kill vs in-guest process exit).
Validation
go build / go vet / go test ./controllers/... clean. New TestRunnerTerminated covers the 137/Error fingerprint, LastTerminationState fallback, and the no-runner-container / running-runner nil cases.