feat(infra): log terminal runner exit code/reason for lost-communication post-mortems

GitHub issue · Closed

Metadata
Source: tuist/tuist #11109
Updated: Jun 24, 2026
Domains: Compute

Details

What

When the RunnerPoolReconciler reaps a terminal (Succeeded/Failed) runner Pod, it now logs the runner container’s exitCode, signal, and reason before deleting the Pod.

Why

We’ve chased “self-hosted runner lost communication with the server” deaths through three different hypotheses (guest OOM, network, controller recycling), and hard evidence refuted each one. The recurring problem was always the same: the one datum that classifies the death, namely how the runner process actually ended, was never captured. The Pod is reaped within seconds (so there is nothing left for kubectl logs to read), and older runner images don’t carry the in-VM vitals probe.

The runner container’s terminated state is exactly that datum, and the controller already holds the terminal Pod in the reconcile that reaps it. Logging it there is durable (controller logs ship to Loki) and image-independent (works for any Pod, including ones predating the vitals probe — which is precisely the case that left a recent investigation with no data).

The fingerprint

| exitCode | reason | Cause |
|---|---|---|
| 0 | Completed | Clean exit: a lost-comms here is the GitHub completion handshake failing, not a death |
| 137 | OOMKilled | Host cgroup OOM (VM hit its memory limit) |
| 137 | Error | Guest-internal OOM or in-VM SIGKILL (host cgroup fine) |
| signal 15 | | SIGTERM = pod deletion / drain |
| other | Error | Crash (segfault/abort) |
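The fingerprint can also be written down as a small lookup function. The following is a minimal sketch, not part of the change itself: the function name `classify`, the returned strings, and the order in which the rows are checked (signal 15 first) are assumptions for illustration.

```go
package main

import "fmt"

// classify (hypothetical helper) maps the logged exitCode, signal, and reason
// to the cause listed in the fingerprint table. The check order is an
// assumption; the table itself does not define precedence.
func classify(exitCode, signal int32, reason string) string {
	switch {
	case signal == 15:
		return "SIGTERM: pod deletion / drain"
	case exitCode == 0 && reason == "Completed":
		return "clean exit: GitHub completion handshake failed"
	case exitCode == 137 && reason == "OOMKilled":
		return "host cgroup OOM"
	case exitCode == 137 && reason == "Error":
		return "guest-internal OOM or in-VM SIGKILL"
	case reason == "Error":
		return "crash (segfault/abort)"
	default:
		return "unclassified"
	}
}

func main() {
	fmt.Println(classify(137, 0, "OOMKilled"))
	fmt.Println(classify(137, 0, "Error"))
}
```

Anything that does not match a row falls through to "unclassified" rather than being forced into the nearest category.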

Debugging flow: lost-comms job → runner_name (= pod name) → grep controller logs in Loki for "runner pod terminated pod=… exitCode=… reason=…" → classify → cross-reference vitals only if it’s a 137-class death. This turns a forensics expedition into a lookup.
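Once the matching log line has been found, its fields can be pulled out mechanically. The sketch below assumes the line is emitted as space-separated key=value pairs, as the message shape quoted above suggests. The actual encoding depends on the controller's logger configuration and may be JSON instead. The helper name `parseFields` and the sample line with its values are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFields (hypothetical helper) collects key=value tokens from a log line.
// Tokens without "=" (such as the message words) are ignored. Values that
// contain spaces are not supported by this simple split.
func parseFields(line string) map[string]string {
	fields := map[string]string{}
	for _, tok := range strings.Fields(line) {
		if k, v, ok := strings.Cut(tok, "="); ok {
			fields[k] = v
		}
	}
	return fields
}

func main() {
	// Hypothetical sample line; the pod name and values are made up.
	line := "runner pod terminated pod=example-pod exitCode=137 reason=Error"
	f := parseFields(line)
	fmt.Println(f["pod"], f["exitCode"], f["reason"])
}
```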

Scope and safety

  • Read-only: it logs a field on a Pod the reconcile already observes. No behavior change, no new RBAC.
  • Captured in the terminal-reap branch (Succeeded/Failed Pods), so it covers the normal single-shot completion path. A Pod hard-deleted out from under the controller before it’s reaped won’t be captured — but that absence is itself diagnostic (VM/node hard-kill vs in-guest process exit).

Validation

go build / go vet / go test ./controllers/... clean. New TestRunnerTerminated covers the 137/Error fingerprint, LastTerminationState fallback, and the no-runner-container / running-runner nil cases.

Comments
tuist-atlas[bot] Jun 6, 2026

The logging for terminal runner exit codes and reasons is now available in release runners-controller@0.10.0. Update to this version to enable post-mortem debugging of lost-communication issues via the controller logs.