fix(server): reap orphaned owner-stamped runner pods to prevent fleet wedge
GitHub issue · Closed
What changed
Adds Tuist.Runners.Workers.OrphanedStampedPodsWorker, a once-a-minute reconciliation worker that reaps runner Pods which carry an owner stamp (tuist.dev/runner-pool-owner) but have no live Postgres runner_claims row — “stamped in Kubernetes, unclaimed in the database.” Supporting pieces:
- Tuist.Kubernetes.Client.delete_runner/3: deletes a Pod and its same-named per-Pod ServiceAccount.
- Tuist.Runners.Claims.live_pod_names/0: the set of pod_names backing a live claim.
- Registers the worker in @hosted_only_crons, and declares the new grace_seconds Logger metadata key.
- Grants the server ServiceAccount delete on pods and serviceaccounts in the runners namespace (runners-namespace.yaml). The dispatch path only ever read/patched Pods, so delete was never granted; without it the worker would 403 on every reap in production and be inert.
Why it changed (root cause)
This is the fix for a production incident where Linux runner jobs queued for 5+ minutes. The Linux fleet was wedged: ~28 4vcpu-16gb Pods were stuck Running, all owner-stamped (owner=tuist), all started at the same second, ~2 hours old — while GitHub showed only ~1 actually-running job for that label. They pinned ~448 GiB across the two bare-metal nodes (98-99% memory), so the warm pool of the new default shape couldn’t schedule and jobs had no idle runner to claim.
The dispatch path stamps the owner label on a polling Pod the instant it wins a claim (stamp_owner_labels/3), before minting the JIT. That label is what the runner-pool reconciler reads to decide a Pod is busy — it only scales down idle (un-stamped) Pods, never owner-stamped ones.
The problem is an asymmetry: the claim can be released without the Pod ever running a job or reaching a terminal phase, and nothing removes the label or deletes the Pod:
- serve_claim/5 stamps the label, then the JIT mint / mark_running fails; release_safely/3 deletes the PG claim but not the label.
- the server is SIGTERM’d mid-dispatch by a deploy after the stamp but before mark_running; StaleClaimsWorker later reaps the orphaned claimed row but not the Pod. (This is what the incident’s deploy did, batch-leaking the whole warm pool at once.)
- the runner agent never registers; OrphanedRunnersWorker releases the running claim but not the Pod.
In every case the Pod is left Running with a stale owner label. The reconciler can’t scale it down (not idle), the recovery workers only touch the database, and the Pod never exits — it never received a JIT, so it never ran run.sh; it poll-loops forever. On Kata each Pod reserves its full RAM 1:1, so a handful starve a node and a single deploy can wedge the fleet.
Why this solution over the alternatives
The reconciler is blind to claim state, and the recovery workers are DB-only. Rather than patch each release site to also clean up its Pod, this enforces the invariant directly in one place:
Because Claims.attempt/4 INSERTs the claim before stamp_owner_labels/3 runs, an owner-stamped Pod must always have a live claim. A stamped Pod with no claim is therefore unambiguously a leak; there is no legitimate transient state.
So the worker lists owner-stamped Pods, diffs them against live_pod_names/0, and reaps the orphans. This catches every leak vector — including future ones — in a single, testable place, instead of relying on each of three (and counting) release paths to remember to clean up.
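The list-diff-reap step can be sketched as a pure function. This is a minimal illustration and not the worker's actual code: the module name OrphanSketch, the pod map shape (name, labels, created_at), and the use of creation time as the Pod's age are all assumptions made for the example. Only the label key and the 300-second grace floor come from this description.

```elixir
defmodule OrphanSketch do
  # Hypothetical sketch; not the real OrphanedStampedPodsWorker.
  # Label key and grace floor are taken from the PR description.
  @owner_label "tuist.dev/runner-pool-owner"
  @grace_seconds 300

  # pods: list of %{name: String.t(), labels: map(), created_at: DateTime.t()}
  #       (assumed shape, listed BEFORE the claims snapshot is taken)
  # live_pod_names: pod names that currently back a live claim
  # Returns the names of Pods that should be reaped.
  def orphans(pods, live_pod_names, now \\ DateTime.utc_now()) do
    live = MapSet.new(live_pod_names)

    pods
    # Only owner-stamped Pods are in scope; idle Pods belong to the reconciler.
    |> Enum.filter(&Map.has_key?(&1.labels, @owner_label))
    # Skip Pods younger than the grace floor.
    |> Enum.filter(&(DateTime.diff(now, &1.created_at, :second) >= @grace_seconds))
    # A stamped, aged Pod with no live claim is treated as a leak.
    |> Enum.reject(&MapSet.member?(live, &1.name))
    |> Enum.map(& &1.name)
  end
end
```

Keeping the decision in one pure function is what makes the mixed-batch, grace-window, and empty-list cases cheap to test without a cluster.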
Design details:
- Pod + SA together. The reconciler only reaps the sibling SA when it sees a terminal Pod with no deletion in flight; a Pod deleted via the API skips that branch and would orphan its SA. delete_runner/3 mirrors the controller’s reapRunner and deletes both (they share one name). This is also why the server SA needs delete on both resources.
- Race-safe. Pods are listed before claims, so a Pod claimed after the snapshot is never considered; a @grace_seconds (300s) floor on Pod age is belt-and-suspenders against clock skew and keeps brand-new Pods out of scope. A just-completed Pod that slips through is harmless to delete (it is exiting anyway; the delete is idempotent on 404).
- Reaped, not un-stamped. Un-stamping a hung zombie would make it look like an available warm Pod the reconciler won’t replace, so the dead Pod would keep blocking the queue. Deleting it lets the reconciler boot a fresh, live replacement.
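The "Pod + SA together" and "idempotent on 404" behaviour can be illustrated with a small sketch. It is not the real Tuist.Kubernetes.Client.delete_runner/3: the module name, the argument list, and the injected delete_fun (a stand-in for the actual Kubernetes API call, assumed to return {:ok, http_status} or {:error, reason}) are hypothetical.

```elixir
defmodule DeleteRunnerSketch do
  # Hypothetical sketch; delete_fun stands in for the real Kubernetes API call.
  # Assumed contract: delete_fun.(kind, namespace, name) returns
  # {:ok, http_status} or {:error, reason}.

  # Deletes the Pod, then the ServiceAccount that shares its name.
  # Stops at the first real failure and returns it.
  def delete_runner(delete_fun, namespace, name) do
    with :ok <- delete_one(delete_fun, :pod, namespace, name),
         :ok <- delete_one(delete_fun, :serviceaccount, namespace, name) do
      :ok
    end
  end

  defp delete_one(delete_fun, kind, namespace, name) do
    case delete_fun.(kind, namespace, name) do
      # Any 2xx means the delete was accepted.
      {:ok, status} when status in 200..299 -> :ok
      # Already gone: treat as success so retries are idempotent.
      {:ok, 404} -> :ok
      # Anything else (for example a 403 from missing RBAC) is surfaced.
      {:ok, status} -> {:error, {kind, status}}
      {:error, reason} -> {:error, {kind, reason}}
    end
  end
end
```

In this sketch a 403 on the Pod delete comes back as {:error, {:pod, 403}}, which is the failure the worker would hit on every reap if the Role lacked the delete verb.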
Impact
Leaked runner Pods now self-heal within ~5 minutes instead of pinning node memory until manually reaped, so a deploy or a GitHub/ClickHouse blip during dispatch can no longer accumulate into a fleet wedge. No customer-facing API change.
Validation
Run locally against the worktree (mix compile, mix format --check-formatted, mix credo all clean):
- OrphanedStampedPodsWorkerTest: reaps a stamped/claimless/aged Pod; leaves a stamped Pod with a live claim; leaves a fresh Pod inside the grace window; reaps only the orphans in a mixed batch; no-op on empty; skips the tick on a list failure.
- ClaimsTest: live_pod_names/0 returns both claimed and running pod names, empty when there are no claims.
- ClientTest: delete_runner/3 deletes Pod + SA, is idempotent on 404, surfaces non-404 errors.
- mix test for the three files: 39 tests, 0 failures.
- Helm: helm lint passes; rendering runners-namespace.yaml confirms the server pool-manager Role now carries serviceaccounts: [get, list, delete] and pods: [get, list, patch, delete].
Follow-up (not in this PR)
The reconciliation worker makes leaks self-healing, so immediate cleanup at the release sites is now optional. A faster-path follow-up could un-stamp the label in release_safely/3 (in-process mint failure, where the Pod is healthy and should rejoin the warm pool without a respawn), shrinking the leak window from ~5 min to ~0 for that vector.
🤖 Generated with Claude Code