fix(server): reap orphaned owner-stamped runner pods to prevent fleet wedge

GitHub issue · Closed

Metadata
Source: tuist/tuist #11060
Updated: Jun 24, 2026
Domains: Compute
Details

What changed

Adds Tuist.Runners.Workers.OrphanedStampedPodsWorker, a once-a-minute reconciliation worker that reaps runner Pods which carry an owner stamp (tuist.dev/runner-pool-owner) but have no live Postgres runner_claims row — “stamped in Kubernetes, unclaimed in the database.” Supporting pieces:

  • Tuist.Kubernetes.Client.delete_runner/3 — deletes a Pod and its same-named per-Pod ServiceAccount.
  • Tuist.Runners.Claims.live_pod_names/0 — the set of pod_names backing a live claim.
  • Registers the worker in @hosted_only_crons, and declares the new grace_seconds Logger metadata key.
  • Grants the server ServiceAccount delete on pods and serviceaccounts in the runners namespace (runners-namespace.yaml). The dispatch path only ever read/patched Pods, so delete was never granted; without it the worker would 403 on every reap in production and be inert.
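The delete semantics described above (Pod plus same-named ServiceAccount, 404 treated as success, other errors surfaced) can be sketched as follows. This is an illustrative Python model, not the Elixir implementation of Tuist.Kubernetes.Client.delete_runner/3. FakeCluster, ApiError, and the can_delete flag are hypothetical stand-ins invented for the example. The flag models whether the Role grants delete.

```python
from dataclasses import dataclass, field


class ApiError(Exception):
    """Hypothetical error carrying an HTTP-style status code."""

    def __init__(self, status: int):
        super().__init__(f"API returned {status}")
        self.status = status


@dataclass
class FakeCluster:
    """In-memory stand-in for one namespace; not the real Kubernetes API."""

    pods: set = field(default_factory=set)
    service_accounts: set = field(default_factory=set)
    can_delete: bool = True  # models whether the Role grants the delete verb

    def delete(self, kind: str, name: str) -> None:
        if not self.can_delete:
            raise ApiError(403)
        store = self.pods if kind == "pod" else self.service_accounts
        if name not in store:
            raise ApiError(404)
        store.remove(name)


def delete_runner(cluster: FakeCluster, name: str) -> None:
    """Delete the Pod and its same-named ServiceAccount; 404 counts as success."""
    for kind in ("pod", "serviceaccount"):
        try:
            cluster.delete(kind, name)
        except ApiError as err:
            if err.status != 404:  # already gone is fine; anything else surfaces
                raise


cluster = FakeCluster(pods={"runner-a"}, service_accounts={"runner-a"})
delete_runner(cluster, "runner-a")  # removes both resources
delete_runner(cluster, "runner-a")  # second call is a no-op (both return 404)
```

With can_delete set to False, every call raises the 403, which is the inert-worker failure mode the RBAC change avoids.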

Why it changed (root cause)

This is the fix for a production incident where Linux runner jobs queued for 5+ minutes. The Linux fleet was wedged: ~28 4vcpu-16gb Pods were stuck Running, all owner-stamped (owner=tuist), all started at the same second, ~2 hours old — while GitHub showed only ~1 actually-running job for that label. They pinned ~448 GiB across the two bare-metal nodes (98-99% memory), so the warm pool of the new default shape couldn’t schedule and jobs had no idle runner to claim.

The dispatch path stamps the owner label on a polling Pod the instant it wins a claim (stamp_owner_labels/3), before minting the JIT. That label is what the runner-pool reconciler reads to decide a Pod is busy — it only scales down idle (un-stamped) Pods, never owner-stamped ones.

The problem is an asymmetry: the claim can be released without the Pod ever running a job or reaching a terminal phase, and nothing removes the label or deletes the Pod:

  • serve_claim/5 stamps the label, then the JIT mint / mark_running fails; release_safely/3 deletes the PG claim but not the label.
  • the server is SIGTERM’d mid-dispatch by a deploy after the stamp but before mark_running; StaleClaimsWorker later reaps the orphaned claimed row but not the Pod. (This is what the incident’s deploy did, batch-leaking the whole warm pool at once.)
  • the runner agent never registers; OrphanedRunnersWorker releases the running claim but not the Pod.

In every case the Pod is left Running with a stale owner label. The reconciler can’t scale it down (not idle), the recovery workers only touch the database, and the Pod never exits — it never received a JIT, so it never ran run.sh; it poll-loops forever. On Kata each Pod reserves its full RAM 1:1, so a handful starve a node and a single deploy can wedge the fleet.
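A toy model of the asymmetry, in Python rather than the server's Elixir, may make the stuck state easier to see. The Pod class and the function names are hypothetical simplifications. Only the label key and the owner value come from the description above.

```python
from dataclasses import dataclass, field

OWNER_LABEL = "tuist.dev/runner-pool-owner"


@dataclass
class Pod:
    name: str
    phase: str = "Running"
    labels: dict = field(default_factory=dict)


def stamp_owner(pod: Pod, owner: str) -> None:
    """Dispatch marks the Pod busy as soon as it wins a claim."""
    pod.labels[OWNER_LABEL] = owner


def release_claim(claims: set, pod: Pod) -> None:
    """Models the database-only release paths: the row goes, the label stays."""
    claims.discard(pod.name)


def scale_down_candidates(pods: list) -> list:
    """The reconciler only ever considers idle (un-stamped) Pods."""
    return [p.name for p in pods if OWNER_LABEL not in p.labels]


claims = set()
leaky = Pod("runner-a")
idle = Pod("runner-b")

claims.add(leaky.name)        # claim row is inserted first
stamp_owner(leaky, "tuist")   # then the label is stamped
release_claim(claims, leaky)  # mint failure, deploy SIGTERM, or agent never registers

# runner-a is still Running and still stamped, with no claim behind it,
# and the reconciler will only ever offer runner-b for scale-down.
candidates = scale_down_candidates([leaky, idle])
```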

Why this solution over the alternatives

The reconciler is blind to claim state, and the recovery workers are DB-only. Rather than patch each release site to also clean up its Pod, this enforces the invariant directly in one place:

Because Claims.attempt/4 INSERTs the claim before stamp_owner_labels/3 runs, an owner-stamped Pod must always have a live claim. A stamped Pod with no claim is therefore unambiguously a leak — there is no legitimate transient state.

So the worker lists owner-stamped Pods, diffs them against live_pod_names/0, and reaps the orphans. This catches every leak vector — including future ones — in a single, testable place, instead of relying on each of three (and counting) release paths to remember to clean up.
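The invariant can be checked with a small Python sketch: if the claim is always written before the label, the diff is empty at every step of a healthy dispatch and becomes non-empty only after a leak. The orphans helper and the two sets are hypothetical stand-ins for the Kubernetes listing and live_pod_names/0.

```python
def orphans(stamped, live_claims):
    """Stamped Pod names with no live claim behind them."""
    return sorted(set(stamped) - set(live_claims))


stamped, claims = set(), set()
observed = []

# Healthy dispatch: claim first, stamp second.
claims.add("runner-a")
observed.append(orphans(stamped, claims))   # claim exists, no stamp yet
stamped.add("runner-a")
observed.append(orphans(stamped, claims))   # stamped and claimed

# Leak: the claim is released but the label is never removed.
claims.discard("runner-a")
observed.append(orphans(stamped, claims))   # stamped, unclaimed
```

Here observed ends up as [[], [], ["runner-a"]]: no intermediate state of a healthy dispatch is flagged, and the leaked Pod is.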

Design details:

  • Pod + SA together. The reconciler only reaps the sibling SA when it sees a terminal Pod with no deletion in flight; a Pod deleted via the API skips that branch and would orphan its SA. delete_runner/3 mirrors the controller’s reapRunner and deletes both (they share one name). This is also why the server SA needs delete on both resources.
  • Race-safe. Pods are listed before claims, so a Pod claimed after the snapshot is never considered; a @grace_seconds (300s) floor on Pod age is belt-and-suspenders against clock skew and keeps brand-new Pods out of scope. A just-completed Pod that slips through is harmless to delete (it is exiting anyway; the delete is idempotent on 404).
  • Reaped, not un-stamped. Un-stamping a hung zombie would make it look like an available warm Pod the reconciler won’t replace, so the dead Pod would keep blocking the queue. Deleting it lets the reconciler boot a fresh, live replacement.
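One reconciliation pass under these rules might look like the Python sketch below. It is not the OrphanedStampedPodsWorker source. The callables list_pods, live_pod_names, and delete_runner, the dict shape of a Pod, and the ListFailed exception are assumptions made for the example. The 300s floor and the Pods-before-claims ordering come from the design notes above.

```python
from datetime import datetime, timedelta, timezone

OWNER_LABEL = "tuist.dev/runner-pool-owner"
GRACE_SECONDS = 300


class ListFailed(Exception):
    """Hypothetical error raised when the Pod listing fails."""


def reap_tick(list_pods, live_pod_names, delete_runner, now):
    """Run one pass and return the names of the Pods that were reaped."""
    try:
        pods = list_pods()            # 1. snapshot Pods first
    except ListFailed:
        return []                     # skip the tick rather than act on partial data
    live = set(live_pod_names())      # 2. read claims afterwards
    reaped = []
    for pod in pods:
        if OWNER_LABEL not in pod["labels"]:
            continue                  # idle Pods belong to the reconciler
        age = (now - pod["created_at"]).total_seconds()
        if age < GRACE_SECONDS:
            continue                  # brand-new Pods stay out of scope
        if pod["name"] in live:
            continue                  # stamped and claimed: busy, leave it
        delete_runner(pod["name"])    # reap, do not un-stamp
        reaped.append(pod["name"])
    return reaped


now = datetime(2026, 1, 1, tzinfo=timezone.utc)


def make_pod(name, age_seconds, stamped):
    labels = {OWNER_LABEL: "tuist"} if stamped else {}
    return {"name": name, "created_at": now - timedelta(seconds=age_seconds), "labels": labels}


pods = [
    make_pod("old-orphan", 7200, True),
    make_pod("old-claimed", 7200, True),
    make_pod("fresh", 30, True),
    make_pod("idle", 7200, False),
]
deleted = []
result = reap_tick(lambda: pods, lambda: {"old-claimed"}, deleted.append, now)
```

Because the claims are read after the Pod snapshot, any Pod that wins a claim between the two reads shows up as live and is left alone.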

Impact

Leaked runner Pods now self-heal within ~5 minutes instead of pinning node memory until manually reaped, so a deploy or a GitHub/ClickHouse blip during dispatch can no longer accumulate into a fleet wedge. No customer-facing API change.

Validation

Run locally against the worktree, with mix compile, mix format --check-formatted, and mix credo all clean:

  • OrphanedStampedPodsWorkerTest: reaps a stamped/claimless/aged Pod; leaves a stamped Pod with a live claim; leaves a fresh Pod inside the grace window; reaps only the orphans in a mixed batch; no-op on empty; skips the tick on a list failure.
  • ClaimsTest: live_pod_names/0 returns both claimed and running pod names, and is empty when there are no claims.
  • ClientTest: delete_runner/3 deletes Pod + SA, is idempotent on 404, and surfaces non-404 errors.
  • mix test for the three files: 39 tests, 0 failures.
  • Helm: helm lint passes; rendering runners-namespace.yaml confirms the server pool-manager Role now carries serviceaccounts: [get, list, delete] and pods: [get, list, patch, delete].

Follow-up (not in this PR)

The reconciliation worker makes leaks self-healing, so immediate cleanup at the release sites is now optional. A faster-path follow-up could un-stamp the label in release_safely/3 (in-process mint failure, where the Pod is healthy and should rejoin the warm pool without a respawn), shrinking the leak window from ~5 min to ~0 for that vector.
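As a rough illustration of that follow-up, a release path that also clears the label could look like the Python sketch below. It is hypothetical: release_and_unstamp and the labels_by_pod mapping are invented for the example, and the real release_safely/3 would patch the Pod through the Kubernetes client instead.

```python
OWNER_LABEL = "tuist.dev/runner-pool-owner"


def release_and_unstamp(claims: set, labels_by_pod: dict, pod_name: str) -> None:
    """Drop the claim and the owner label together so the Pod reads as idle again."""
    claims.discard(pod_name)
    labels_by_pod.get(pod_name, {}).pop(OWNER_LABEL, None)


claims = {"runner-a"}
labels_by_pod = {"runner-a": {OWNER_LABEL: "tuist"}}
release_and_unstamp(claims, labels_by_pod, "runner-a")
# runner-a now has neither a claim nor a stamp, so it can rejoin the warm pool.
```

This only suits the in-process mint failure, where the Pod is known to be healthy. A hung Pod should still be reaped by the worker.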

🤖 Generated with Claude Code

Comments
tuist-atlas[bot] Jun 4, 2026

The fix for reaping orphaned owner-stamped runner pods is now available in server@1.205.1. Update to this version to prevent fleet wedges caused by leaked runner pods.

tuist-atlas[bot] Jun 5, 2026

The changes from this PR are now available in release xcresult-processor-image@0.11.0. Orphaned owner-stamped runner pods are now reaped to prevent fleet wedge issues.