
fix(server): stop OrphanedStampedPodsWorker from killing live runner pods

GitHub issue · Closed

Source: tuist/tuist #11133
Updated: Jun 24, 2026
Domains: Compute

What

Guard OrphanedStampedPodsWorker against force-deleting pods that are actively executing a job. Adds a runner_started?/1 check on the Pod’s runner container state and refuses to reap when the runner has reached started=true or has a state.terminated / lastState.terminated.

Why

The worker’s safety claim has been the docstring invariant “Claims.attempt/4 INSERTs the claim before stamp_owner_labels/3, so an owner-stamped Pod should always have a matching claim.” The invariant breaks whenever anything releases the PG claim after dispatch already delivered the JIT and the runner started executing a job. release_safely/3 is the documented trigger (mint blip, mark_running PG error, record_running_safe CH hiccup); any future code path that touches Claims.release/Claims.complete while the runner is mid-job has the same effect. The worker’s next 1-minute tick then sees stamp + no-claim + age > 5 min and force-deletes a live customer Pod. From GitHub’s side the job surfaces as “the self-hosted runner lost communication with the server.”

This was the unified mechanism behind the runner-disconnect class of incidents we’ve been chasing across multiple recent runs. Confirmed in production from Loki:

21:30:14Z runner registers, GitHub job starts on pod ...48379681
21:31:00Z Loki: pod=...48379681 "runners: reaped orphaned stamped pod"
21:35:06Z pod NotFound; GitHub still shows the job in_progress; later → "lost communication"

The controller had no termination record for the pod and no k8s Killing event was emitted — both consistent with a server-side force-delete bypassing the normal pod lifecycle, which is exactly what K8sClient.delete_runner/2 does. There is no other code path that can delete a runner Pod silently.

The guard

Add runner_started?/1 and AND its negation into orphaned?/3, so a Pod whose runner has started is never classified as orphaned:

defp runner_started?(pod) do
  pod
  |> get_in(["status", "containerStatuses"])
  # containerStatuses is absent very early in the Pod lifecycle
  |> List.wrap()
  |> Enum.find(&(&1["name"] == "runner"))
  |> case do
    # No runner container status yet: treat as not started
    nil ->
      false

    # Started now, or terminated in the current or previous state
    status ->
      Map.get(status, "started") == true or
        not is_nil(get_in(status, ["state", "terminated"])) or
        not is_nil(get_in(status, ["lastState", "terminated"]))
  end
end

In the Linux token-isolation shape the runner container is gated behind the poller init container, so a started runner is unambiguous proof that the poller staged a JIT and the runner is or was executing a customer job. The original wedge signature this worker was built for — poller poll-loops forever without ever claiming a JIT — still trips the reap because the runner container never leaves waiting. Pods with no container statuses yet (very early lifecycle) still reap after the existing 5-min grace window.

Deliberately conservative: a transient pod-status shape we don’t recognise leaves the Pod for one more reconcile cycle (the Pool reconciler reaps terminated Pods anyway) rather than risking another silent live-build delete. The class of failure this fixes is customer-visible and hard to reproduce; the worst case of the guard is a slightly delayed leak cleanup, which the Pool reconciler also handles.
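
For illustration, the post-fix reap decision can be sketched as a single boolean expression. This is a minimal sketch, not the worker's actual code: the module name, reap?/1, and the map keys are hypothetical, and the real orphaned?/3 takes arguments this PR does not show. Only the four conditions (owner stamp, no claim, age over the 5-minute grace, runner not started) come from the description above.

```elixir
defmodule ReapDecisionSketch do
  # 5-minute grace window mentioned in the PR description.
  @grace_seconds 5 * 60

  # Hypothetical input shape: precomputed facts about one Pod.
  # Reap only when every pre-existing condition holds AND the runner never started.
  def reap?(%{stamped?: stamped, claimed?: claimed, age_seconds: age, runner_started?: started}) do
    stamped and not claimed and age > @grace_seconds and not started
  end
end

# The incident shape: stamped, claim released mid-job, past grace, runner running.
incident = %{stamped?: true, claimed?: false, age_seconds: 360, runner_started?: true}

# The original wedge: same facts, but the runner container never left waiting.
wedge = %{incident | runner_started?: false}

IO.inspect(ReapDecisionSketch.reap?(incident), label: "incident")
IO.inspect(ReapDecisionSketch.reap?(wedge), label: "wedge")
```

Under these assumptions the incident shape is left alone (false) while the wedge is still reaped (true). Without the final `not started` term, both would have been reaped.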

Validation

Added test coverage:

  • runner container state.running + started=true → guard fires, no delete
  • runner container state.terminated (started flips back to false on termination) → guard fires
  • runner container state.waiting (the original wedge) → guard does NOT fire, reap happens
  • No containerStatuses at all (very early lifecycle) → reap (existing grace covers brand-new pods)
  • Mixed batch with all three shapes → only the wedged one is reaped

All 11 tests pass; format check clean.
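
The pod-status shapes listed above can be illustrated with hand-written fixtures. This is a minimal sketch rather than the PR's test file: the module name, the pod_with/1 helper, the fixture values (waiting reason, exit code), and the public visibility of runner_started?/1 are assumptions for illustration. The function body mirrors the guard in this PR.

```elixir
defmodule GuardSketch do
  # Public copy of the guard from this PR so it can be called from a script.
  def runner_started?(pod) do
    pod
    |> get_in(["status", "containerStatuses"])
    |> List.wrap()
    |> Enum.find(&(&1["name"] == "runner"))
    |> case do
      nil ->
        false

      status ->
        Map.get(status, "started") == true or
          not is_nil(get_in(status, ["state", "terminated"])) or
          not is_nil(get_in(status, ["lastState", "terminated"]))
    end
  end

  # Hypothetical helper: wraps a runner container status in a minimal Pod map.
  def pod_with(runner_status) do
    %{"status" => %{"containerStatuses" => [Map.put(runner_status, "name", "runner")]}}
  end
end

running = GuardSketch.pod_with(%{"started" => true, "state" => %{"running" => %{}}})

terminated =
  GuardSketch.pod_with(%{"started" => false, "state" => %{"terminated" => %{"exitCode" => 0}}})

waiting =
  GuardSketch.pod_with(%{"started" => false, "state" => %{"waiting" => %{"reason" => "PodInitializing"}}})

# Very early lifecycle: no containerStatuses key at all.
no_statuses = %{"status" => %{}}

for {name, pod} <- [running: running, terminated: terminated, waiting: waiting, no_statuses: no_statuses] do
  IO.puts("#{name}: runner_started?=#{GuardSketch.runner_started?(pod)}")
end
```

With these fixtures the guard returns true for the running and terminated shapes and false for the waiting and no-status shapes, so only the latter two remain eligible for reaping once the other conditions hold.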

Followups

  • The companion question — which code path actually released the PG claim ~46 s into the job — is still open. The strong candidate is release_safely/3 from serve_claim/5’s with-else triggered by a tail-latency error after the JIT was already returned. Worth digging into separately. This PR closes the silent-delete vulnerability regardless of which trigger fires.
  • macOS Tart pods don’t have a separate runner container that’s gated on a poller; the single container is started=true from the moment dispatch-poll begins. The guard therefore makes the worker effectively a no-op for macOS pods. That is acceptable: the wedge this worker was built for is Linux-Kata-specific (the docstring notes “on Kata each Pod reserves its full RAM 1:1”), and we have not observed an analogous wedge on macOS.

🤖 Generated with Claude Code

Comments
tuist-atlas[bot] Jun 9, 2026

The fix to stop OrphanedStampedPodsWorker from killing live runner pods is now available in xcresult-processor-image@0.12.3. Update to this version to use it.

tuist-atlas[bot] Jun 9, 2026

The fix to stop OrphanedStampedPodsWorker from killing live runner pods is available in server@1.207.2. Update to this version to apply the fix.