Hive
fix(server): stop OrphanedStampedPodsWorker from killing live runner pods
GitHub issue · Closed
What
Guard OrphanedStampedPodsWorker against force-deleting pods that are actively executing a job. Adds a runner_started?/1 check on the Pod’s runner container state and refuses to reap when the runner has reached started=true or has a state.terminated / lastState.terminated.
Why
The worker’s safety claim has been the docstring invariant “Claims.attempt/4 INSERTs the claim before stamp_owner_labels/3, so an owner-stamped Pod should always have a matching claim.” The invariant breaks whenever anything releases the PG claim after dispatch already delivered the JIT and the runner started executing a job. release_safely/3 is the documented trigger (mint blip, mark_running PG error, record_running_safe CH hiccup); any future code path that touches Claims.release/Claims.complete while the runner is mid-job has the same effect. The worker’s next 1-minute tick then sees stamp + no-claim + age > 5 min and force-deletes a live customer Pod. From GitHub’s side the job surfaces as “the self-hosted runner lost communication with the server.”
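The failure mode can be illustrated with a minimal sketch of the pre-fix decision (stamp + no-claim + age > 5 min). Everything here is an assumption made for illustration, not the worker's actual code: the module name, the `orphaned_before_fix?/3` signature, the `example.invalid/owner` label key, and passing claims in as a `MapSet` of Pod names.

```elixir
defmodule OrphanSketch do
  # Hypothetical pre-fix predicate: owner-stamped, no matching claim, older than the grace window.
  @grace_seconds 5 * 60

  def orphaned_before_fix?(pod, claimed_pod_names, now) do
    stamped?(pod) and
      not MapSet.member?(claimed_pod_names, pod["metadata"]["name"]) and
      age_seconds(pod, now) > @grace_seconds
  end

  # "example.invalid/owner" is a placeholder; the real label key is not shown in this PR.
  defp stamped?(pod) do
    is_binary(get_in(pod, ["metadata", "labels", "example.invalid/owner"]))
  end

  defp age_seconds(pod, now) do
    {:ok, created, _offset} = DateTime.from_iso8601(pod["metadata"]["creationTimestamp"])
    DateTime.diff(now, created)
  end
end

# A Pod whose runner is mid-job, but whose claim was released after dispatch.
live_pod = %{
  "metadata" => %{
    "name" => "runner-pod-a",
    "labels" => %{"example.invalid/owner" => "server-1"},
    "creationTimestamp" => "2024-01-01T00:00:00Z"
  },
  "status" => %{"containerStatuses" => [%{"name" => "runner", "started" => true}]}
}

{:ok, now, 0} = DateTime.from_iso8601("2024-01-01T00:10:00Z")

# With the claim gone, nothing in this predicate looks at the runner container state.
reaped_without_claim = OrphanSketch.orphaned_before_fix?(live_pod, MapSet.new(), now)
reaped_with_claim = OrphanSketch.orphaned_before_fix?(live_pod, MapSet.new(["runner-pod-a"]), now)
```

In this sketch the claim is the only thing protecting the live Pod, so releasing it mid-job is enough to make the Pod look orphaned.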
This was the unified mechanism behind the runner-disconnect class of incidents we’ve been chasing across multiple recent runs. Confirmed in production from Loki:
21:30:14Z runner registers, GitHub job starts on pod ...48379681
21:31:00Z Loki: pod=...48379681 "runners: reaped orphaned stamped pod"
21:35:06Z pod NotFound; GitHub still shows the job in_progress; later → "lost communication"
The controller had no termination record for the pod and no k8s Killing event was emitted — both consistent with a server-side force-delete bypassing the normal pod lifecycle, which is exactly what K8sClient.delete_runner/2 does. There is no other code path that can delete a runner Pod silently.
The guard
Add runner_started?/1 and AND its negation into orphaned?/3, so a Pod counts as orphaned only if its runner has not started:
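defp runner_started?(pod) do
  pod
  |> get_in(["status", "containerStatuses"])
  # containerStatuses may be absent early in the Pod lifecycle; treat nil as [].
  |> List.wrap()
  |> Enum.find(&(&1["name"] == "runner"))
  |> case do
    # No runner status reported yet: the guard does not fire.
    nil ->
      false

    status ->
      # started flips back to false on termination, so also check both terminated records.
      Map.get(status, "started") == true or
        not is_nil(get_in(status, ["state", "terminated"])) or
        not is_nil(get_in(status, ["lastState", "terminated"]))
  end
end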
In the Linux token-isolation shape the runner container is gated behind the poller init container, so a started runner is unambiguous proof that the poller staged a JIT and the runner is or was executing a customer job. The original wedge signature this worker was built for — poller poll-loops forever without ever claiming a JIT — still trips the reap because the runner container never leaves waiting. Pods with no container statuses yet (very early lifecycle) still reap after the existing 5-min grace window.
Deliberately conservative: a transient pod-status shape we don’t recognise leaves the Pod for one more reconcile cycle (the Pool reconciler reaps terminated Pods anyway) rather than risking another silent live-build delete. The class of failure this fixes is customer-visible and hard to reproduce; the worst case of the guard is a slightly delayed leak cleanup, which the Pool reconciler also handles.
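As a rough sketch of how the guard composes with the existing checks, the example below reduces the claim lookup and the grace window to two booleans. The real arguments of orphaned?/3 are not shown in this PR, so the `GuardSketch` module, the `has_claim?` and `past_grace?` parameters, and the sample Pod maps are all hypothetical; only `runner_started?/1` is copied from above (made public so it can be called directly).

```elixir
defmodule GuardSketch do
  # Hypothetical signature: the claim lookup and age check are passed in as booleans.
  def orphaned?(pod, has_claim?, past_grace?) do
    not has_claim? and past_grace? and not runner_started?(pod)
  end

  # Copy of the guard from this PR.
  def runner_started?(pod) do
    pod
    |> get_in(["status", "containerStatuses"])
    |> List.wrap()
    |> Enum.find(&(&1["name"] == "runner"))
    |> case do
      nil ->
        false

      status ->
        Map.get(status, "started") == true or
          not is_nil(get_in(status, ["state", "terminated"])) or
          not is_nil(get_in(status, ["lastState", "terminated"]))
    end
  end
end

# The original wedge: the poller never stages a JIT, so the runner stays in waiting.
wedged = %{
  "status" => %{
    "containerStatuses" => [
      %{"name" => "runner", "started" => false, "state" => %{"waiting" => %{}}}
    ]
  }
}

# A live build: the runner has started, but the claim was released underneath it.
live = %{
  "status" => %{
    "containerStatuses" => [
      %{"name" => "runner", "started" => true, "state" => %{"running" => %{}}}
    ]
  }
}

wedged_reaped = GuardSketch.orphaned?(wedged, false, true)
live_reaped = GuardSketch.orphaned?(live, false, true)
```

Both Pods are stamped, unclaimed, and past the grace window here; only the runner container state separates the wedge from the live build.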
Validation
Added test coverage:
- runner container state.running + started=true → guard fires, no delete
- runner container state.terminated (started flips back to false on termination) → guard fires
- runner container state.waiting (the original wedge) → guard does NOT fire, reap happens
- No containerStatuses at all (very early lifecycle) → reap (existing grace covers brand-new pods)
- Mixed batch with all three shapes → only the wedged one is reaped
All 11 tests pass; format check clean.
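For readers who want to see the covered shapes concretely, here is a minimal sketch of the Pod status maps involved. These fixtures are illustrative assumptions rather than the PR's actual test code; the `ShapeSketch` module and the `runner` helper are hypothetical, and the guard is copied in so the snippet stands alone.

```elixir
defmodule ShapeSketch do
  # Copy of the guard from this PR, made public for the example.
  def runner_started?(pod) do
    pod
    |> get_in(["status", "containerStatuses"])
    |> List.wrap()
    |> Enum.find(&(&1["name"] == "runner"))
    |> case do
      nil ->
        false

      status ->
        Map.get(status, "started") == true or
          not is_nil(get_in(status, ["state", "terminated"])) or
          not is_nil(get_in(status, ["lastState", "terminated"]))
    end
  end
end

# Hypothetical helper: wraps one runner container status into a minimal Pod map.
runner = fn fields ->
  %{"status" => %{"containerStatuses" => [Map.put(fields, "name", "runner")]}}
end

running = runner.(%{"started" => true, "state" => %{"running" => %{}}})
terminated = runner.(%{"started" => false, "state" => %{"terminated" => %{"exitCode" => 0}}})
waiting = runner.(%{"started" => false, "state" => %{"waiting" => %{}}})
no_statuses = %{"status" => %{}}

# Mixed batch: only Pods the guard does not protect remain reapable.
reapable = Enum.reject([running, terminated, waiting], &ShapeSketch.runner_started?/1)
```

The grace window and claim lookup are left out on purpose, so `reapable` only reflects the guard's verdict on each shape.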
Followups
- The companion question (which code path actually released the PG claim ~46 s into the job) is still open. The strong candidate is release_safely/3 from serve_claim/5’s with-else, triggered by a tail-latency error after the JIT was already returned. Worth digging into separately. This PR closes the silent-delete vulnerability regardless of which trigger fires.
- macOS Tart pods don’t have a separate runner container that’s gated on a poller; the single container is started=true from the moment dispatch-poll begins. The guard therefore makes the worker effectively a no-op for macOS pods. That is acceptable: the wedge this worker was built for is Linux-Kata-specific (the docstring notes “on Kata each Pod reserves its full RAM 1:1”), and we have not observed an analogous wedge on macOS.
🤖 Generated with Claude Code