fix(infra): stop server deploys from killing in-flight runner jobs
GitHub issue · Closed
Invariant
A runner mid-job survives a server deploy. It doesn’t matter whose job it is — Tuist’s own deploy runner or a customer’s CI — they’re identical pods on the same Linux fleet. Today a production server deploy violates this, which is what makes deploys fail (“the self-hosted runner lost communication with the server”) and would take customer CI jobs down with them as the fleet ramps. This PR closes the two independent ways a deploy can kill an in-flight runner.
Vector 1 — egress NetworkPolicy coupled to the server rollout
The tuist-runners egress NetworkPolicy allowed the dispatch path with podSelector: app.kubernetes.io/component=server. The server Deployment rolls on every deploy (--set server.image.tag=sha-<commit>, even for a CSS-only change), so that rule’s resolved peer set changed on every deploy. Because the rule lived in the same policy that selects every tuist.dev/runner=true pod, Cilium regenerated the egress datapath of every runner pod, including mid-job ones, dropping the runner’s connection to GitHub.
Why it surfaced now and only on Linux: macOS Tart VMs route real egress outside CNI (gated by pfctl on the host), so this policy is decorative for them. Linux kata pods route GitHub egress through CNI, so it’s their actual gate. The Linux fleet inherited the macOS-era policy and it only became load-bearing once that fleet went live.
Fix: split the policy so the rules a mid-job runner depends on never change on a deploy.
- runners-default-deny (all runner pods): ingress-deny + DNS + public-internet egress. Constant selected-pod set and peer set.
- runners-dispatch-egress (new; idle pods only, via tuist.dev/runner-pool-owner DoesNotExist): the churning server-dispatch rule, now only able to regenerate idle pods (harmless: they retry the poll).
A claimed pod gets the owner label before it execs the runner, so a mid-job pod is selected only by the constant base policy. Side benefit: a running customer job can no longer reach the internal dispatch API.
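The selection logic of the split can be modelled in a few lines of Go. This is a minimal sketch, not the chart's actual templates: it assumes both policies also require tuist.dev/runner=true, and the label value "claim-1" and the helper names selectedBy and regeneratedOnServerRollout are hypothetical. It only shows which policies select a pod before and after it is claimed.

```go
package main

import "fmt"

// Labels is a pod's label set.
type Labels map[string]string

const (
	runnerLabel = "tuist.dev/runner"            // label key named in this PR
	ownerLabel  = "tuist.dev/runner-pool-owner" // label key named in this PR
)

// selectedBy returns the runner NetworkPolicies that select a pod.
// Assumption: both policies require tuist.dev/runner=true.
func selectedBy(pod Labels) []string {
	var policies []string
	if pod[runnerLabel] != "true" {
		return policies
	}
	// Base policy: constant selected-pod set and constant peer set.
	policies = append(policies, "runners-default-deny")
	// Dispatch policy: only pods without the owner label (DoesNotExist).
	if _, claimed := pod[ownerLabel]; !claimed {
		policies = append(policies, "runners-dispatch-egress")
	}
	return policies
}

// regeneratedOnServerRollout reports whether a server rollout can change a
// policy that selects this pod. Only the dispatch policy has a churning peer set.
func regeneratedOnServerRollout(pod Labels) bool {
	for _, p := range selectedBy(pod) {
		if p == "runners-dispatch-egress" {
			return true
		}
	}
	return false
}

func main() {
	idle := Labels{runnerLabel: "true"}
	busy := Labels{runnerLabel: "true", ownerLabel: "claim-1"} // hypothetical value

	fmt.Println("idle:", selectedBy(idle), regeneratedOnServerRollout(idle))
	fmt.Println("busy:", selectedBy(busy), regeneratedOnServerRollout(busy))
}
```

In this model the claimed pod is selected only by runners-default-deny, so nothing a server rollout changes can touch its egress rules.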
Vector 2 — ownerReference GC when a pool CR is deleted/renamed
Runner pods carry an owner reference to their RunnerPool CR. A helm upgrade that drops or renames a pool (the shape-keyed pool migration did exactly this, and a rollback crossing that boundary would too) deletes the CR, and Kubernetes GC then cascade-deletes every pod the pool owns, busy or not. GC runs in the Kubernetes control plane, outside the runners-controller, so it bypasses the reconciler’s busy-pod guard entirely, and the NetworkPolicy split does nothing to stop it. That makes it a second, independent kill path.
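A toy model makes the kill path concrete. This is a deliberately simplified sketch of background cascading deletion, not the real garbage collector: it assumes a single owner per pod, and the type OwnedPod, the function collectOrphans, and the pool and pod names are hypothetical. It shows that the busy flag never enters the decision.

```go
package main

import "fmt"

// OwnedPod is a pod with a single owner reference (a simplification).
type OwnedPod struct {
	Name     string
	OwnerUID string
	Busy     bool // mid-job; deliberately ignored by the collector below
}

// collectOrphans models background cascading deletion: every pod whose
// owner no longer exists is deleted. Busy is never consulted.
func collectOrphans(liveOwners map[string]bool, pods []OwnedPod) []string {
	var deleted []string
	for _, p := range pods {
		if !liveOwners[p.OwnerUID] {
			deleted = append(deleted, p.Name)
		}
	}
	return deleted
}

func main() {
	pods := []OwnedPod{
		{Name: "runner-idle", OwnerUID: "pool-a", Busy: false},
		{Name: "runner-busy", OwnerUID: "pool-a", Busy: true},
	}

	// Pool CR still exists: nothing is collected.
	fmt.Println(collectOrphans(map[string]bool{"pool-a": true}, pods))

	// Pool CR deleted (dropped or renamed by a helm upgrade): both pods go.
	fmt.Println(collectOrphans(map[string]bool{}, pods))
}
```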
Fix: a tuist.dev/runner-pool-drain finalizer on the RunnerPool. On deletion the reconciler holds the CR Terminating (GC leaves owned pods alone while the owner still exists), reaps only idle pods, and waits for mid-job pods to finish their single-shot lifecycle before releasing the finalizer — at which point the CR and any remaining terminal pods/SAs GC normally. The autoscaler skips Terminating pools so it doesn’t fight the drain. This covers the helm (background) deletion path; a manual kubectl delete --cascade=foreground on a pool still bypasses the drain (documented).
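The drain decision described above can be sketched as a pure function. This is an illustrative model under stated assumptions, not the controller's actual reconcile code: the types RunnerPod and DrainStep, the functions planDrainStep and shouldAutoscale, and the pod names are hypothetical, and "claimed" stands in for a pod carrying the owner label.

```go
package main

import "fmt"

// RunnerPod is a simplified view of a pod owned by a Terminating pool.
type RunnerPod struct {
	Name     string
	Claimed  bool // has been handed a job
	Finished bool // reached a terminal phase
}

// DrainStep is what one reconcile pass would do while the finalizer is held.
type DrainStep struct {
	ReapIdle         []string // idle pods that are safe to delete now
	WaitingOn        []string // mid-job pods that must be left alone
	ReleaseFinalizer bool     // true once no mid-job pod remains
}

// planDrainStep reaps idle pods, waits on mid-job pods, and releases the
// finalizer only when nothing is still running.
func planDrainStep(pods []RunnerPod) DrainStep {
	var step DrainStep
	for _, p := range pods {
		switch {
		case p.Finished:
			// Terminal pods are left for GC once the CR is gone.
		case p.Claimed:
			step.WaitingOn = append(step.WaitingOn, p.Name)
		default:
			step.ReapIdle = append(step.ReapIdle, p.Name)
		}
	}
	step.ReleaseFinalizer = len(step.WaitingOn) == 0
	return step
}

// shouldAutoscale mirrors the rule that the autoscaler skips Terminating pools.
func shouldAutoscale(poolTerminating bool) bool {
	return !poolTerminating
}

func main() {
	pods := []RunnerPod{
		{Name: "runner-idle"},
		{Name: "runner-busy", Claimed: true},
		{Name: "runner-done", Claimed: true, Finished: true},
	}
	fmt.Printf("%+v\n", planDrainStep(pods))
	fmt.Println("autoscale while Terminating:", shouldAutoscale(true))
}
```

Releasing the finalizer in the same pass that reaps idle pods is harmless in this model, since any pod still present is collected with the CR anyway.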
Impact
- Production (and any cluster running the Linux fleet) can deploy the server without dropping in-flight runner jobs — Tuist’s deploys and customer CI alike.
- No change to idle-pod behavior or to the macOS fleet.
Validation
- helm template renders cleanly (exit 0) for values-managed-production and values-managed-staging; both runner NetworkPolicies render with the expected selectors.
- go test ./..., go vet ./..., and gofmt all pass for the runners-controller; new test TestReconcile_DrainsPoolOnDeleteWithoutKillingRunningPod covers the drain (idle reaped, mid-job pod held + survives, finalizer released once the runner exits).
- Vector 1’s final causal step (Cilium regenerating the egress program actually dropping the established GitHub connection) is the one piece not provable from the repo. Recommended in-cluster confirmation: start a long job on a Linux runner, run kubectl -n tuist rollout restart deploy/tuist-tuist-server, and confirm the runner no longer loses GitHub comms when the server pods cycle.
Note
This makes “the deploy job runs on a runner inside the cluster it mutates” a non-issue for correctness (the deploy runner is just another protected mid-job pod). Hosting the deployer on the fleet remains a bootstrap-recovery question (don’t make the tool that recovers the fleet depend on the fleet), which is an operational call tracked separately, not a correctness fix.