Hive
fix(server): retry the runner owner-label stamp so it stays reliable
GitHub issue · Closed
What & why
Follow-up to #11038, addressing a review finding on it.
#11038 made the tuist.dev/runner-pool-owner label load-bearing for network correctness: the runners-namespace runners-dispatch-egress NetworkPolicy selects idle Pods only (owner-label DoesNotExist), so a claimed Pod drops out of the churning dispatch policy and a server rollout can’t perturb its egress mid-job.
But that label has only ever been stamped best-effort. stamp_owner_labels/3 logged a patch failure and returned :ok, with a comment that the labels were “operational visibility only” — true before #11038, not anymore. A Pod whose label patch fails now stays selected by the idle dispatch policy and is re-exposed to the exact egress perturbation #11038 set out to prevent.
Change
Retry the label patch (@owner_label_stamp_attempts = 3, 100ms backoff) to ride out a transient apiserver blip before giving up. The give-up path stays non-fatal on purpose:
- The claim is already won at this point; failing dispatch would strand the job.
- A sustained apiserver outage would otherwise block all dispatch. Degrading to “runs without the label” (and only then exposed to the rollout race) is strictly better than dropping every job.
Per-account cap accounting reads from Postgres, not these labels, so the non-fatal fallback doesn’t affect concurrency-limit correctness.
Validation
mix test test/tuist/runners_test.exs— 10 tests, 0 failures. Two new cases through the publicdispatch_for_sa/2path: retry-then-succeed (transient failures) and give-up-but-still-dispatch (bounded at the retry budget, verified viaverify_on_exit!call count).mix compile --warnings-as-errorsclean;mix format --check-formattedclean.
Not addressed here (from the same review, deliberately)
- P1 (dispatch egress lost before the JIT lands): the owner label is stamped before the JIT is returned, so there’s a theoretical window where a claiming Pod loses dispatch egress before receiving its 200. Low severity: the policy change is async (k8s watch then Cilium) while the 200 returns in ms over an already-established connection, and the dispatch POST has a 10s curl timeout + TCP retransmit to ride a brief BPF reload. It can’t be closed by label timing anyway (the label is always set before the response flushes). Left as an accepted race; folds into the same in-cluster verification as #11038’s Cilium question.
- Controller rollout ordering: the review flagged that the drain finalizer only protects pools after the new controller reconciles them. Largely resolved out-of-band:
runners-controller@0.6.1(the drain code) auto-released after #11038 merged and that deploy didn’t change pool topology. The “deploy controller, confirm finalizers, then rename pools” sequencing remains the right discipline for future pool-topology changes.
Opening as a draft for review.