fix(server): retry the runner owner-label stamp so it stays reliable

Metadata

Source

tuist/tuist #11041

Updated

Jun 24, 2026

Domains

Compute

Details

What & why

Follow-up to #11038, addressing a review finding on it.

#11038 made the tuist.dev/runner-pool-owner label load-bearing for network correctness: the runners-namespace runners-dispatch-egress NetworkPolicy selects idle Pods only (owner-label DoesNotExist), so a claimed Pod drops out of the churning dispatch policy and a server rollout can’t perturb its egress mid-job.

But that label has only ever been stamped best-effort. stamp_owner_labels/3 logged a patch failure and returned :ok, with a comment that the labels were “operational visibility only” — true before #11038, not anymore. A Pod whose label patch fails now stays selected by the idle dispatch policy and is re-exposed to the exact egress perturbation #11038 set out to prevent.

Change

Retry the label patch (@owner_label_stamp_attempts = 3, 100ms backoff) to ride out a transient apiserver blip before giving up. The give-up path stays non-fatal on purpose:

The claim is already won at this point; failing dispatch would strand the job.
A sustained apiserver outage would otherwise block all dispatch. Degrading to “runs without the label” (and only then exposed to the rollout race) is strictly better than dropping every job.

Per-account cap accounting reads from Postgres, not these labels, so the non-fatal fallback doesn’t affect concurrency-limit correctness.

Validation

mix test test/tuist/runners_test.exs — 10 tests, 0 failures. Two new cases through the public dispatch_for_sa/2 path: retry-then-succeed (transient failures) and give-up-but-still-dispatch (bounded at the retry budget, verified via verify_on_exit! call count).
mix compile --warnings-as-errors clean; mix format --check-formatted clean.

Not addressed here (from the same review, deliberately)

P1 (dispatch egress lost before the JIT lands): the owner label is stamped before the JIT is returned, so there’s a theoretical window where a claiming Pod loses dispatch egress before receiving its 200. Low severity: the policy change is async (k8s watch then Cilium) while the 200 returns in ms over an already-established connection, and the dispatch POST has a 10s curl timeout + TCP retransmit to ride a brief BPF reload. It can’t be closed by label timing anyway (the label is always set before the response flushes). Left as an accepted race; folds into the same in-cluster verification as #11038’s Cilium question.
Controller rollout ordering: the review flagged that the drain finalizer only protects pools after the new controller reconciles them. Largely resolved out-of-band: runners-controller@0.6.1 (the drain code) auto-released after #11038 merged and that deploy didn’t change pool topology. The “deploy controller, confirm finalizers, then rename pools” sequencing remains the right discipline for future pool-topology changes.

Opening as a draft for review.

Comments

TA

tuist-atlas[bot] Jun 4, 2026

The fix for runner owner-label stamp reliability (with retry logic) is now available in server@1.205.0. Update to this version to ensure claimed pods reliably drop out of the idle dispatch policy.

TA

tuist-atlas[bot] Jun 5, 2026

The changes from this PR are now available in release xcresult-processor-image@0.11.0. The runner owner-label stamp now retries to stay reliable.