fix(infra, capi-scaleway): unstick runners-fleet (pf tables + manual-cleanup race + prep-script simplification)

GitHub issue · Closed

Metadata

Source: tuist/tuist #10858
Updated: Jun 24, 2026
Domains: Compute

Summary

Three related fixes uncovered while diagnosing a stuck production runners-fleet on 2026-05-19.

1. installVMEgressFirewall couldn’t recover hosts with persist’d pf tables

Three CRs (bzd26, ffjpx, k6kwt) all bound to the same host had been failing bootstrap for 5 days with:

/etc/pf.anchors/tuist.runners:11: cannot define table vm_sources: Resource busy
/etc/pf.anchors/tuist.runners:12: cannot define table blocked_dst: Resource busy
pfctl: Syntax error in config file: pf rules not loaded

pfctl(8) defines -T kill as “Kill a table” — i.e. destroy it from kernel state — and that’s the only command that removes a persist’d table. -F all / -F Tables explicitly preserve persist tables (they only flush addresses).

The original script’s -T kill invocation was likely failing because older versions of the script had, at various times, placed these tables in either anchor scope or the main ruleset. The cleanup targeted only one of those scopes, and the failure for the other was silently swallowed by 2>/dev/null || true. Fix: kill the tables at both scopes, then load an empty anchor as a belt-and-suspenders measure against stale anchor-level rule references.
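The two-scope cleanup described above can be sketched as follows. This is a minimal illustration, not the script shipped in the PR: the anchor name tuist.runners is assumed from the anchor file path, the run helper and DRY_RUN variable are hypothetical, and loading /dev/null is one assumed way to load an empty anchor.

```shell
#!/usr/bin/env bash
# Assumed names: the anchor is inferred from /etc/pf.anchors/tuist.runners;
# the table names come from the pfctl error output.
ANCHOR="${ANCHOR:-tuist.runners}"
TABLES="vm_sources blocked_dst"

# Hypothetical helper: prints the command in dry-run mode (the default),
# otherwise executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

reset_pf_tables() {
  for t in $TABLES; do
    # Main-ruleset scope. A failure here is expected when the table only
    # exists in the anchor, so report it instead of discarding it.
    run pfctl -t "$t" -T kill || echo "note: $t not killed in main ruleset" >&2
    # Anchor scope.
    run pfctl -a "$ANCHOR" -t "$t" -T kill || echo "note: $t not killed in anchor $ANCHOR" >&2
  done
  # Load an empty ruleset into the anchor to drop stale rule references.
  run pfctl -a "$ANCHOR" -f /dev/null
}
```

Calling reset_pf_tables with DRY_RUN left at its default prints the five pfctl commands; setting DRY_RUN=0 executes them and needs root on the host. Reporting each failed kill on stderr, rather than redirecting to /dev/null, keeps a wrong-scope miss visible.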

2. Operator-driven CR cleanup raced AdoptByPrefix

When recovering from a duplicate-claim state (or hand-rolling a CR off a shared host), the operator clears status.serverID so reconcileDelete skips Scaleway release, then deletes the CR. There’s an unavoidable race today: between the patch and the delete, the reconcile loop can see the empty serverID, run AdoptByPrefix against the pool, rename a tuist-pool-* host to the doomed CR’s claim name, and persist the bind. Then reconcileDelete tries to DeleteServer against the freshly-claimed pool host and gets stuck on the Apple 24h licensing floor.

Switch the pause gate to honor the standard CAPI cluster.x-k8s.io/paused annotation on the infra CR itself (in addition to Cluster.Spec.Paused). Operators set the annotation first, then patch, then delete; the annotation latches reconcileNormal off until DeletionTimestamp lands. reconcileDelete is intentionally above the pause gate so teardown still progresses on paused CRs.

Operator recipe captured in AGENTS.md.
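The AGENTS.md recipe is not reproduced here. The following is only a sketch of the annotate, patch, delete order described above. The resource kind is not named in this description, so it is a parameter. The run helper, DRY_RUN variable, annotation value, and the assumption that the CRD exposes a status subresource with a string serverID field are all illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the command in dry-run mode (the default),
# otherwise executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# cleanup_cr KIND NAME NAMESPACE
cleanup_cr() {
  kind="$1"; name="$2"; ns="$3"

  # 1. Set the pause annotation first so reconcileNormal (and with it
  #    AdoptByPrefix) stays off. The value "true" is an assumption.
  run kubectl -n "$ns" annotate "$kind" "$name" \
    cluster.x-k8s.io/paused=true --overwrite || return 1

  # 2. Clear status.serverID so reconcileDelete skips the Scaleway release.
  #    Assumes a status subresource and a string-typed serverID.
  run kubectl -n "$ns" patch "$kind" "$name" --subresource=status \
    --type=merge -p '{"status":{"serverID":""}}' || return 1

  # 3. Delete the CR; teardown still progresses on a paused CR.
  run kubectl -n "$ns" delete "$kind" "$name" --wait=false
}
```

Each step returns early on failure, so the patch never runs against a CR that is not yet paused.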

3. prepare-fleet-host dropped sshpass dependency

Discovered while recovering pool hosts whose CAPI bootstrap was failing on sudo: the script’s sshpass fallback path is the source of every operator pain point in this flow (PTY exhaustion, heredoc-quoting bugs, silent EXIT=5, zsh/bash read -rsp portability). It is also unnecessary: Scaleway Apple Silicon hosts auto-inject all project-level SSH keys into ~/.ssh/scw_authorized_keys at first boot, and sshd reads both authorized_keys and scw_authorized_keys. Once EnsureFleetSSHKey has registered the fleet pubkey at the project level, every newly ordered pool host accepts the fleet private key out of the box.

Drop the sshpass path. The script now uses SSH-key transport exclusively: it installs /etc/sudoers.d/m1-nopasswd (a one-time use of the password via sudo -S), then installs /etc/kcpassword and autoLoginUser via the now-passwordless sudo. If the SSH-key probe fails, it errors out with a clear recovery path (VNC plus a manual paste into scw_authorized_keys, or a host reinstall).
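A key-only probe of the kind described above might look like the sketch below. It is not the code in prepare-fleet-host: the function name, the SSH_BIN and SSH_USER overrides, and the m1 login user (inferred from the sudoers file name) are assumptions.

```shell
#!/usr/bin/env bash
# probe_fleet_key HOST KEY_PATH
# Returns 0 when HOST accepts the key without any interactive prompt,
# 1 otherwise. SSH_BIN and SSH_USER are hypothetical overrides.
probe_fleet_key() {
  host="$1"; key="$2"
  # BatchMode=yes disables password and passphrase prompts, so a host that
  # would only accept a password fails fast instead of hanging.
  if "${SSH_BIN:-ssh}" \
      -o BatchMode=yes \
      -o IdentitiesOnly=yes \
      -o ConnectTimeout=10 \
      -i "$key" "${SSH_USER:-m1}@${host}" true; then
    return 0
  fi
  cat >&2 <<EOF
error: $host rejected the fleet SSH key.
Recovery: connect over VNC and paste the fleet public key into
~/.ssh/scw_authorized_keys, or reinstall the host.
EOF
  return 1
}
```

With BatchMode there is no password fallback to manage, which is the point of dropping sshpass: the probe either succeeds with the key or stops with the recovery instructions.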

Test plan

  • go test ./controllers/... — adds TestReconcile_PausedAnnotationSkipsAdoption covering the new pause gate.
  • Script syntax validated (bash -n).
  • Pause-annotation cleanup recipe used in production on 2026-05-19 to recover 4 duplicate CRs; worked as designed.
  • Simplified prepare-fleet-host used in production on 2026-05-19 to recover h5dkz and njpkv (both successfully bootstrapped after the operator-side sudoers install).
  • (blocking draft) Build operator image with this change, deploy to production cluster, observe tuist-tuist-runners-fleet-mndbc-k6kwt flip from BootstrappedCondition=False to True (the host with stuck persist tables — proves fix #1 against the real failure).

Context

Full diagnosis snapshot in /tmp/capi-rescue/diagnosis.md (off-tree); cluster recovery walked from 3/9 Ready hosts to 8/9 over the course of the day, with k6kwt blocked on this PR being deployed.
