fix(infra): kill persist pf tables before reload so retries succeed

Metadata

Source

tuist/tuist #10847

Updated

Jun 24, 2026

Domains

Compute

Details

Direct visibility from prod’s kubectl:

```
$ kubectl get scalewayapplesiliconmachine -A | grep mndbc
…hz6vv   true   Ready
…m789d   true   Ready
…m7r47   true   Ready
…sblfl   true   Ready
…xs6fx   true   Ready
…bzd26          Bootstrapping
…ffjpx          Bootstrapping
…hfqmh          Bootstrapping
…k6kwt          Bootstrapping
```

Five hosts recovered after the capi-scaleway image with the pf-flush fix deployed, but four (bzd26, ffjpx, hfqmh, k6kwt) are still stuck — effective warm capacity is 5/9, which is why concurrency on PR #10793 keeps capping out and jobs queue.

Reading bzd26’s machine status:

```
reason: BootstrapFailed
message: |
  pfctl: /etc/pf.anchors/tuist.runners:11: cannot define table vm_sources: Resource busy
  pfctl: /etc/pf.anchors/tuist.runners:12: cannot define table blocked_dst: Resource busy
  pfctl: Syntax error in config file: pf rules not loaded
```

Root cause

The flush we added in #10798 is running on these hosts, but it can’t actually destroy the tables. `pfctl -F all` on an anchor flushes its rules and table entries, but leaves `persist`-flagged tables alive in kernel state: `persist` tells the kernel to keep a table even when no rule refers to it. The anchor file declares both tables `persist` (so they survive runtime entry adds), so the next `pfctl -nf` validation tries to redeclare a name the kernel already owns → `Resource busy`.
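To confirm this diagnosis on a stuck host, one option is to list which of the two tables are still defined after the flush has run. The snippet below is a minimal sketch, not part of the PR: the `leftover_tables` helper and the `PFCTL` and `ANCHOR` variables are hypothetical, and the anchor name `tuist.runners` is an assumption inferred from the anchor file path.

```shell
#!/bin/sh
# Hypothetical diagnostic; names below are illustrative assumptions.
PFCTL="${PFCTL:-pfctl}"            # overridable so the helper can be exercised without pf
ANCHOR="${ANCHOR:-tuist.runners}"  # assumed anchor name, inferred from the file path

leftover_tables() {
    # `-s Tables` lists the tables currently defined in the anchor.
    # Keep only the two names from the error message; print nothing if neither exists.
    "$PFCTL" -a "$ANCHOR" -s Tables 2>/dev/null | grep -Ew 'vm_sources|blocked_dst' || true
}
```

If `leftover_tables` prints either name after a flush, the table has survived and the next load would be expected to hit the same `Resource busy` error.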

Fix

- Drop `persist` from the table declarations. The contents are static (192.168.64.0/22 sources, RFC1918+link-local destinations) and we never `pfctl -T add` at runtime, so there’s no reason to keep tables alive across rule flushes. Without `persist`, every reload cleanly recreates them.
- Add `pfctl -t <table> -T kill` for both tables before the flush. This handles existing hosts that have the persist-flagged versions baked into kernel state from a previous bootstrap pass. `-T kill` actually destroys the table; `-F all` doesn’t.
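The kill-then-reload order described above can be sketched as a small shell function. This is an illustration under stated assumptions rather than the operator's actual bootstrap script: the `reload_anchor` function and the `PFCTL`, `ANCHOR`, and `ANCHOR_FILE` variables are hypothetical, the anchor name is inferred from the file path, and scoping each call with `-a` is an assumption about how the anchor is loaded.

```shell
#!/bin/sh
# Hypothetical sketch of the reload sequence; not the operator's real script.
PFCTL="${PFCTL:-pfctl}"                                      # overridable for dry runs
ANCHOR="${ANCHOR:-tuist.runners}"                            # assumed anchor name
ANCHOR_FILE="${ANCHOR_FILE:-/etc/pf.anchors/tuist.runners}"  # path from the error message

reload_anchor() {
    # 1. Destroy leftover tables. A missing table is not an error here,
    #    so failures are ignored.
    for table in vm_sources blocked_dst; do
        "$PFCTL" -a "$ANCHOR" -t "$table" -T kill 2>/dev/null || true
    done
    # 2. Flush whatever else the anchor still holds.
    "$PFCTL" -a "$ANCHOR" -F all 2>/dev/null || true
    # 3. Validate the file without loading it; stop if validation fails.
    "$PFCTL" -a "$ANCHOR" -n -f "$ANCHOR_FILE" || return 1
    # 4. Load the rules, which recreates the tables.
    "$PFCTL" -a "$ANCHOR" -f "$ANCHOR_FILE"
}
```

Killing the tables first means the sequence behaves the same on hosts that still carry persist-flagged tables and on hosts bootstrapped fresh from the updated anchor file.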

After this lands and the operator image rebuilds, the four stuck hosts should clear their bootstrap loop on the next reconcile tick, restoring the full 9-replica warm pool.

How to test

- Rebuild and deploy the operator image via the next release.yml run.
- `kubectl get scalewayapplesiliconmachine -A | grep Bootstrapping` returns no rows after one reconcile cycle.
- Effective warm capacity = 9; macOS CI jobs stop queueing behind the 5-Pod cap.
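The second check above can be wrapped in a small script so it returns a pass/fail status. This is a minimal sketch: the `count_bootstrapping` and `warm_pool_ok` helpers and the `KUBECTL` variable are hypothetical names, and matching on the literal string `Bootstrapping` mirrors the `grep` from the PR rather than parsing the resource status.

```shell
#!/bin/sh
# Hypothetical verification helpers; names are illustrative assumptions.
KUBECTL="${KUBECTL:-kubectl}"   # overridable so the check can be tested offline

count_bootstrapping() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so `|| true` keeps that case from looking like a failure.
    "$KUBECTL" get scalewayapplesiliconmachine -A | grep -c Bootstrapping || true
}

warm_pool_ok() {
    # Succeeds only when no machine is still in the Bootstrapping phase.
    [ "$(count_bootstrapping)" -eq 0 ]
}
```

Running `warm_pool_ok && echo "no hosts stuck"` after one reconcile cycle gives a quick signal without reading the table by eye.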

🤖 Generated with Claude Code
