Hive
fix(infra): kill persist pf tables before reload so retries succeed
GitHub issue · Closed
Direct visibility from prod’s kubectl:
``` $ kubectl get scalewayapplesiliconmachine -A | grep mndbc …hz6vv true Ready …m789d true Ready …m7r47 true Ready …sblfl true Ready …xs6fx true Ready …bzd26 Bootstrapping …ffjpx Bootstrapping …hfqmh Bootstrapping …k6kwt Bootstrapping ```
Five hosts recovered after the capi-scaleway image with the pf-flush fix deployed, but four (bzd26, ffjpx, hfqmh, k6kwt) are still stuck — effective warm capacity is 5/9, which is why concurrency on PR #10793 keeps capping out and jobs queue.
Reading bzd26’s machine status:
``` reason: BootstrapFailed message: | pfctl: /etc/pf.anchors/tuist.runners:11: cannot define table vm_sources: Resource busy pfctl: /etc/pf.anchors/tuist.runners:12: cannot define table blocked_dst: Resource busy pfctl: Syntax error in config file: pf rules not loaded ```
Root cause
The flush we added in #10798 IS running on these hosts, but it can’t actually destroy the tables. `pfctl -F all` on an anchor flushes its rules and table entries, but leaves `persist`-flagged tables alive in kernel state. The anchor file declares both tables `persist` (so they survive runtime entry adds), so the next `pfctl -nf` validation tries to redeclare a name the kernel already owns → `Resource busy`.
Fix
-
Drop `persist` from the table declarations. The contents are static (192.168.64.0/22 sources, RFC1918+link-local destinations) — we never `pfctl -T add` at runtime, so there’s no reason to keep tables alive across rule flushes. Without `persist`, every reload cleanly recreates them.
-
Add `pfctl -t -T kill` for both tables before the flush. Handles existing hosts that have the persist-flagged versions baked into kernel state from a previous bootstrap pass. `-T kill` actually destroys the table; `-F all` doesn’t.
After this lands and the operator image rebuilds, the four stuck hosts should clear their bootstrap loop on the next reconcile tick, restoring the full 9-replica warm pool.
How to test
- Operator image rebuild + deploy via the next release.yml run.
- `kubectl get scalewayapplesiliconmachine -A | grep Bootstrapping` returns no rows after one reconcile cycle.
- Effective warm capacity = 9; CI on macOS jobs stops queueing behind 5-Pod cap.
🤖 Generated with Claude Code
No GitHub comments yet.