Hive Hive
Sign in

fix(infra): load pf anchor directly (macOS firewall install fails reloading whole pf.conf)

GitHub issue · Closed

Metadata
Source
tuist/tuist #11425
Updated
Jun 24, 2026
Domains
Compute
Details

What

installVMEgressFirewall activated the egress filter by re-running the whole pfctl -f /etc/pf.conf. On a live host that fails. Load the tuist.runners anchor directly instead: pfctl -a tuist.runners -f /etc/pf.anchors/tuist.runners.

Root cause (validated live on a runner host)

Re-running pfctl -f /etc/pf.conf on a live host (pf already enabled) collides with macOS’s system-managed main ruleset. pfctl warns it “could result in flushing of rules present in the main ruleset added by the system at startup” and aborts when the embedded load anchor re-defines the vm_sources/blocked_dst tables:

/etc/pf.anchors/tuist.runners:11: cannot define table vm_sources: Resource busy
pfctl: Syntax error in config file: pf rules not loaded

set -euo pipefail then aborts the install. The pfctl -nf validate makes it worse: load anchor runs even under -n, so the validate itself loads the tables the subsequent -f then can’t redefine. (My first commit reordered the table reset; that was a red herring, hence the second commit replacing the whole approach.)

Impact

On the drift-update path this retried to the cap and transitioned every runners-fleet machine to failureReason=TartKubeletUpdateExceededRetries (terminal), which gates the drift loop off entirely. The fleet’s host-config update path was dead, so no firewall/NAT change reached the hosts, including #11423 — which is why deploying #11423 alone didn’t fix the cache hang.

Fix

Load the anchor directly. An anchor-scoped load applies the rules atomically without touching the system main ruleset, so it succeeds at bootstrap AND on every reconcile. Boot persistence is unchanged: the pfctl-runners LaunchDaemon loads the whole config at boot, when pf is freshly disabled and there is nothing to collide with.

Validation

  • go build / vet / gofmt / package tests green.
  • Validated live on production runner hosts. The direct anchor load restored the filter + loaded the corrected NAT on all 9 runners-fleet hosts, where pfctl -f /etc/pf.conf fails. kura_public_request_latency_seconds_count for scw-fr-par went from 0 (all day) to 300+/25min — runners are reaching the private cache.

Related / follow-up

  • Stacks on #11423 (NAT interface detection) and #11418 (EM pn0 persistence).
  • The 9 hosts were hand-fixed to unblock CI now; for durability across reboots/re-provisions this must ship, then the terminal CRs be cleared so the drift re-pushes the now-succeeding install.
  • Make the drift key on a host-config hash, not just the tart-kubelet binary SHA, so firewall/script fixes propagate on deploy without a force-patch.

🤖 Generated with Claude Code

Comments

No GitHub comments yet.