fix(infra): NAT macOS runner VM cache traffic onto the Private Network
GitHub issue · Closed
What
Fix the macOS runner host’s VM-egress NAT so that Tart VM cache traffic is masqueraded onto the Private Network (PN). tuist-pf-vmnat now detects the PN VLAN interface by finding the vlan* device that holds an inet address, instead of asking route -n get <PN-network-base> for it.
Root cause
macOS runners reach the per-account Kura cache over a PN NodePort (172.16.0.0/22:30000-32767). The Tart VM (192.168.64.0/22) egress is supposed to be source-NAT’d to the host’s PN address so the kura node can reply and the NodePort NetworkPolicy (which admits only the PN CIDR) lets it in.
tuist-pf-vmnat derived the NAT interface with:
PNIF=$(route -n get "${PNCIDR%%/*}" | awk '/interface/{print $2}') # route -n get 172.16.0.0
case "$PNIF" in vlan*) ... ;; esac
${PNCIDR%%/*} is the network base address (172.16.0.0). macOS resolves a network base address to the parent physical NIC (en0), not the vlan that owns the subnet. So PNIF=en0, the case vlan* gate never matched, the PN NAT rule was never added, RULES stayed empty, and the script exited loading nothing into com.apple/tuist.vmnat.
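The failure mode can be reproduced without touching pf. The sketch below is not the script's actual body (the branch inside the case is elided above); it assumes the rule text matches the one hand-loaded during diagnosis, and the function names pn_base and pn_nat_rules are hypothetical.

```shell
#!/bin/sh
PNCIDR="172.16.0.0/22"    # Private Network CIDR, from this issue
VMCIDR="192.168.64.0/22"  # Tart VM subnet, from this issue

# ${PNCIDR%%/*} strips the prefix length, leaving the network base address.
pn_base() {
  printf '%s\n' "${PNCIDR%%/*}"
}

# Print a NAT rule only when the interface name matches vlan*.
# Any other name (for example en0) falls through and prints nothing.
pn_nat_rules() {
  case "$1" in
    vlan*) printf 'nat on %s from %s to %s -> (%s)\n' "$1" "$VMCIDR" "$PNCIDR" "$1" ;;
  esac
}
```

Calling pn_nat_rules en0 prints nothing, which is the empty RULES case described above; pn_nat_rules vlan0 prints the expected masquerade rule.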
With no NAT, VM cache packets reached the kura node with their raw 192.168.64.2 source. That’s fatal twice over: every Tart VM is 192.168.64.2 (so the node can’t route a reply, and couldn’t disambiguate hosts if it tried), and the source isn’t in the PN CIDR the NetworkPolicy allows, so it’s dropped at ingress. The TCP handshake never completed, the runner retransmitted SYNs, and builds hung indefinitely at Fetching remote binaries via module cache. Hold tight....
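To make the ingress drop concrete, here is a small CIDR membership check in POSIX sh. It only illustrates the arithmetic; the real decision is made by the NetworkPolicy, and the helper names ip_to_int and in_cidr are hypothetical.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_cidr IP CIDR: exit status 0 when IP falls inside CIDR, 1 otherwise.
in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%%/*}")
  bits=${2##*/}
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}
```

With these helpers, in_cidr 192.168.64.2 172.16.0.0/22 fails (the raw VM source), while in_cidr 172.16.0.13 172.16.0.0/22 succeeds (the host's PN address after NAT).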
How it was diagnosed
On the production kura node, pn0 was up with 172.16.0.2, the node-port served locally (200 in 1ms), the PN bridged, and the NetworkPolicy/endpoints were correct — yet Prometheus showed kura-tuist-scw-fr-par-0 with heavy internal traffic but zero kura_public_request_latency_seconds_count. tcpdump -ni pn0 'tcp port 30815' during a hung build showed SYNs arriving from source 192.168.64.2, retransmitting, with no SYN-ACK.
On a runner Mac host: pf was Enabled, ifconfig showed vlan0 with 172.16.0.13, but route -n get 172.16.0.0 returned en0 and the com.apple/tuist.vmnat anchor was empty. Hand-loading nat on vlan0 from 192.168.64.0/22 to 172.16.0.0/22 -> (vlan0) loaded cleanly, confirming the rule and interface are correct and only the detection was wrong.
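A hand-load along those lines might look like the sketch below. The exact commands used during the incident are not recorded here, so treat this as an assumption: load_vmnat is a hypothetical helper, it defaults to a dry run that only prints what it would execute, and the real path needs root on the runner host.

```shell
#!/bin/sh
ANCHOR="com.apple/tuist.vmnat"  # pf anchor named in this issue

# load_vmnat RULE
# With DRY_RUN=1 (the default) print the commands instead of running them.
# With DRY_RUN=0 load RULE into the anchor from stdin, then list its NAT rules.
load_vmnat() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf "echo '%s' | pfctl -a %s -f -\n" "$1" "$ANCHOR"
    printf 'pfctl -a %s -s nat\n' "$ANCHOR"
  else
    printf '%s\n' "$1" | pfctl -a "$ANCHOR" -f - && pfctl -a "$ANCHOR" -s nat
  fi
}
```

For example, load_vmnat 'nat on vlan0 from 192.168.64.0/22 to 172.16.0.0/22 -> (vlan0)' prints the two pfctl invocations; setting DRY_RUN=0 as root applies them.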
Why this fix
The bootstrap creates exactly one PN VLAN (networksetup -createVLAN pn en0), so the vlan* device holding an inet address is unambiguously the PN interface. This avoids the network-base route-get quirk entirely and re-converges via the existing StartInterval LaunchDaemon once DHCP lands an address.
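One way to express that detection is to scan ifconfig output for the first vlan* interface that carries an inet line. This is a minimal sketch, not necessarily how tuist-pf-vmnat implements it; detect_pn_if is a hypothetical name, and it reads ifconfig-style text on stdin so it can be exercised off-host.

```shell
#!/bin/sh
# Read ifconfig-style output on stdin and print the first vlan* interface
# that has an IPv4 (inet) address. Prints nothing when there is none.
detect_pn_if() {
  awk '
    # Interface header lines start in column 0, e.g. "vlan0: flags=...".
    /^[a-z0-9]+:/ { ifname = $1; sub(/:$/, "", ifname) }
    # Detail lines are indented; "inet6" does not equal "inet", so IPv6 is ignored.
    $1 == "inet" && ifname ~ /^vlan/ { print ifname; exit }
  '
}
```

On a runner host this would be used as PNIF=$(ifconfig | detect_pn_if). An empty result (no VLAN, or DHCP has not assigned an address yet) leaves nothing to load, and the next StartInterval run retries.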
Validation
- go build ./..., go vet ./..., gofmt clean (no existing unit test covers this inline SSH-delivered script).
- sh -n on the rendered branch; on a non-VLAN host it correctly no-ops.
- Live on a production runner host: hand-loaded the equivalent rule into com.apple/tuist.vmnat on vlan0 and pf accepted it, restoring the masquerade.
Rollout
The bootstrap re-runs on every reconcile, so once the provider image ships, the StartInterval daemon loads the corrected rule fleet-wide within a minute (no reboot/reprovision needed). Hand-loading the rule on a host is a safe interim unblock.
Related
This is the actual incident fix. #11418 (persist the Elastic Metal node’s pn0 across reboots) is a separate latent bug found during the same investigation — real, but it was not the cause here (that node’s pn0 was up the whole time).
🤖 Generated with Claude Code