fix(infra): NAT macOS runner VM cache traffic onto the Private Network

Metadata

Source

tuist/tuist #11423

Updated

Jun 24, 2026

Domains

Compute Kura

Details

What

Fix the macOS runner host’s VM-egress NAT so Tart VM cache traffic is masqueraded onto the Private Network. tuist-pf-vmnat now detects the PN VLAN interface by finding the vlan* device that holds an inet address, instead of route -n get <PN-network-base>.

Root cause

macOS runners reach the per-account Kura cache over a PN NodePort (172.16.0.0/22:30000-32767). The Tart VM (192.168.64.0/22) egress is supposed to be source-NAT’d to the host’s PN address so the kura node can reply and the NodePort NetworkPolicy (which admits only the PN CIDR) lets it in.

tuist-pf-vmnat derived the NAT interface with:

PNIF=$(route -n get "${PNCIDR%%/*}" | awk '/interface/{print $2}')   # route -n get 172.16.0.0
case "$PNIF" in vlan*) ... ;; esac

${PNCIDR%%/*} is the network base address (172.16.0.0). macOS resolves a network base address to the parent physical NIC (en0), not the VLAN that owns the subnet. So PNIF=en0, the vlan* case gate never matched, the PN NAT rule was never added, RULES stayed empty, and the script exited without loading anything into com.apple/tuist.vmnat.
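The two pieces of shell behavior involved here can be reproduced in isolation. The sketch below is illustrative only: is_vlan_if is a hypothetical helper, not a function from the script, and the interface names are hard-coded rather than taken from route(8).

```shell
#!/bin/sh
# Illustrative only: mirrors the old gate's logic with fixed inputs.
PNCIDR="172.16.0.0/22"

# "%%/*" strips the longest suffix that starts with "/", leaving the
# network base address of the CIDR.
PN_BASE="${PNCIDR%%/*}"
echo "base=$PN_BASE"

# Hypothetical helper: succeeds only for interface names starting with "vlan".
is_vlan_if() {
  case "$1" in
    vlan*) return 0 ;;
    *)     return 1 ;;
  esac
}

# en0 is what the route lookup returned on the affected hosts, so the
# gate fell through and no rule was emitted.
for ifname in en0 vlan0; do
  if is_vlan_if "$ifname"; then
    echo "$ifname: gate matches, NAT rule would be added"
  else
    echo "$ifname: gate does not match, no rule"
  fi
done
```

The gate itself behaved correctly; the input it received (en0) was the problem.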

With no NAT, VM cache packets reached the kura node with their raw 192.168.64.2 source address. That’s fatal twice over: every Tart VM uses 192.168.64.2 (so the node can’t route a reply, and couldn’t disambiguate hosts if it tried), and the source isn’t in the PN CIDR the NetworkPolicy allows, so the packet is dropped at ingress. The TCP handshake never completed, the runner retransmitted SYNs, and builds hung indefinitely at Fetching remote binaries via module cache. Hold tight....

How it was diagnosed

On the production kura node, pn0 was up with 172.16.0.2, the NodePort responded locally (200 in 1 ms), the PN was bridged, and the NetworkPolicy and endpoints were correct. Yet Prometheus showed kura-tuist-scw-fr-par-0 with heavy internal traffic but zero kura_public_request_latency_seconds_count. tcpdump -ni pn0 'tcp port 30815' during a hung build showed SYNs arriving from source 192.168.64.2, retransmitting, with no SYN-ACK.

On a runner Mac host: pf was Enabled, ifconfig showed vlan0 with 172.16.0.13, but route -n get 172.16.0.0 returned en0 and the com.apple/tuist.vmnat anchor was empty. Hand-loading nat on vlan0 from 192.168.64.0/22 to 172.16.0.0/22 -> (vlan0) loaded cleanly, confirming the rule and interface are correct and only the detection was wrong.
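For reference, a hand-load along these lines might look like the sketch below. The exact commands used during diagnosis are not recorded here, so treat this as an assumption: vmnat_rule is a hypothetical helper, and the pfctl invocations are left as comments because they need root on a macOS host.

```shell
#!/bin/sh
# Sketch under assumptions: builds the rule text quoted above for a given
# interface. Variable and function names are illustrative.
VM_CIDR="192.168.64.0/22"
PN_CIDR="172.16.0.0/22"
ANCHOR="com.apple/tuist.vmnat"

# Print the NAT rule for interface $1. "(ifname)" makes pf track the
# interface's current address instead of freezing it at load time.
vmnat_rule() {
  printf 'nat on %s from %s to %s -> (%s)\n' "$1" "$VM_CIDR" "$PN_CIDR" "$1"
}

vmnat_rule vlan0

# On the runner host (root required), the rule could be loaded into the
# anchor from stdin and then listed to confirm it took effect:
#   vmnat_rule vlan0 | sudo pfctl -a "$ANCHOR" -f -
#   sudo pfctl -a "$ANCHOR" -s nat
```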

Why this fix

The bootstrap creates exactly one PN VLAN (networksetup -createVLAN pn en0), so the vlan* device holding an inet address is unambiguously the PN interface. This avoids the network-base route-get quirk entirely and re-converges via the existing StartInterval LaunchDaemon once DHCP lands an address.
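A minimal sketch of this detection approach, not the script's actual code: find_pn_if is a hypothetical helper that reads ifconfig-style output on stdin, so it can be exercised with canned input. It assumes the macOS layout where each interface block starts with an unindented "name:" line followed by indented attribute lines.

```shell
#!/bin/sh
# Hypothetical helper: print the first vlan* interface that carries an
# IPv4 ("inet") address, or nothing if there is none.
find_pn_if() {
  awk '
    # Unindented line: start of a new interface block, e.g. "vlan0: flags=..."
    /^[^ \t]/ { iface = $1; sub(/:$/, "", iface) }
    # Indented "inet " line (the trailing space excludes inet6).
    /^[ \t]+inet / && iface ~ /^vlan/ { print iface; exit }
  '
}

# Canned sample resembling the affected host: en0 and vlan0 both have
# addresses, but only the vlan* device should be selected.
sample_ifconfig() {
  printf 'en0: flags=8863<UP,BROADCAST,RUNNING> mtu 1500\n'
  printf '\tinet 10.0.0.5 netmask 0xffffff00 broadcast 10.0.0.255\n'
  printf 'vlan0: flags=8843<UP,BROADCAST,RUNNING> mtu 1500\n'
  printf '\tinet6 fe80::1%%vlan0 prefixlen 64 scopeid 0x10\n'
  printf '\tinet 172.16.0.13 netmask 0xfffffc00 broadcast 172.16.3.255\n'
}

PNIF=$(sample_ifconfig | find_pn_if)
echo "PNIF=${PNIF:-<none>}"

# On a real host the input would come from ifconfig instead:
#   PNIF=$(ifconfig | find_pn_if)
```

If the VLAN exists but DHCP has not yet assigned an address, the helper prints nothing, which matches the no-op-then-retry behavior described above.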

Validation

go build ./..., go vet ./..., gofmt clean (no existing unit test covers this inline SSH-delivered script).
sh -n on the rendered branch; on a non-VLAN host it correctly no-ops.
Live on a production runner host: hand-loaded the equivalent rule into com.apple/tuist.vmnat on vlan0 and pf accepted it, restoring the masquerade.

Rollout

The bootstrap re-runs on every reconcile, so once the provider image ships, the StartInterval daemon loads the corrected rule fleet-wide within a minute (no reboot/reprovision needed). Hand-loading the rule on a host is a safe interim unblock.