fix(infra): persist Elastic Metal kura node PN VLAN across reboots

GitHub issue · Closed

Metadata

Source: tuist/tuist #11418
Updated: Jun 24, 2026
Domains: Kura

Details

What

Make the runner-cache Elastic Metal node’s Private Network VLAN interface (pn0) durable. This replaces the one-shot ip link + backgrounded dhclient -nw in the self-join with a supervised tuist-pn0.service systemd unit:

ExecStartPre re-creates pn0 on every boot (idempotent, same runtime NIC detection as before),
ExecStart runs dhclient -d pn0 in the foreground under Restart=always, so the lease is always held and renewed.

Only the Elastic Metal self-join (the SSH, non-indented bootstrap path) renders this; the Instance/cloud-init path always passes vlan == 0 and renders nothing PN-related. A new test (TestRenderLinuxBootstrapScript_PNVlanIsPersistentAndSupervised) locks in the unit, the supervised dhclient, the VLAN-id wiring, the absence of the old one-shot, and that the Instance path stays clean.
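
The properties the new test locks in could be checked along these lines. This is a minimal sketch, not the actual TestRenderLinuxBootstrapScript_PNVlanIsPersistentAndSupervised: checkPNRendering is a hypothetical helper, and the exact substrings it looks for (for example "vlan id N") are assumptions about how the script is rendered.

```go
package main

import (
	"fmt"
	"strings"
)

// checkPNRendering inspects a rendered bootstrap script and returns a list of
// problems. An empty result means the script matches the expectations above.
func checkPNRendering(script string, vlanID int) []string {
	var problems []string
	if vlanID == 0 {
		// Instance/cloud-init path: nothing PN-related may appear.
		for _, s := range []string{"pn0", "dhclient"} {
			if strings.Contains(script, s) {
				problems = append(problems, "instance path contains "+s)
			}
		}
		return problems
	}
	// Elastic Metal path: the unit, the supervised dhclient, and the VLAN id.
	required := []string{
		"tuist-pn0.service",
		"Restart=always",
		"dhclient -d pn0",
		fmt.Sprintf("vlan id %d", vlanID), // assumed rendering of the VLAN id
	}
	for _, s := range required {
		if !strings.Contains(script, s) {
			problems = append(problems, "missing "+s)
		}
	}
	// The old backgrounded one-shot must be gone.
	if strings.Contains(script, "dhclient -nw") {
		problems = append(problems, "old one-shot dhclient -nw still present")
	}
	return problems
}

func main() {
	// A deliberately stale script, to show what the check reports.
	old := "ip link add link eth0 name pn0 type vlan id 1234\ndhclient -nw pn0\n"
	for _, p := range checkPNRendering(old, 1234) {
		fmt.Println(p)
	}
}
```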

Why / root cause

The macOS runners reach the Paris Kura cache over the PN NodePort (172.16.0.2:30815). On 2026-06-22, macOS-runner CI builds hung indefinitely at the "Fetching remote binaries via module cache. Hold tight..." step.

The cache pod was healthy and fully meshed the whole time. Prometheus confirmed it: kura-tuist-scw-fr-par-0 served heavy internal traffic (/_internal/status, /_internal/replicate/artifact, bootstrap routes) over the cluster pod network, but zero public/runner-facing requests (kura_public_request_latency_seconds_count never incremented for it, while every other tuist pod served 1k-7k in the same window). So the runners simply could not reach it over the PN.

Root cause: the VLAN bring-up was non-persistent. vlanBringUp ran ip link add ... pn0 + ip link set up + dhclient -nw pn0 exactly once during the self-join, with no systemd/netplan persistence. A reboot drops pn0 entirely, and the backgrounded dhclient -nw is not supervised, so if it dies or the lease lapses the address is gone with nothing to renew it. Either way the node keeps reporting Ready (kubelet binds the public InternalIP, not the PN), and all k8s state stays correct-but-stale (tuist.dev/pn-ipv4 label, KuraInstance.status.nodeAddress, the NodePort, the NetworkPolicy that allows 172.16.0.0/22). The runner’s SYN to the dead NodePort is blackholed, and the CLI module-cache fetch has no timeout, so a PN-reachability loss becomes an indefinite build hang.

Why this approach

A systemd unit reuses the exact ip link + dhclient commands already proven to work at provision time, so it introduces no new netplan/networkd behavior on the Scaleway EM image, while fixing both failure modes (reboot loss via enabled unit + ExecStartPre; lease/process death via foreground dhclient -d under Restart=always). A netplan vlans: stanza would also work but changes the network-management stack on a stock image we can’t easily revalidate.

Impact

New Elastic Metal kura nodes get a PN address that survives reboots and lease churn.
No change to the Instance kind or the macOS fleet bootstrap.
The already-provisioned Paris node is not retroactively fixed by this PR, because it bootstrapped with the old code. To recover, it needs the live pn0 re-materialized (or a re-provision once this ships). This is tracked separately.

Validation

go build ./..., go vet ./controllers/, full controllers test suite green, gofmt clean.
New unit test asserts the rendered script.
NOT yet validated on a real host. This touches prod node bootstrap networking and must be exercised on a staging Elastic Metal node (provision, confirm pn0/172.16.0.2 comes up, reboot, confirm it returns) before merging. Draft until then.

Follow-ups (separate)

The Mac-host client leg (scalewayapplesiliconmachine_controller.go “materializes the VLAN interface + firewall pass + VM NAT”) uses the same one-shot pattern and has the same latent bug.
The CLI module-cache fetch should time out and degrade to a cache miss rather than hang indefinitely when the cache endpoint is unreachable.

🤖 Generated with Claude Code
