Hive
fix(infra): hold kura PN address statically instead of renew-forever dhclient
GitHub issue · Closed
What
Replace the renew-forever dhclient -d pn0 unit on the runner-cache Elastic Metal nodes with one that discovers the PN address via DHCP once and then holds it as a static address.
Why
A production runner-cache node silently lost its pn0 Private Network address (172.16.0.2/22), blackholing every macOS runner’s module-cache request to the cache NodePort for 12+ hours. macOS CI builds hung at “Fetching remote binaries via module cache. Hold tight…” with roughly 5-minute timeouts per module.
Root cause
On-node diagnosis showed pn0 UP but carrying only a link-local IPv6 address. dhclient was alive and repeatedly sending DHCPREQUEST to the Scaleway PN DHCP server (169.254.169.254) with no DHCPACK ever returned: the lease had expired and could not be renewed. With the route to 172.16.0.2 falling through to the public default gateway, the node never answered ARP for its own PN IP, so runner SYNs to the cache NodePort were blackholed. The node kept reporting Ready on its public NIC, so the loss was invisible to the control plane.
The existing supervised unit (ExecStart=dhclient -d pn0, Restart=always) does not help in this failure mode: the dhclient process stays alive while the server is silent, so Restart=always never fires, yet the lease still lapses and the address is dropped.
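The symptom described above (pn0 up, but with no IPv4 address) is easy to miss because the node still reports Ready on its public NIC. As an illustration only, a node-local check along the following lines could surface it. This is a minimal sketch using Go's standard net package; the helper name hasIPv4 and the exit codes are assumptions for the example, not part of this change.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// hasIPv4 reports whether any entry in cidrs (for example "172.16.0.2/22")
// parses as an IPv4 address. Entries that are not valid CIDRs are skipped.
// Hypothetical helper for this sketch.
func hasIPv4(cidrs []string) bool {
	for _, c := range cidrs {
		ip, _, err := net.ParseCIDR(c)
		if err != nil {
			continue
		}
		if ip.To4() != nil {
			return true
		}
	}
	return false
}

func main() {
	name := "pn0" // interface name from the issue; override with the first argument
	if len(os.Args) > 1 {
		name = os.Args[1]
	}
	iface, err := net.InterfaceByName(name)
	if err != nil {
		fmt.Fprintf(os.Stderr, "lookup %s: %v\n", name, err)
		os.Exit(2)
	}
	addrs, err := iface.Addrs()
	if err != nil {
		fmt.Fprintf(os.Stderr, "addrs %s: %v\n", name, err)
		os.Exit(2)
	}
	// On Linux each address stringifies as "ip/prefixlen".
	cidrs := make([]string, 0, len(addrs))
	for _, a := range addrs {
		cidrs = append(cidrs, a.String())
	}
	up := iface.Flags&net.FlagUp != 0
	// The outage state was: link up, only a link-local IPv6 address present.
	if !up || !hasIPv4(cidrs) {
		fmt.Printf("%s unhealthy: up=%v addrs=%v\n", name, up, cidrs)
		os.Exit(1)
	}
	fmt.Printf("%s ok: addrs=%v\n", name, cidrs)
}
```

A check like this only detects the condition; it does not repair it.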
Fix
The PN IP is stable per attachment (IPAM-assigned), so there is no need to keep leasing it. The unit now:
- Brings up pn0, recreating the VLAN interface on every boot.
- On first boot, DHCPs once to discover the assigned address and caches it (retrying while IPAM lags).
- Holds that address statically (ip addr replace, no expiry) and re-asserts it.
This makes the PN address durable across reboots, lease churn, and a silent DHCP server.
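The three steps above can be sketched roughly as follows. This is not the unit shipped by this change: the function names, the cache path, the retry and re-assert intervals, and the shell body are all assumptions chosen for illustration, and VLAN recreation is left out because the parent NIC and VLAN ID are not given here. The sketch renders a shell script from Go and includes two small predicates in the spirit of the render test mentioned under Validation.

```go
package main

import (
	"fmt"
	"strings"
)

// holdScript is a hypothetical template, not the real unit's script.
const holdScript = `#!/bin/sh
set -eu
IFACE=@IFACE@
CACHE=@CACHE@

# VLAN interface recreation omitted: parent NIC and VLAN ID are site-specific.
ip link set "$IFACE" up

if [ ! -s "$CACHE" ]; then
  # First boot: lease once, retrying while IPAM has not assigned an address yet.
  until dhclient -1 "$IFACE"; do sleep 5; done
  ip -o -4 addr show dev "$IFACE" | awk '{print $4; exit}' > "$CACHE"
  dhclient -r "$IFACE" || true  # stop the client; the lease is no longer needed
fi

ADDR=$(cat "$CACHE")
while :; do
  # replace adds the address if it is missing and is harmless if it is present.
  ip addr replace "$ADDR" dev "$IFACE"
  sleep 30
done
`

// renderHoldScript fills in the interface name and cache path (both assumed).
func renderHoldScript(iface, cachePath string) string {
	return strings.NewReplacer("@IFACE@", iface, "@CACHE@", cachePath).Replace(holdScript)
}

// holdsStatically reports whether the script re-asserts a cached address.
func holdsStatically(script string) bool {
	return strings.Contains(script, `ip addr replace "$ADDR"`)
}

// renewsForever reports whether the script runs a foreground, long-lived dhclient.
func renewsForever(script string) bool {
	return strings.Contains(script, "dhclient -d")
}

func main() {
	fmt.Print(renderHoldScript("pn0", "/var/lib/pn-hold/pn0.addr"))
}
```

The design point is that DHCP is used only for discovery. Once the address is cached, nothing in the loop depends on the DHCP server answering, so a silent server can no longer cause the address to lapse.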
Impact
- New runner-cache EM nodes provisioned from this code are immune to the PN-DHCP-renewal failure that caused the outage.
- Existing live nodes are not retroactively updated: Linux node bootstrap is one-shot at provision with no drift re-push. The affected node was hand-patched live and still needs either a re-provision or a manual install of the new unit to converge.
Validation
- go test ./controllers/ passes, including the updated render test asserting the static-hold unit and the absence of a renew-forever dhclient.
- The static-hold mechanism is confirmed working on the live affected node: the address was restored with ip addr replace, the cache NodePort then served /up in under 1ms, and public-request traffic recovered from zero.
- Not yet validated on a fresh EM provision end-to-end. The DHCP-once discovery plus cache path should be exercised on a staging EM node before relying on it fleet-wide.
🤖 Generated with Claude Code