Hive Hive
Sign in

fix(infra): hold kura PN address statically instead of renew-forever dhclient

GitHub issue · Closed

Metadata
Source
tuist/tuist #11476
Updated
Jun 24, 2026
Domains
Kura
Details

What

Replace the renew-forever dhclient -d pn0 unit on the runner-cache Elastic Metal nodes with one that discovers the PN address via DHCP once and then holds it as a static address.

Why

A production runner-cache node silently lost its pn0 Private Network address (172.16.0.2/22), blackholing every macOS runner’s module-cache request to the cache NodePort for 12+ hours. macOS CI builds hung at “Fetching remote binaries via module cache. Hold tight…” with roughly 5-minute timeouts per module.

Root cause

On-node diagnosis showed pn0 UP but carrying only a link-local IPv6 address. dhclient was alive and repeatedly sending DHCPREQUEST to the Scaleway PN DHCP server (169.254.169.254) with no DHCPACK ever returned: the lease had expired and could not be renewed. With the route to 172.16.0.2 falling through to the public default gateway, the node never answered ARP for its own PN IP, so runner SYNs to the cache NodePort were blackholed. The node kept reporting Ready on its public NIC, so the loss was invisible to the control plane.

The existing supervised unit (ExecStart=dhclient -d pn0, Restart=always) does not help this failure mode: the dhclient process stays alive while the server is silent, so Restart=always never fires, yet the lease still lapses and the address is dropped.

Fix

The PN IP is stable per attachment (IPAM-assigned), so there is no need to keep leasing it. The unit now:

  1. Brings up pn0, recreating the VLAN interface on every boot.
  2. On first boot, DHCPs once to discover the assigned address and caches it (retrying while IPAM lags).
  3. Holds that address statically (ip addr replace, no expiry) and re-asserts it.

This makes the PN address durable across reboots, lease churn, and a silent DHCP server.

Impact

  • New runner-cache EM nodes provisioned from this code are immune to the PN-DHCP-renewal failure that caused the outage.
  • Existing live nodes are not retroactively updated: Linux node bootstrap is one-shot at provision with no drift re-push, so the affected node was hand-patched live and needs either a re-provision or the unit installed to converge.

Validation

  • go test ./controllers/ passes, including the updated render test asserting the static-hold unit and the absence of a renew-forever dhclient.
  • The static-hold mechanism is confirmed working on the live affected node: the address was restored with ip addr replace, the cache NodePort then served /up in under 1ms, and public-request traffic recovered from zero.
  • Not yet validated on a fresh EM provision end-to-end. The DHCP-once discovery plus cache path should be exercised on a staging EM node before relying on it fleet-wide.

🤖 Generated with Claude Code

Comments

No GitHub comments yet.