Durable follow-up to the 2026-06-14 tuist.dev 503 incident: make the server’s stable egress highly available so losing the gateway node fails over automatically instead of black-holing egress. Validated end-to-end on staging.
Background: what took the site down
All tuist-tuist-server pods crash-looped → nginx had no upstream → 503. The crash was Tuist.License.assert_valid!/1 hitting a Keygen timeout on boot — but Keygen was a red herring. The real cause: the stable-egress gateway had no node. Since #11150 all server egress is SNAT’d through a single Hetzner Floating IP (116.202.0.10) via CiliumEgressGatewayPolicy, routed through one node hand-labelled tuist.dev/stable-egress-gateway=server. A MachineHealthCheck remediated that node; neither the label nor the Floating IP migrated, so egress silently black-holed and the server crashed on its first outbound call.
What this PR does
infra/stable-egress-controller/ (new Go controller, controller-runtime)
Leader-elected (2 replicas) controller that keeps the Hetzner Floating IP (Cloud API) and the active tuist.dev/stable-egress-gateway label on one node. It adopts whatever node already holds the active label as long as it’s Ready — even one outside the candidate pool — so it never disturbs a working gateway (enabling it, or any steady state, moves nothing). Only when there’s no healthy active node does it fail over to a Ready md-egress candidate, moving IP + label together; the active label is stripped cluster-wide so a stale label can’t shadow the elected node. Reuses the in-cluster kube-system/hcloud token. Matches the repo’s controller idiom (go 1.25, controller-runtime v0.20, distroless); unit-tested.
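The adopt-or-fail-over rule above can be sketched as a pure function. This is a minimal illustration, not the controller's actual code: the `Node` and `Decision` types and the `elect` function are hypothetical names, and the tie-breaks (first Ready active node wins, first Ready candidate in list order is chosen, nothing is stripped when no target exists) are assumptions.

```go
package main

import "fmt"

// Node is a simplified, hypothetical view of a cluster node. A real
// controller would derive these fields from the Kubernetes Node object.
type Node struct {
	Name      string
	Ready     bool
	Active    bool // carries the tuist.dev/stable-egress-gateway label
	Candidate bool // carries the tuist.dev/stable-egress-candidate label
}

// Decision is the outcome of one reconcile pass.
type Decision struct {
	Target   string   // node that should hold the Floating IP and active label; "" if none
	Failover bool     // true when the IP and label have to move
	Strip    []string // nodes whose stale active label should be removed
}

// elect adopts a Ready node that already holds the active label, even if it
// is not a candidate. Only when no such node exists does it pick a Ready
// candidate. Assumption: with no viable target it changes nothing.
func elect(nodes []Node) Decision {
	var d Decision
	for _, n := range nodes {
		if n.Active && n.Ready {
			d.Target = n.Name
			break
		}
	}
	if d.Target == "" {
		for _, n := range nodes {
			if n.Candidate && n.Ready {
				d.Target = n.Name
				d.Failover = true
				break
			}
		}
	}
	if d.Target == "" {
		return d // nothing healthy to elect; leave labels untouched
	}
	// Strip the active label everywhere except the elected node so a stale
	// label cannot shadow it.
	for _, n := range nodes {
		if n.Active && n.Name != d.Target {
			d.Strip = append(d.Strip, n.Name)
		}
	}
	return d
}

func main() {
	nodes := []Node{
		{Name: "old-gateway", Ready: false, Active: true},
		{Name: "md-egress-1", Ready: true, Candidate: true},
		{Name: "md-egress-2", Ready: true, Candidate: true},
	}
	fmt.Printf("%+v\n", elect(nodes))
}
```

Keeping the decision free of API calls makes the sticky behaviour easy to unit-test: a steady-state input should always yield `Failover == false` and an empty `Strip` list.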
Topology + chart
- md-egress pool (replicas: 2) in prod + staging, self-labelling tuist.dev/stable-egress-candidate=server via a new ClusterClass workerNodeLabels variable/patch. The kubelet sets the label at registration, so it survives node replacement; CAPI metadata-label sync won’t carry a tuist.dev/* label.
- platform chart renders the controller (SA + RBAC + Deployment) gated on failoverController.enabled.
- host-configurer keeps its eth0-only role, following the active label.
Reserved egress IP set (stable customer allowlist)
Customers allowlist a fixed reserved set of egress IPs, not a single address, so the active IP can be migrated within the set without ever touching their allowlist. The controller takes an egressIpAllowlist and fails closed if the active Floating IP is outside it (a gap, never a leak from an un-allowlisted IP). The prod set is two Floating IPs in tuist-workloads — 116.202.0.10 (active) + 116.202.4.195 (spare); active/standby means only one is ever live, so two is enough to drop the single-IP dependency and allow migration. Customer-facing list in server/priv/docs/en/guides/server/network.md, in lockstep with the chart.
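A fail-closed allowlist guard of this kind might look like the sketch below. It is an illustration under assumptions, not the controller's implementation: `checkAllowlisted` and `parseEntry` are hypothetical names, and accepting both bare addresses and `/32` prefixes as entries is a guess about the `egressIpAllowlist` format.

```go
package main

import (
	"fmt"
	"net/netip"
)

// parseEntry accepts either a CIDR ("116.202.0.10/32") or a bare address
// ("116.202.0.10"), which is treated as a single-address prefix.
func parseEntry(s string) (netip.Prefix, error) {
	if p, err := netip.ParsePrefix(s); err == nil {
		return p, nil
	}
	a, err := netip.ParseAddr(s)
	if err != nil {
		return netip.Prefix{}, err
	}
	return netip.PrefixFrom(a, a.BitLen()), nil
}

// checkAllowlisted returns nil only when the active IP parses, every
// allowlist entry parses, and the active IP falls inside at least one entry.
// Every other outcome is an error, so the caller fails closed.
func checkAllowlisted(active string, allowlist []string) error {
	ip, err := netip.ParseAddr(active)
	if err != nil {
		return fmt.Errorf("active IP %q is invalid: %w", active, err)
	}
	found := false
	for _, entry := range allowlist {
		p, err := parseEntry(entry)
		if err != nil {
			return fmt.Errorf("allowlist entry %q is invalid: %w", entry, err)
		}
		if p.Contains(ip) {
			found = true
		}
	}
	if !found {
		return fmt.Errorf("active IP %s is outside the reserved set", ip)
	}
	return nil
}

func main() {
	reserved := []string{"116.202.0.10", "116.202.4.195"}
	fmt.Println(checkAllowlisted("116.202.0.10", reserved))
	fmt.Println(checkAllowlisted("203.0.113.7", reserved))
}
```

Treating a malformed entry as an error, rather than skipping it, keeps a typo in the chart values from silently shrinking the reserved set.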
Why a set of /32s, not a single CIDR: Hetzner Cloud has no range/subnet parameter for Floating IPs (verified — hcloud floating-ip create takes none); a contiguous range would need Hetzner Robot subnets (dedicated hardware, not Cloud) or BYOIP+BGP (own a /24 — the someday-if-sales-critical option), and an IPv6 /64 doesn’t help IPv4 allowlists. Full reasoning in infra/helm/platform/README.md.
Release wiring
Wired into the standard component release flow (components.json + release.yml + cliff.toml), like the other infra controllers, producing a semver-tagged image. The deployed tag is resolved at deploy time: k8s:install-platform passes (via --set) the highest stable-egress-controller@<semver> tag reachable from the commit, the same pattern as the fleet/runtime images. No tag is pinned in the chart values, and latest is never used in prod.
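The tag resolution can be pictured roughly as follows. This sketch is not the actual k8s:install-platform logic: `highest`, `parse`, and `less` are hypothetical helpers, the tag list is assumed to come from something like `git tag --merged <commit>`, and skipping pre-release or malformed tags is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const tagPrefix = "stable-egress-controller@"

// parse extracts [major, minor, patch] from tags shaped like
// tagPrefix+"1.2.3". Anything else, including pre-release suffixes,
// is rejected.
func parse(tag string) ([3]int, bool) {
	var v [3]int
	rest, ok := strings.CutPrefix(tag, tagPrefix)
	if !ok {
		return v, false
	}
	parts := strings.Split(rest, ".")
	if len(parts) != 3 {
		return v, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return v, false
		}
		v[i] = n
	}
	return v, true
}

// less compares versions numerically, field by field.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// highest returns the tag with the greatest version among the valid
// stable-egress-controller tags, or false when none is present.
func highest(tags []string) (string, bool) {
	best, bestV, found := "", [3]int{}, false
	for _, t := range tags {
		v, ok := parse(t)
		if !ok {
			continue
		}
		if !found || less(bestV, v) {
			best, bestV, found = t, v, true
		}
	}
	return best, found
}

func main() {
	tags := []string{
		"stable-egress-controller@0.9.0",
		"stable-egress-controller@0.10.1",
		"fleet@3.0.0",
	}
	fmt.Println(highest(tags))
}
```

Comparing the fields numerically matters here: a plain string sort would rank 0.9.0 above 0.10.1.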
Revert the Keygen boot hotfix (#11271)
It masked the crash symptom but wasn’t the fix.
Validation — e2e on staging (without merging)
Built a :sha image from the branch, deployed the controller to staging, candidate-labelled two nodes, and ran the full drill (results in the PR comments):
- Sticky steady state — no cutover when the current holder is still a candidate; allowlist guard passing.
- Failover: Hetzner FIP reassigned via the API, stale label stripped cluster-wide, label moved, host-configurer followed, host datapath correct on the new node, egress IP 116.x/78.x preserved, failback clean. Always converges, never stuck (~5 failovers).
- Timed: warm failover (target was a gateway before) is seamless (sub-5s); a node Cilium has never gatewayed takes >40s to converge the first time (Cilium egress-gateway layer, cilium/cilium#30157), then seamless. Bounded + auto-healed either way.
- Unit tests: go build/vet/test green.
Rollout (zero-blip, automatic)
Enabled on staging + canary + production (failoverController.enabled: true in all three overlays). On each env’s platform deploy the controller starts and adopts that env’s current gateway node (which already holds the active label) — no Floating IP move, no Cilium reconvergence, no egress blip. The md-egress pool stands by as candidates; the active IP migrates onto the dedicated pool on the current node’s next replacement (the only cold ~40s path — an actual failover, bounded + auto-healed).
On merge: mgmt-cluster-apply provisions the md-egress pools (verify its kubectl diff shows no roll of md-0, which runs the server + CNPG); the release publishes the controller image; each env’s platform deploy brings the controller up and adopts. No manual labeling, no timing coordination, no low-traffic window required.
Trade-offs / call-outs
- Failover convergence: warm (target was a gateway before) is seamless; a node Cilium has never gatewayed takes ~40s+ the first time, then seamless. The enable itself is blip-free (adoption, above); the cold path only applies to an actual node-replacement failover, which is bounded + auto-healed. Optional follow-up to make even that seamless: keep the standby warm (rotate the active, or list both in Cilium egressGateways).
- Reverting #11271 re-couples booting pods to egress (running pods unaffected) — flagging in case we want a lighter boot-time guard.
- Alerting (page on no Ready candidate / no active label) is Grafana-side — follow-up in the controller AGENTS.
Confirm before enabling in prod
- The kube-system/hcloud token has Floating-IP write scope in tuist-workloads.
- The reserved set (2 Floating IPs) is provisioned + documented; if you grow it later, add the new /32s to egressIpAllowlist + network.md before use (the allowlist guard fails closed otherwise).
🤖 Generated with Claude Code