feat(infra): HA stable-egress gateway with failover controller (+ revert Keygen boot hotfix)

GitHub issue · Closed

Metadata
Source: tuist/tuist #11272
Updated: Jun 24, 2026
Domains: Atlas
Details

Durable follow-up to the 2026-06-14 tuist.dev 503 incident: make the server’s stable egress highly available so losing the gateway node fails over automatically instead of black-holing egress. Validated end-to-end on staging.

Background: what took the site down

All tuist-tuist-server pods crash-looped → nginx had no upstream → 503. The crash was Tuist.License.assert_valid!/1 hitting a Keygen timeout on boot — but Keygen was a red herring. The real cause: the stable-egress gateway had no node. Since #11150 all server egress is SNAT’d through a single Hetzner Floating IP (116.202.0.10) via CiliumEgressGatewayPolicy, routed through one node hand-labelled tuist.dev/stable-egress-gateway=server. A MachineHealthCheck remediated that node; neither the label nor the Floating IP migrated, so egress silently black-holed and the server crashed on its first outbound call.

What this PR does

infra/stable-egress-controller/ (new Go controller, controller-runtime)

Leader-elected (2 replicas) controller that keeps the Hetzner Floating IP (Cloud API) and the active tuist.dev/stable-egress-gateway label on one node. It adopts whatever node already holds the active label as long as it’s Ready — even one outside the candidate pool — so it never disturbs a working gateway (enabling it, or any steady state, moves nothing). Only when there’s no healthy active node does it fail over to a Ready md-egress candidate, moving IP + label together; the active label is stripped cluster-wide so a stale label can’t shadow the elected node. Reuses the in-cluster kube-system/hcloud token. Matches the repo’s controller idiom (go 1.25, controller-runtime v0.20, distroless); unit-tested.
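
The adopt-then-fail-over rule described above can be sketched as a pure function. This is a minimal illustration, not the controller's actual code: the `Node` type, the `elect` function, and the first-match tie-breaking are all assumptions made for the example.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a cluster node.
type Node struct {
	Name      string
	Ready     bool
	Active    bool // holds the active tuist.dev/stable-egress-gateway label
	Candidate bool // holds the tuist.dev/stable-egress-candidate label
}

// elect picks the node that should hold the Floating IP and the active label.
// failover is true when the IP and label have to move to a new node.
// ok is false when no Ready node is available at all.
func elect(nodes []Node) (name string, failover bool, ok bool) {
	// Adopt a Ready node that already holds the active label,
	// even if it is outside the candidate pool: nothing moves.
	for _, n := range nodes {
		if n.Active && n.Ready {
			return n.Name, false, true
		}
	}
	// No healthy active node: fail over to a Ready candidate.
	// Picking the first match is an assumption of this sketch.
	for _, n := range nodes {
		if n.Candidate && n.Ready {
			return n.Name, true, true
		}
	}
	return "", false, false
}

func main() {
	nodes := []Node{
		{Name: "general-1", Ready: false, Active: true},
		{Name: "md-egress-a", Ready: true, Candidate: true},
	}
	name, failover, ok := elect(nodes)
	fmt.Println(name, failover, ok) // md-egress-a true true
}
```

Keeping the decision separate from the side effects (Floating IP reassignment, label patching) makes the "steady state moves nothing" property easy to unit-test.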

Topology + chart

  • md-egress pool (replicas: 2) in prod + staging, self-labelling tuist.dev/stable-egress-candidate=server via a new ClusterClass workerNodeLabels variable/patch (kubelet sets it at registration, so it survives node replacement — CAPI metadata-label sync won’t carry a tuist.dev/* label).
  • platform chart renders the controller (SA + RBAC + Deployment) gated on failoverController.enabled.
  • host-configurer keeps its eth0-only role, following the active label.

Reserved egress IP set (stable customer allowlist)

Customers allowlist a fixed reserved set of egress IPs, not a single address, so the active IP can be migrated within the set without ever touching their allowlist. The controller takes an egressIpAllowlist and fails closed if the active Floating IP is outside it (a gap, never a leak from an un-allowlisted IP). The prod set is two Floating IPs in tuist-workloads: 116.202.0.10 (active) + 116.202.4.195 (spare); active/standby means only one is ever live, so two is enough to drop the single-IP dependency and allow migration. Customer-facing list in server/priv/docs/en/guides/server/network.md, in lockstep with the chart.
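
A fail-closed allowlist check of this kind might look like the sketch below. The function names are hypothetical, and rejecting every address when the allowlist is empty is an assumption of the example rather than documented controller behaviour.

```go
package main

import (
	"fmt"
	"net/netip"
)

// parseAllowlist parses CIDR strings such as "116.202.0.10/32".
func parseAllowlist(cidrs []string) ([]netip.Prefix, error) {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, fmt.Errorf("invalid allowlist entry %q: %w", c, err)
		}
		out = append(out, p)
	}
	return out, nil
}

// checkActive fails closed: it returns nil only when ip parses and
// falls inside at least one allowlisted prefix.
func checkActive(ip string, allow []netip.Prefix) error {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return fmt.Errorf("invalid floating IP %q: %w", ip, err)
	}
	for _, p := range allow {
		if p.Contains(addr) {
			return nil
		}
	}
	return fmt.Errorf("floating IP %s is outside the reserved egress set", addr)
}

func main() {
	allow, err := parseAllowlist([]string{"116.202.0.10/32", "116.202.4.195/32"})
	if err != nil {
		panic(err)
	}
	fmt.Println(checkActive("116.202.4.195", allow) == nil) // true
	fmt.Println(checkActive("203.0.113.7", allow) == nil)   // false
}
```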

Why a set of /32s, not a single CIDR: Hetzner Cloud has no range/subnet parameter for Floating IPs (verified — hcloud floating-ip create takes none); a contiguous range would need Hetzner Robot subnets (dedicated hardware, not Cloud) or BYOIP+BGP (own a /24 — the someday-if-sales-critical option), and an IPv6 /64 doesn’t help IPv4 allowlists. Full reasoning in infra/helm/platform/README.md.

Release wiring

Wired into the standard component release flow (components.json + release.yml + cliff.toml), like the other infra controllers: semver-tagged image. The deployed tag is resolved at deploy time (k8s:install-platform passes --set with the highest stable-egress-controller@<semver> tag reachable from the commit), the same pattern as the fleet/runtime images: no pinned tag in the chart values, no latest in prod.
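
To illustrate the "highest semver tag" resolution, here is a small sketch that picks the highest `stable-egress-controller@<major>.<minor>.<patch>` tag from a list. It assumes the list of tags reachable from the commit has already been gathered (for example from git), and that pre-release tags are skipped; the helper names are hypothetical and the real deploy task may resolve the tag differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const tagPrefix = "stable-egress-controller@"

// parseTag extracts [major, minor, patch] from tags like tagPrefix+"1.2.3".
// Tags for other components and pre-release tags are rejected.
func parseTag(tag string) ([3]int, bool) {
	var v [3]int
	rest, found := strings.CutPrefix(tag, tagPrefix)
	if !found {
		return v, false
	}
	parts := strings.Split(rest, ".")
	if len(parts) != 3 {
		return v, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return v, false
		}
		v[i] = n
	}
	return v, true
}

// less compares versions numerically, field by field.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// highestTag returns the tag with the highest version, if any tag matches.
func highestTag(tags []string) (string, bool) {
	var best [3]int
	bestTag, found := "", false
	for _, t := range tags {
		v, ok := parseTag(t)
		if !ok {
			continue
		}
		if !found || less(best, v) {
			best, bestTag, found = v, t, true
		}
	}
	return bestTag, found
}

func main() {
	tags := []string{
		"stable-egress-controller@0.9.3",
		"stable-egress-controller@0.10.0",
		"hetzner-robot-controller@2.0.0",
	}
	fmt.Println(highestTag(tags)) // stable-egress-controller@0.10.0 true
}
```

Comparing the numeric fields rather than the raw strings matters here: a string sort would rank 0.9.3 above 0.10.0.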

Revert the Keygen boot hotfix (#11271)

It masked the crash symptom but wasn’t the fix.

Validation — e2e on staging (without merging)

Built a :sha image from the branch, deployed the controller to staging, candidate-labelled two nodes, and ran the full drill (results in the PR comments):

  • Sticky steady state — no cutover when the current holder is still a candidate; allowlist guard passing.
  • Failover — Hetzner FIP reassigned via the API, stale label stripped cluster-wide, label moved, host-configurer followed, host datapath correct on the new node, egress IP 116.x/78.x preserved, failback clean. Always converges, never stuck (~5 failovers).
  • Timed: warm failover (target was a gateway before) is seamless (sub-5s); a node Cilium has never gatewayed takes >40s to converge the first time (Cilium egress-gateway layer, cilium/cilium#30157), then seamless. Bounded + auto-healed either way.
  • Unit tests: go build/vet/test green.

Rollout (zero-blip, automatic)

Enabled on staging + canary + production (failoverController.enabled: true in all three overlays). On each env’s platform deploy the controller starts and adopts that env’s current gateway node (which already holds the active label) — no Floating IP move, no Cilium reconvergence, no egress blip. The md-egress pool stands by as candidates; the active IP migrates onto the dedicated pool on the current node’s next replacement (the only cold ~40s path — an actual failover, bounded + auto-healed).

On merge: mgmt-cluster-apply provisions the md-egress pools (verify its kubectl diff shows no roll of md-0, which runs the server + CNPG); the release publishes the controller image; each env’s platform deploy brings the controller up and adopts. No manual labeling, no timing coordination, no low-traffic window required.

Trade-offs / call-outs

  • Failover convergence: warm (target was a gateway before) is seamless; a node Cilium has never gatewayed takes ~40s+ the first time, then seamless. The enable itself is blip-free (adoption, above); the cold path only applies to an actual node-replacement failover — bounded + auto-healed. Optional follow-up to make even that seamless: keep the standby warm (rotate the active, or list both in Cilium egressGateways).
  • Reverting #11271 re-couples booting pods to egress (running pods unaffected) — flagging in case we want a lighter boot-time guard.
  • Alerting (page on no Ready candidate / no active label) is Grafana-side; the follow-up is noted in the controller's AGENTS.md.

Confirm before enabling in prod

  • The kube-system/hcloud token has Floating-IP write scope in tuist-workloads.
  • The reserved set (2 Floating IPs) is provisioned + documented; if you grow it later, add the new /32s to egressIpAllowlist + network.md before use (the allowlist guard fails closed otherwise).

🤖 Generated with Claude Code

Comments
fortmarek Jun 14, 2026

Added Option A — reserved stable egress set on top of the HA work (commit c426853):

  • Customers now allowlist a fixed reserved set of egress IPs, not a single address — so future capacity growth or an IP migration within the set never forces a customer allowlist change (slow, high-friction enterprise ops). On Hetzner Cloud there’s no owned contiguous CIDR/BYOIP, so the “set” is a reserved pool of individual Floating IPs.
  • Controller gains --egress-ip-allowlist (CIDRs) and fails closed: it refuses to activate a Floating IP whose address is outside the documented set, so we can never egress from an un-allowlisted IP (a gap, never a leak). Unit-tested.
  • network.md reframed to a reserved set with a maintainer note keeping it in lockstep with egressIpAllowlist.

Operator step (gates the full benefit): reserve the extra Floating IPs in tuist-workloads, then add their /32s to both values-tuist.yaml's egressIpAllowlist and network.md before they're used. Today only 116.202.0.10 is provisioned, so the set is currently a set of one; the machinery is in place to grow it without touching customer allowlists.

fortmarek Jun 14, 2026

Deploy safety + review findings

Is merging dangerous to server availability? Inbound (web/API) is low-risk — nothing here touches ingress or rolls the server/DB by design. The real exposure is a brief outbound egress blip during the one-time gateway cutover, plus a high-blast-radius item to verify. Mitigated by staging + a safe default (see below).

Review findings

  • [P1] Stale active labels outside md-egress — fixed in commit dc7a99cdff: reconcileActiveLabel now lists active-labelled nodes cluster-wide and strips the label from any non-elected node, so the hand-labelled general worker is cleaned up on cutover. I kept the dataplane selectors active-label-only (not active + candidate) on purpose: with the cluster-wide strip, a two-label selector would only add a cutover gap where neither the old node (no candidate label) nor the not-yet-elected md-egress node matches. (Rationale in the README.)
  • [P1] go.sum missing → image can’t build — confirmed merge blocker; I can’t run go mod tidy in this sandbox (no module network). Now neutralized for merge safety: the production overlay defaults failoverController.enabled: false, so the chart never references the missing image. Committing go.sum is a prerequisite to enabling.
  • [P2] mutable latest + no release path — production overlay is now enabled: false with image.tag documented as “pin to released semver, not latest.” Full release.yml wiring (semver tag + chart-pin bump, mirroring hetzner-robot-controller) is flagged as a required follow-up in the controller’s AGENTS.md — I deliberately did not hand-edit the production cascade workflow blind.
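
The cluster-wide strip described in the first finding can be reduced to a small planning step: given every node that currently carries the active label and the elected node, decide which labels to remove and whether the elected node still needs one. This is a sketch under assumptions; `planLabelChanges` is a hypothetical helper, not the actual reconcileActiveLabel implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// planLabelChanges takes the names of all nodes carrying the active label
// (listed cluster-wide, not just within md-egress) and the elected node.
// It returns the nodes to strip and whether the elected node must be labelled.
func planLabelChanges(activeLabelled []string, elected string) (strip []string, labelElected bool) {
	labelElected = true
	for _, name := range activeLabelled {
		if name == elected {
			labelElected = false // already labelled, nothing to add
			continue
		}
		strip = append(strip, name) // stale label on a non-elected node
	}
	sort.Strings(strip) // deterministic order for logs and tests
	return strip, labelElected
}

func main() {
	strip, label := planLabelChanges([]string{"general-worker-1"}, "md-egress-a")
	fmt.Println(strip, label) // [general-worker-1] true
}
```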

Staged rollout (now in the platform README)

  1. Commit go.sum → build + publish image → pin image.tag.
  2. Apply ClusterClass + md-egress; verify kubectl diff shows no roll of existing pools (md-0 runs server + CNPG); wait for md-egress nodes Ready + candidate label.
  3. Flip failoverController.enabled: true; controller does the one-time IP/label cutover (~seconds of egress blip, inbound unaffected); verify curl api.ipify.org = egress IP.
  4. Deploy the server image (Keygen revert) last.

Net: as committed, merging is safe (HA machinery lands dormant); enabling HA is a deliberate, staged follow-up.

fortmarek Jun 15, 2026

Release wiring added (P2 resolved)

Wired stable-egress-controller into the standard component release flow, mirroring hetzner-robot-controller so the image is semver-tagged and the prod pin is no longer latest:

  • components.json — registered the component; release:check detects it (verified locally: next: stable-egress-controller@0.1.0, release? true).
  • release.yml: check-releases outputs, a release-stable-egress-controller job (changelog + multi-arch build/push), the publish aggregation needs, artifact download, and the GitHub Release step.
  • cliff.toml — scoped to (stable-egress-controller) commits.
  • renovate.json — tracks the ghcr semver and bumps the platform-chart pin (failoverController.image.tag in values-tuist.yaml); the platform chart deploy rolls it out on merge (same model as the hetzner pin, just via the platform deploy instead of mgmt-cluster-apply).
  • values-tuist.yaml — pinned tag: 0.1.0 with a Renovate marker; stays enabled: false until the first release exists.

Validated: components.json + renovate.json parse; release.yml parses; check.sh stable-egress-controller resolves the initial version and should-release.

The single remaining prerequisite is still go.sum (go mod tidy, needs network) — the release job’s Docker build does go mod download, so it can’t build until go.sum is committed. That’s the one thing I can’t do from here.

fortmarek Jun 15, 2026

e2e validated on staging (without merging)

Got a :sha image from this branch (temporarily added the branch to the image workflow’s push: trigger — workflow_dispatch can’t reach a workflow that isn’t on the default branch yet — then reverted), deployed the controller to staging by hand (it mounts staging’s in-cluster hcloud secret, so no token handling), candidate-labeled the two staging general nodes, and ran the full drill. Staging was restored to its original state afterward.

What passed:

  • Sticky election / no churn — with the current FIP holder (4jv57) already a candidate, the controller kept it; zero cutover, allowlist guard logged active egress IP 78.47.186.71.
  • Failover — dropping 4jv57 from the candidate pool, the controller reassigned the Hetzner Floating IP via the API (fromServer→toServer logged), stripped the stale active label cluster-wide, set it on h6prq; the host-configurer DaemonSet followed; failback to 4jv57 likewise.
  • Host datapath on the new node verified: eth0 had 78.47.186.71/32, the from 78.47.186.71 lookup 2010 rule, and the table-2010 default route.
  • Egress IP preserved: 78.47.186.71 from a staging server pod across the drill (active/standby holds the single allowlisted IP).

Key finding worth acting on: Cilium OSS egress-gateway datapath convergence onto a node it has never used as a gateway took >40s the first time (egress curl timed out during that window), but ~12s on subsequent moves once Cilium knew the node. The controller/host config were correct throughout — this is the Cilium egress-gateway layer (consistent with cilium/cilium#30157).

Implication: a real prod node-replacement failover targets a brand-new node → the slow first-time path, so failover egress downtime can exceed the ~30–60s estimate, and the worst case (does it always converge, or occasionally need a cilium-agent nudge / policy re-apply?) needs validating before we lean on prod auto-failover. I’d treat that as a follow-up to investigate/mitigate (e.g., pre-warm the egress node, or have the controller re-poke the policy on cutover). The controller itself — the part this PR adds — works as designed.

fortmarek Jun 15, 2026

Timed convergence on staging (answering “is it a problem with redundancy?”)

Ran timed failovers on staging:

  • Warm failover (target node was a gateway before): seamless — active label flips by ~6s and egress never dropped at 4–5s poll resolution.
  • Cold failover (a node Cilium has never used as a gateway): the >40s seen once, on the node’s first-ever activation.
  • Always converges, never stuck across ~5 failovers — the safety property that matters.

Conclusion: redundancy removes the catastrophic mode. With the 2-node pool there’s always a healthy target and the FIP move is fast, so the worst case is a bounded, auto-healed egress blip — never the indefinite, human-gated outage from the incident. The one nuance: in steady state only the active node is a programmed Cilium gateway, so the standby is “cold” — the first real failover (a node replacement) hits the >40s cold path, then it’s seamless thereafter.

So: not a blocker. Optional follow-up to make even the first failover seamless — keep the standby warm (periodically rotate the active between the two nodes, or list both in Cilium’s egressGateways). I’d ship the HA as-is and treat warming as a later optimization.

(Staging was used as the test bed via a temporary branch push-trigger on the image workflow + a hand-deployed controller; both reverted/torn down, staging restored to its original single-gateway state.)

fortmarek Jun 15, 2026

Reserved egress set provisioned — no longer single-IP-dependent

Created 3 more Hetzner Floating IPs in the tuist-workloads project, so prod egress is now a reserved set of 4, not one address:

| Floating IP | Address |
|---|---|
| tuist-production-server-egress | 116.202.0.10 (active) |
| tuist-production-server-egress-2 | 116.202.4.195 |
| tuist-production-server-egress-3 | 5.75.222.2 |
| tuist-production-server-egress-4 | 5.75.222.131 |

(Hetzner hands out scattered /32s, not a contiguous block — so it’s a documented set, as discussed.) Wired all four into the controller’s egressIpAllowlist (values-tuist.yaml) and the customer guide (network.md). Customers allowlist the set once; the active egress migrates within it with no customer allowlist change ever again. The controller still operates 116.202.0.10 as the active member; the other three are pre-allowlisted spares (active/standby, one live at a time).

tuist-atlas[bot] Jun 17, 2026

The HA stable-egress gateway with failover controller changes from this PR are now available in version xcresult-processor-image@0.23.0. Update to this version to use the new controller and infrastructure improvements.