
fix(stable-egress-controller): stop hammering the Hetzner API on node heartbeats

GitHub issue · Closed

Metadata
Source
tuist/tuist #11379
Updated
Jun 24, 2026
Domains
Compute
Details

What

Cuts the stable-egress failover controller’s Hetzner Cloud API call rate by ~20x in steady state, via two changes in infra/stable-egress-controller:

  1. Coalesce the per-reconcile reads. Each reconcile made two FloatingIP.GetByName calls — one for Address (allowlist gate) and one for CurrentServerID (assignment check). The FloatingIPManager interface now exposes a single Get that returns both the address and the current server id from one lookup; the reconciler reads the IP once and uses it for both.
  2. Filter the Node watch. The controller watched corev1.Node and funneled every event to a reconcile. A new event predicate drops the updates that can’t change gateway eligibility — chiefly the kubelet status heartbeats and lease renewals that fire every few seconds per node — and only lets through Ready transitions, candidate/active label changes, and node termination. Create/Delete/Generic events still always reconcile.
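The coalesced read in item 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the FloatingIP struct, the exact Get and Assign signatures, the countingManager fake, the allowlist map, and the demoReconcile helper are all assumptions made for the example. Only the idea (one lookup that serves both the allowlist gate and the assignment check) comes from the description above.

```go
package main

import (
	"context"
	"fmt"
)

// FloatingIP is a hypothetical snapshot of the two fields the reconciler needs.
type FloatingIP struct {
	Address         string
	CurrentServerID int64
}

// FloatingIPManager stands in for the interface named in the PR.
// The signatures here are assumed for illustration.
type FloatingIPManager interface {
	Get(ctx context.Context, name string) (FloatingIP, error)
	Assign(ctx context.Context, name string, serverID int64) error
}

// countingManager is a fake that counts API calls instead of talking to Hetzner.
type countingManager struct {
	ip      FloatingIP
	reads   int
	assigns int
}

func (m *countingManager) Get(_ context.Context, _ string) (FloatingIP, error) {
	m.reads++
	return m.ip, nil
}

func (m *countingManager) Assign(_ context.Context, _ string, serverID int64) error {
	m.assigns++
	m.ip.CurrentServerID = serverID
	return nil
}

// reconcileOnce reads the floating IP once and uses that single result for
// both the allowlist gate and the assignment check.
func reconcileOnce(ctx context.Context, m FloatingIPManager, name string, allowlist map[string]bool, wantServerID int64) error {
	ip, err := m.Get(ctx, name)
	if err != nil {
		return fmt.Errorf("looking up Floating IP %q: %w", name, err)
	}
	if !allowlist[ip.Address] {
		return fmt.Errorf("floating IP address %s is not allowlisted", ip.Address)
	}
	if ip.CurrentServerID == wantServerID {
		return nil // already assigned to the desired gateway; no write needed
	}
	return m.Assign(ctx, name, wantServerID)
}

// demoReconcile runs one reconcile against the fake and reports call counts.
func demoReconcile(current, want int64) (reads, assigns int, err error) {
	m := &countingManager{ip: FloatingIP{Address: "203.0.113.10", CurrentServerID: current}}
	allow := map[string]bool{"203.0.113.10": true}
	err = reconcileOnce(context.Background(), m, "example-egress", allow, want)
	return m.reads, m.assigns, err
}

func main() {
	reads, assigns, err := demoReconcile(42, 42)
	fmt.Println("steady state:", reads, "read(s),", assigns, "assign(s), err:", err)
	reads, assigns, err = demoReconcile(42, 7)
	fmt.Println("failover:", reads, "read(s),", assigns, "assign(s), err:", err)
}
```

In both the steady-state and the failover case the sketch issues exactly one read per reconcile; the write only happens when the assignment is wrong.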

Why

During this Sunday’s tuist.dev scare I was checking the egress path (the June 14 outage’s root cause) and found the controller logs full of:

Reconciler error ... looking up Floating IP "tuist-production-server-egress":
Limit of '3600' requests per hour for token '...' reached. (rate_limit_exceeded)

Root cause

The controller reconciled every ~2-3s (visible in the logs), far more often than its 30s resync floor, because the Watches(&corev1.Node{}) source enqueued a reconcile on every node event. In a fleet of Mac minis, runners, and workers, kubelet status/lease heartbeats are near-constant. One reconcile every ~2.5s is ~1,440 reconciles/hr; at 2 GetByName calls per reconcile that’s ~2,880 calls/hr from this controller alone, and the hcloud token is shared with cluster-api (caph node lifecycle). Together they exceed Hetzner’s 3,600 req/hr cap.

Impact

  • In steady state it’s just log noise — reconciles are idempotent and the IP stays put.
  • The real risk: if a gateway node dies while the controller is in a rate-limited window, the FloatingIP.Assign failover call can stall or fail, prolonging an egress gap — the exact failure mode (no stable egress → server can’t reach Keygen → CrashLoopBackOff → 503) that took prod down on June 14.
  • It also burns the shared Hetzner token budget that caph needs for node provisioning/autoscaling.

Why this approach

The two reads were trivially mergeable (the Hetzner FloatingIP object already carries both the address and the assigned server), so coalescing is free. The bigger lever is the watch predicate: node heartbeats carry no information the election keys on, so filtering them removes the dominant trigger without weakening failover. A node going NotReady still flips its Ready condition (caught by the predicate), and deletions still always reconcile. Lengthening the resync interval (resyncing less often) would have been the blunt alternative, but it doesn’t fix event-driven churn and trades away the gap-detection floor.
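The update-filtering rule can be sketched as a plain function. This is an illustration under stated assumptions, not the controller's predicate: Node is a pared-down stand-in for corev1.Node, the label keys are placeholders (the real keys are not given here), and Heartbeat stands in for the status fields that change on every kubelet update.

```go
package main

import "fmt"

// Node is a hypothetical, pared-down stand-in for corev1.Node.
type Node struct {
	Labels      map[string]string
	Ready       bool  // value of the Ready condition
	Terminating bool  // e.g. a deletion timestamp has been set
	Heartbeat   int64 // stands in for fields that change on every status update
}

// Placeholder label keys; assumptions for this sketch only.
const (
	candidateLabel = "example.com/egress-candidate"
	activeLabel    = "example.com/egress-active"
)

// nodeUpdateMatters reports whether an update could change gateway
// eligibility. Heartbeat-only updates fall through to false.
func nodeUpdateMatters(oldNode, newNode Node) bool {
	if oldNode.Ready != newNode.Ready {
		return true // Ready transition
	}
	if oldNode.Terminating != newNode.Terminating {
		return true // node termination
	}
	for _, key := range []string{candidateLabel, activeLabel} {
		oldVal, oldOK := oldNode.Labels[key]
		newVal, newOK := newNode.Labels[key]
		if oldOK != newOK || oldVal != newVal {
			return true // candidate/active label added, removed, or changed
		}
	}
	return false
}

func main() {
	base := Node{Labels: map[string]string{candidateLabel: "true"}, Ready: true, Heartbeat: 100}

	heartbeat := base
	heartbeat.Heartbeat = 110

	notReady := base
	notReady.Ready = false

	fmt.Println("heartbeat-only update reconciles:", nodeUpdateMatters(base, heartbeat))
	fmt.Println("NotReady transition reconciles:", nodeUpdateMatters(base, notReady))
}
```

In controller-runtime, a function like this would typically back the UpdateFunc of a predicate.Funcs, with the Create, Delete, and Generic handlers returning true so those events always reconcile.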

Effect

Steady-state Hetzner reads drop from ~2,880/hr to ~120/hr (one read per 30s resync), with meaningful node events reconciling on top. That is well under the shared 3,600/hr cap.
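The before and after figures follow from simple rate arithmetic, sketched here. The 2.5s interval is an assumed midpoint of the observed ~2-3s reconcile cadence, and callsPerHour is a helper invented for the example.

```go
package main

import "fmt"

// callsPerHour returns API calls per hour for a given reconcile interval
// (in seconds) and number of API calls made per reconcile.
func callsPerHour(intervalSeconds float64, callsPerReconcile int) float64 {
	return 3600 / intervalSeconds * float64(callsPerReconcile)
}

func main() {
	before := callsPerHour(2.5, 2) // heartbeat-driven reconciles, two reads each
	after := callsPerHour(30, 1)   // resync-driven reconciles, one read each
	fmt.Printf("before: %.0f/hr, after: %.0f/hr, ratio: %.0fx\n", before, after, before/after)
	// prints "before: 2880/hr, after: 120/hr, ratio: 24x"
}
```

The resync-only ratio comes out at 24x; reconciles triggered by real node events add to the "after" figure, which is consistent with the roughly 20x reduction quoted above.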

Validation

  • go build ./... && go vet ./... — clean.
  • go test ./... — all pass, including a new TestNodeEventPredicate (heartbeat-only update dropped; Ready transition, candidate/active label change, and termination let through; Create/Delete always reconcile) and the existing reconcile/failover/allowlist suite updated for the Get interface.
  • gofmt -l — clean.

No production change is applied by this PR. The deployed image tag is resolved at deploy time from the highest stable-egress-controller@<semver> reachable from the merged commit, so this ships on the next stable-egress-controller@… release once merged. Filed as a draft for review.

Note (separate, not addressed here)

The thing that actually made tuist.dev look “down” this time was not an outage — production was 5/5 healthy throughout. It was stale client-side browser state (clearing storage / incognito fixed it); the site loaded fine on mobile and from CI. The AGENTS.md “Future work” item about alerting when no Ready candidate exists is still worth doing separately.

🤖 Generated with Claude Code
