fix(stable-egress-controller): stop hammering the Hetzner API on node heartbeats
GitHub issue · Closed
What
Cuts the stable-egress failover controller’s Hetzner Cloud API call rate by ~20x in steady state, via two changes in infra/stable-egress-controller:
- Coalesce the per-reconcile reads. Each reconcile made two FloatingIP.GetByName calls: one for Address (allowlist gate) and one for CurrentServerID (assignment check). The FloatingIPManager interface now exposes a single Get that returns both the address and the current server id from one lookup; the reconciler reads the IP once and uses it for both.
- Filter the Node watch. The controller watched corev1.Node and funneled every event to a reconcile. A new event predicate drops the updates that can’t change gateway eligibility (chiefly the kubelet status heartbeats and lease renewals that fire every few seconds per node) and only lets through Ready transitions, candidate/active label changes, and node termination. Create/Delete/Generic events still always reconcile.
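The filtering rule for update events can be sketched roughly as follows. This is a stdlib-only illustration, not the controller’s code: nodeState is a simplified stand-in for corev1.Node, the two label keys are hypothetical, and the real predicate would presumably be wired in through controller-runtime rather than called directly.

```go
package main

import "fmt"

// Hypothetical label keys; the controller's real keys are not shown in this PR.
const (
	candidateLabel = "example.com/egress-candidate"
	activeLabel    = "example.com/egress-active"
)

// nodeState is a simplified stand-in for the corev1.Node fields the election reads.
type nodeState struct {
	Labels        map[string]string
	Ready         bool  // stand-in for the Ready condition
	Terminating   bool  // stand-in for a non-nil deletion timestamp
	HeartbeatUnix int64 // changes on every kubelet status update
}

// labelChanged reports whether key differs in presence or value between the two maps.
func labelChanged(oldLabels, newLabels map[string]string, key string) bool {
	oldVal, oldOK := oldLabels[key]
	newVal, newOK := newLabels[key]
	return oldOK != newOK || oldVal != newVal
}

// updateAffectsEligibility decides whether an update event should enqueue a reconcile.
// Heartbeat-only updates change none of the compared fields, so they are dropped.
func updateAffectsEligibility(oldN, newN nodeState) bool {
	switch {
	case oldN.Ready != newN.Ready:
		return true // Ready transition
	case labelChanged(oldN.Labels, newN.Labels, candidateLabel),
		labelChanged(oldN.Labels, newN.Labels, activeLabel):
		return true // candidate/active label change
	case !oldN.Terminating && newN.Terminating:
		return true // node termination started
	default:
		return false
	}
}

func main() {
	base := nodeState{Labels: map[string]string{candidateLabel: "true"}, Ready: true, HeartbeatUnix: 100}

	heartbeat := base
	heartbeat.HeartbeatUnix = 110 // only the heartbeat moved

	notReady := base
	notReady.Ready = false

	fmt.Println(updateAffectsEligibility(base, heartbeat)) // false
	fmt.Println(updateAffectsEligibility(base, notReady))  // true
}
```

Comparing only the fields the election keys on keeps the predicate cheap and makes the dropped case (a heartbeat-only update) fall out as the default branch.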
Why
During this Sunday’s tuist.dev scare I was checking the egress path (the June 14 outage’s root cause) and found the controller logs full of:
Reconciler error ... looking up Floating IP "tuist-production-server-egress":
Limit of '3600' requests per hour for token '...' reached. (rate_limit_exceeded)
Root cause
The controller reconciled every ~2-3s (visible in the logs), far more often than its 30s resync floor, because the Watches(&corev1.Node{}) source enqueued a reconcile on every node event, and in a fleet of Mac minis, runners, and workers, kubelet status/lease heartbeats are near-constant. At roughly 1,440 reconciles/hr and 2 GetByName calls per reconcile, that’s ~2,880 calls/hr from this controller alone, and the hcloud token is shared with cluster-api (caph node lifecycle). Together they exceed Hetzner’s 3,600 req/hr cap.
Impact
- In steady state it’s just log noise: reconciles are idempotent and the IP stays put.
- The real risk: if a gateway node dies while the controller is in a rate-limited window, the FloatingIP.Assign failover call can stall or fail, prolonging an egress gap. That is the exact failure mode (no stable egress → server can’t reach Keygen → CrashLoopBackOff → 503) that took prod down on June 14.
- It also burns the shared Hetzner token budget that caph needs for node provisioning/autoscaling.
Why this approach
The two reads were trivially mergeable (the Hetzner FloatingIP object already carries both the address and the assigned server), so coalescing is free. The bigger lever is the watch predicate: node heartbeats carry no information the election keys on, so filtering them removes the dominant trigger without weakening failover. A node going NotReady still flips its Ready condition (caught by the predicate), and deletions still always reconcile. Lengthening the resync interval (resyncing less often) would have been the blunt alternative, but it doesn’t fix event-driven churn and trades away the gap-detection floor.
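A minimal sketch of the coalesced read, under stated assumptions: the FloatingIPManager and Get names come from this PR, but the signatures, the floatingIPState struct, the countingFake, and reconcileOnce are hypothetical and exist only to show one lookup serving both the allowlist gate and the assignment check.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// floatingIPState is a hypothetical struct bundling both fields from one lookup.
type floatingIPState struct {
	Address         string
	CurrentServerID int64
}

// FloatingIPManager uses the interface name from the PR; the signatures are assumed.
type FloatingIPManager interface {
	Get(ctx context.Context, name string) (floatingIPState, error)
	Assign(ctx context.Context, name string, serverID int64) error
}

// countingFake is an in-memory manager that counts API-like calls.
type countingFake struct {
	state   floatingIPState
	gets    int
	assigns int
}

func (f *countingFake) Get(_ context.Context, _ string) (floatingIPState, error) {
	f.gets++
	return f.state, nil
}

func (f *countingFake) Assign(_ context.Context, _ string, serverID int64) error {
	f.assigns++
	f.state.CurrentServerID = serverID
	return nil
}

var errNotAllowlisted = errors.New("floating IP address is not in the allowlist")

// reconcileOnce reads the IP once and reuses the result for both checks.
func reconcileOnce(ctx context.Context, m FloatingIPManager, name string, allowlist map[string]bool, wantServerID int64) error {
	ip, err := m.Get(ctx, name)
	if err != nil {
		return fmt.Errorf("looking up Floating IP %q: %w", name, err)
	}
	if !allowlist[ip.Address] { // allowlist gate
		return errNotAllowlisted
	}
	if ip.CurrentServerID == wantServerID { // assignment check
		return nil
	}
	return m.Assign(ctx, name, wantServerID)
}

// steadyStateCalls runs n reconciles against an already-correct assignment
// and returns how many Get and Assign calls were made.
func steadyStateCalls(n int) (gets, assigns int, err error) {
	fake := &countingFake{state: floatingIPState{Address: "203.0.113.10", CurrentServerID: 42}}
	allowlist := map[string]bool{"203.0.113.10": true}
	for i := 0; i < n; i++ {
		if err := reconcileOnce(context.Background(), fake, "example-egress-ip", allowlist, 42); err != nil {
			return fake.gets, fake.assigns, err
		}
	}
	return fake.gets, fake.assigns, nil
}

func main() {
	gets, assigns, err := steadyStateCalls(5)
	fmt.Printf("reconciles=5 gets=%d assigns=%d err=%v\n", gets, assigns, err)
	// prints: reconciles=5 gets=5 assigns=0 err=<nil>
}
```

With the previous two-read shape the same five reconciles would have cost ten lookups, which is where the halving comes from before the watch predicate is considered at all.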
Effect
Steady-state Hetzner reads drop from ~2,880/hr to ~120/hr (one read per 30s resync), with meaningful node events reconciling on top. Well under the shared 3,600/hr cap.
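The figures above follow from simple rate arithmetic, sketched here as a back-of-the-envelope check. The 2.5s interval is an assumed midpoint of the ~2-3s observed in the logs, and callsPerHour is a hypothetical helper.

```go
package main

import "fmt"

// callsPerHour estimates Hetzner reads per hour for a loop that reconciles
// every reconcileIntervalSeconds and makes readsPerReconcile lookups each time.
func callsPerHour(reconcileIntervalSeconds, readsPerReconcile float64) float64 {
	return 3600 / reconcileIntervalSeconds * readsPerReconcile
}

func main() {
	before := callsPerHour(2.5, 2) // ~2-3s between reconciles, two GetByName calls each
	after := callsPerHour(30, 1)   // 30s resync floor, one Get each

	fmt.Printf("before=%.0f/hr after=%.0f/hr reduction=%.0fx\n", before, after, before/after)
	// prints: before=2880/hr after=120/hr reduction=24x
}
```

The 24x here is the resync-only floor; reconciles triggered by meaningful node events add to the "after" figure, which is consistent with the ~20x steady-state estimate.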
Validation
- go build ./... && go vet ./... ran clean.
- go test ./... passes in full, including a new TestNodeEventPredicate (heartbeat-only update dropped; Ready transition, candidate/active label change, and termination let through; Create/Delete always reconcile) and the existing reconcile/failover/allowlist suite updated for the Get interface.
- gofmt -l ran clean.
No production change is applied by this PR. The deployed image tag is resolved at deploy time from the highest stable-egress-controller@<semver> reachable from the merged commit, so this ships on the next stable-egress-controller@… release once merged. Filed as a draft for review.
Note (separate, not addressed here)
The thing that actually made tuist.dev look “down” this time was not an outage — production was 5/5 healthy throughout. It was stale client-side browser state (clearing storage / incognito fixed it); the site loaded fine on mobile and from CI. The AGENTS.md “Future work” item about alerting when no Ready candidate exists is still worth doing separately.
🤖 Generated with Claude Code