
fix(kura,server): stop Kura cold-start replication storm and let the fix reach degraded servers

GitHub issue · Closed

Metadata

Source: tuist/tuist #11405
Updated: Jun 24, 2026
Domains: Kura

Details

Two paired changes. One stops the storm; the other makes the fix reach the servers that are already broken. Neither is sufficient on its own.

1. fix(kura): per-target replication backoff

Adds per-target exponential backoff (2s → 60s) to Kura’s outbox replication loop. process_outbox now skips an outbox message whose target peer is in a backoff window, clears a target’s backoff on the first successful replication, and extends it (capped at 60s) on each consecutive failure.
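
The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not Kura's actual code: `BackoffTracker`, `TargetBackoff`, and `backoff_delay` are hypothetical names, the doubling factor is an assumption (the description only says "exponential"), and only the two constants come from this PR. Time is passed in explicitly so the behavior is deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Constants named in the PR description.
const REPLICATION_BACKOFF_BASE_SECS: u64 = 2;
const REPLICATION_BACKOFF_MAX_SECS: u64 = 60;

/// Hypothetical per-target state; the real Kura struct may differ.
#[derive(Debug, Clone, Copy)]
struct TargetBackoff {
    consecutive_failures: u32,
    retry_at: Instant,
}

/// Hypothetical tracker keyed by peer identifier.
#[derive(Debug, Default)]
struct BackoffTracker {
    targets: HashMap<String, TargetBackoff>,
}

/// Delay after `failures` consecutive failures: base * 2^(failures - 1), capped.
/// Doubling is an assumption; the PR only states "exponential".
fn backoff_delay(failures: u32) -> Duration {
    let exp = failures.saturating_sub(1).min(16);
    let secs = REPLICATION_BACKOFF_BASE_SECS
        .saturating_mul(1u64 << exp)
        .min(REPLICATION_BACKOFF_MAX_SECS);
    Duration::from_secs(secs)
}

impl BackoffTracker {
    /// True while `target` is inside its backoff window at time `now`.
    fn is_backed_off(&self, target: &str, now: Instant) -> bool {
        self.targets.get(target).map_or(false, |b| now < b.retry_at)
    }

    /// Extend the window after a failed replication attempt; returns the new delay.
    fn record_failure(&mut self, target: &str, now: Instant) -> Duration {
        let failures = self
            .targets
            .get(target)
            .map_or(0, |b| b.consecutive_failures)
            .saturating_add(1);
        let delay = backoff_delay(failures);
        self.targets.insert(
            target.to_string(),
            TargetBackoff { consecutive_failures: failures, retry_at: now + delay },
        );
        delay
    }

    /// Clear the window on the first successful replication.
    fn record_success(&mut self, target: &str) {
        self.targets.remove(target);
    }
}
```

In a loop shaped like process_outbox, a message would be skipped when `is_backed_off` returns true for its target, and the send result would feed `record_failure` or `record_success`. Keying the state by target rather than by message is what keeps healthy peers unaffected.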

Root cause: a cold-starting Kura pod whose peers are unreachable re-fired its entire outbox backlog every cycle (REPLICATION_RETRY_SECS, 2s) with no per-target backoff, because process_outbox iterates every queued message and leaves a failed one in the outbox. For a large account against a degraded mesh, the whole backlog retried every cycle and every attempt failed: ~120/sec mTLS peer requests that blew past hyper’s 1024 locally-reset-stream limit (the CVE-2023-44487 / HTTP/2 rapid-reset mitigation). Once a connection hits that limit it is poisoned and every request fails with "error sending request for url ...:7443", so the pod never reaches a serving state. The transport code is identical across versions; the problem only surfaced once an unrelated bootstrap-budget bug stopped killing cold pods before they could storm.

Why backoff over alternatives: it collapses the retry pattern from “whole backlog every cycle” to ~1 attempt per peer per window, far under the 1024 limit, while still retrying promptly once a peer recovers (backoff clears on first success). Raising hyper’s reset limit only treats the symptom; a global rate limit would also throttle healthy peers.
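
To make the difference in volume concrete, here is a rough back-of-the-envelope model. It rests on assumptions not stated in the PR: a backlog of 240 messages aimed at a single unreachable peer (inferred from ~120/sec over a 2s cycle), a doubling backoff, and exactly one failed attempt per backoff window. The function names are hypothetical.

```rust
/// Attempts against one unreachable peer over `horizon_secs` when every
/// queued message is retried every `cycle_secs` (the pre-fix behavior).
fn attempts_without_backoff(backlog: u64, cycle_secs: u64, horizon_secs: u64) -> u64 {
    backlog * (horizon_secs / cycle_secs)
}

/// Attempts against one unreachable peer over `horizon_secs` when each
/// failure opens a window that doubles from `base_secs` up to `max_secs`
/// and only one attempt is made per window (assumed model of the fix).
fn attempts_with_backoff(base_secs: u64, max_secs: u64, horizon_secs: u64) -> u64 {
    let mut elapsed = 0;
    let mut delay = base_secs;
    let mut attempts = 0;
    while elapsed < horizon_secs {
        attempts += 1; // one attempt, which fails
        elapsed += delay; // wait out the backoff window
        delay = (delay * 2).min(max_secs);
    }
    attempts
}
```

Under these assumptions, `attempts_without_backoff(240, 2, 60)` gives 7200 attempts in the first minute, while `attempts_with_backoff(2, 60, 60)` gives 5 (at roughly 0s, 2s, 6s, 14s, and 30s). If every failed attempt counted toward the 1024 limit on a single connection, the pre-fix rate of ~120/sec would reach it in about 8.5 seconds.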

2. fix(server): roll degraded Kura servers on a runtime image change

Widens the runtime-image reconciler’s rollout query (servers_needing_version_query) from status == :active to the present-intent set [:provisioning, :active, :failed].

Why: the reconciler only rolled :active servers, so a server stuck on a broken image could never receive the image that would fix it — leaving manual break-glass KuraInstance CR patches as the only recovery path. A :failed server can’t self-heal on the image that’s failing it, and the version rollout meant to rescue it skipped exactly those servers. The rollout path already handles degraded servers safely: apply_deployment writes the new image with no endpoint precondition, and the /up probe only gates the final :active marking after the pods already report the new image. The query still requires current_image_tag != desired, so a degraded server rolls once per version (no churn loop).
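
The widened selection rule can be modeled as a simple predicate. The real servers_needing_version_query lives in the Elixir server and is not shown here; the Rust below is only an illustration of the logic, and `ServerStatus`, `KuraServer`, `needs_version_rollout`, and the `Destroying` variant (a stand-in for any status outside the present-intent set) are all hypothetical.

```rust
/// Hypothetical status enum; `Destroying` stands in for any status
/// outside the present-intent set named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ServerStatus {
    Provisioning,
    Active,
    Failed,
    Destroying,
}

/// Hypothetical, reduced view of a Kura server record.
#[derive(Debug, Clone)]
struct KuraServer {
    status: ServerStatus,
    current_image_tag: String,
}

/// A server is selected for rollout when it is in the present-intent set
/// and its image has drifted from the desired tag.
fn needs_version_rollout(server: &KuraServer, desired_image_tag: &str) -> bool {
    let present_intent = matches!(
        server.status,
        ServerStatus::Provisioning | ServerStatus::Active | ServerStatus::Failed
    );
    // The drift check is what prevents a churn loop: once the server
    // reports the desired tag, it stops matching.
    present_intent && server.current_image_tag != desired_image_tag
}
```

Before the change, the status test would have matched `Active` only, so a `Failed` server on an old tag was never selected. With the wider set it is selected once, and drops out again as soon as its current tag equals the desired one.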

Why both, together

Without change 1, cold pods storm and never join. Without change 2, the backoff fix can’t reach the servers that are already down, because the reconciler skips them. Together, a normal release rolls the fix across the whole mesh, degraded servers included, and the mesh self-heals with no break-glass intervention.

Impact

Cold Kura pods can join a degraded mesh without self-poisoning their peer connections, and a degraded mesh recovers from a normal release on its own. No API or config surface change; backoff bounds are constants (REPLICATION_BACKOFF_BASE_SECS = 2, REPLICATION_BACKOFF_MAX_SECS = 60).

Validation

  • kura: cargo test --lib — 259 passed (257 existing + 2 new: process_outbox_backs_off_unreachable_target, process_outbox_skips_backed_off_target); cargo clippy --lib --tests + cargo fmt clean.
  • server: mix test test/tuist/kura_test.exs — 41 passed (+1 new: a :failed server with a drifted image gets a rollout scheduled); mix format + mix credo clean.
  • Prod validation against the live degraded mesh was deliberately skipped: the behavior is deterministic and unit-covered, the affected mesh serves no customers, and applying an unmerged ad-hoc image to production carried more risk than value. With both changes shipped, the currently-degraded mesh recovers on the next deploy without manual patches.

Merge note

The two commits are scoped separately on purpose (fix(kura) + fix(server)): the release tooling is commit-scope-gated, so cutting the Kura runtime release needs the fix(kura) scope preserved. Rebase-merge, or squash with a title that keeps the fix(kura) scope, so the Kura release is cut; the server change deploys via continuous deployment regardless.
