fix(server): fail runner-cache dispatch over to public when the Kura node is unready

GitHub issue · Closed

Source: tuist/tuist #11461
Updated: Jun 24, 2026
Domains: Kura

What

Make the runner-cache cache-endpoint dispatch readiness-aware so a wedged Kura node degrades builds to the public cache instead of timing them out.

  • New kura_servers.last_ready_at heartbeat. The reconciler stamps it on every tick in which a private node-port server’s endpoint is observable, and stops stamping while it isn’t.
  • Kura.runner_cache_endpoint_url/2 now serves a node-port endpoint only while that heartbeat is fresh (120s window). Once stale it returns nil, so the build falls back to the public cache.
  • Cluster-DNS private servers (Linux) are unchanged — their in-cluster Service already drops a not-ready pod from its endpoints, so there’s no node-port heartbeat to consult.
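
The gating described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the actual Tuist.Kura code: the module name, the Server struct, its :exposure field, and the endpoint_url/2 function are all hypothetical, and the inclusive comparison at exactly 120s is an assumption. Only the 120s window and the nil-means-fallback contract come from the description.

```elixir
defmodule RunnerCacheDispatchSketch do
  @moduledoc """
  Illustrative only: the struct fields, the `:exposure` values, and this
  module are assumptions, not the real Tuist.Kura implementation.
  """

  # Freshness window from the description (120s).
  @ready_window_seconds 120

  defmodule Server do
    # Hypothetical shape of a private Kura server row.
    defstruct [:url, :exposure, :last_ready_at]
  end

  @doc "Returns the private endpoint URL, or nil to signal fallback to the public cache."
  def endpoint_url(server, now \\ DateTime.utc_now())

  # Cluster-DNS servers are served regardless of the heartbeat.
  def endpoint_url(%Server{exposure: :cluster_dns, url: url}, _now), do: url

  # A node-port server that was never stamped is treated as stale.
  def endpoint_url(%Server{exposure: :node_port, last_ready_at: nil}, _now), do: nil

  # A node-port server is served only while its heartbeat is inside the window.
  def endpoint_url(%Server{exposure: :node_port, url: url, last_ready_at: stamped}, now) do
    if DateTime.diff(now, stamped, :second) <= @ready_window_seconds, do: url, else: nil
  end
end
```

Passing the current time as an argument keeps the check deterministic in tests; a caller would treat a nil result as "use the public cache".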

Why

The runner-cache endpoint is a hard override: dispatch hands a private node’s URL to the build, and the client uses it with no fallback. But a private server is marked :active once the controller observes the desired image and is never re-checked against /ready; refresh_private_server_url even deliberately keeps the last URL through transient gaps to avoid flapping. Public servers differ: their activation keeps a live /up probe as the readiness authority.

So when a node-port Kura pod is up but /ready returns 503 (for example, the tuist cross-region mesh wedged on bootstrap replication with every node stuck Pending), dispatch kept routing builds to the dead endpoint. This was observed in production CI as floods of Failed to download <Module> ... The request timed out on module-cache downloads, with no degradation to the public cache. Customer macOS runner builds hit the same path.

This is the resilience half of that incident: the Kura mesh still has to recover (tracked separately), but the cache layer should fail soft, not time out.

Why a heartbeat + staleness window (not demote-on-first-miss)

external_endpoint returning observable doubles as the readiness signal (the controller only publishes the node-port endpoint for a ready primary pod). Stamping a heartbeat and gating on staleness respects the existing anti-flap intent: a single slow or missed reconcile tick (the cron interval is 30s) keeps serving, but a sustained /ready 503 fails over within ~2 minutes. It also self-heals if the reconciler itself stalls: no stamps arrive, the heartbeat goes stale, and dispatch falls back to public (which always works).
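
To make the timing concrete, here is a small simulation of the heartbeat over successive reconcile ticks. It is a sketch under stated assumptions: the module and function names are hypothetical, time is modelled as plain seconds, staleness is checked only at tick boundaries, and a heartbeat exactly 120s old is assumed to still count as fresh. The 30s tick and 120s window are taken from the description.

```elixir
defmodule HeartbeatTimelineSketch do
  # Values from the description: 30s reconcile tick, 120s freshness window.
  @tick_seconds 30
  @window_seconds 120

  @doc """
  Takes one boolean per reconcile tick (true = endpoint observable, so the
  heartbeat is stamped) and returns, for each tick, whether dispatch would
  still serve the private endpoint immediately after it.
  """
  def serving_after_each_tick(observations) do
    {results, _last_stamp} =
      observations
      |> Enum.with_index(1)
      |> Enum.map_reduce(nil, fn {observable?, tick}, last_stamp ->
        now = tick * @tick_seconds
        # Stamp only when the endpoint is observable; otherwise keep the old stamp.
        stamp = if observable?, do: now, else: last_stamp
        # Never stamped (nil) counts as stale; otherwise compare age to the window.
        serving? = stamp != nil and now - stamp <= @window_seconds
        {serving?, stamp}
      end)

    results
  end
end
```

With these assumptions, `serving_after_each_tick([true, false, true])` stays `true` throughout, because a single missed tick never ages the heartbeat past the window. A sustained outage such as `[true, false, false, false, false, false]` turns `false` only on the last tick, once the stamp is more than 120s old.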

The :active status enum is intentionally left alone (from :active you can only go :failed/:destroying, and :failed carries deploy-failure semantics + UI/alerts), so demoting on a transient readiness blip would be both semantically wrong and flappy.

Migration

The migration is add :last_ready_at, :timestamptz. The column is nullable, stamped by the reconciler on subsequent ticks, and safe to add on a live table. Existing active node-port servers read back nil (treated as stale) until the next reconcile stamps them, so there’s a brief post-deploy window where dispatch falls back to the public cache before healing: degraded, not broken.

Validation

  • mix test test/tuist/kura_test.exs test/tuist/kura/ test/tuist_web/controllers/runners_controller_test.exs — all green.
  • New tests: the reconciler heartbeats last_ready_at when the endpoint is observable and stops while it isn’t; runner_cache_endpoint_url/2 serves a fresh node-port endpoint, falls back when the heartbeat is stale or never set, and serves cluster-DNS private servers regardless of the heartbeat.
  • mix credo clean on the changed modules; mix excellent_migrations.check_safety clean for the new migration.
  • data-export.md updated to list the new observed-state column.

🤖 Generated with Claude Code
