
fix(kura): keep fresh replicas out of primary routing

GitHub issue · Closed

Metadata
Source
tuist/tuist #11190
Updated
Jun 24, 2026
Domains
Kura
Details

Fixes Kura cache misses after replica churn by keeping large upload staging on each pod’s persistent data volume, bounding that staging inside Kura, and preventing freshly restarted replicas from becoming the public primary before they have time to bootstrap from peers.

The incident showed three related failure modes. First, multipart assembly could exceed the 4 GiB /tmp/kura emptyDir limit and evict pods. Second, the replacement pod could then be selected for public cache traffic before its replicated artifact set had caught up. Third, increasing only the Kubernetes tmp size would move the failure threshold without preserving Kura’s bounded resource model.

This changes managed Kura instances, the standalone chart, and the local e2e compose runtime to stage temporary data under /var/cache/kura/tmp on the per-pod data volume, removes the tmp emptyDir, and introduces KURA_TMP_DIR_MAX_BYTES as an application-level staging budget. HTTP uploads, replication bootstrap, REAPI ByteStream writes, and multipart assembly now check that budget and return backpressure instead of relying on the PVC or kubelet as the first limit. The Helm chart exposes the budget as config.tmpDirMaxBytes, defaulting to 8 GiB.
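As a rough illustration of an application-level staging budget, the sketch below reserves bytes before staging and releases them when the reservation is dropped. It is not Kura's actual code: TmpBudget, Reservation, BudgetExceeded, and try_reserve are hypothetical names, and the only detail taken from the description is that a caller exceeding the budget gets a rejection it can turn into backpressure.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical staging budget. `max_bytes` would come from
/// KURA_TMP_DIR_MAX_BYTES in a real deployment (assumption).
pub struct TmpBudget {
    max_bytes: u64,
    used: AtomicU64,
}

/// Returned when a reservation does not fit in the remaining budget.
#[derive(Debug, PartialEq, Eq)]
pub struct BudgetExceeded {
    pub requested: u64,
    pub available: u64,
}

/// RAII guard: the reserved bytes are returned to the budget on drop.
pub struct Reservation {
    budget: Arc<TmpBudget>,
    bytes: u64,
}

impl TmpBudget {
    pub fn new(max_bytes: u64) -> Arc<Self> {
        Arc::new(Self { max_bytes, used: AtomicU64::new(0) })
    }

    /// Bytes currently reserved.
    pub fn used(&self) -> u64 {
        self.used.load(Ordering::Acquire)
    }

    /// Reserve `bytes` or fail without changing the counter.
    pub fn try_reserve(self: &Arc<Self>, bytes: u64) -> Result<Reservation, BudgetExceeded> {
        let mut current = self.used.load(Ordering::Acquire);
        loop {
            let available = self.max_bytes.saturating_sub(current);
            if bytes > available {
                return Err(BudgetExceeded { requested: bytes, available });
            }
            // Retry if another thread changed the counter in the meantime.
            match self.used.compare_exchange_weak(
                current,
                current + bytes,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Ok(Reservation { budget: Arc::clone(self), bytes }),
                Err(observed) => current = observed,
            }
        }
    }
}

impl Drop for Reservation {
    fn drop(&mut self) {
        self.budget.used.fetch_sub(self.bytes, Ordering::AcqRel);
    }
}
```

In this sketch a handler would call try_reserve before writing to the staging directory and map a BudgetExceeded error to a backpressure response, so the limit is enforced inside the process rather than by the PVC or kubelet.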

It also adds a 10-minute primary-routing age gate for replicated instances. Single-replica instances remain routable immediately, while multi-replica instances avoid selecting freshly restarted pods as the public primary.
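The age gate can be pictured as a simple eligibility predicate. The sketch below is an assumption-heavy illustration, not the controller's implementation: Replica, eligible_for_primary, and pick_primary are hypothetical names, and both preferring the oldest eligible replica and returning no primary when every replica is fresh are choices made here for the example, since the description does not say how those cases are handled.

```rust
use std::time::Duration;

/// Minimum age before a replica may become the public primary
/// (10 minutes, per the description).
pub const PRIMARY_MIN_AGE: Duration = Duration::from_secs(10 * 60);

/// Hypothetical view of a replica: its name and time since it started.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Replica {
    pub name: String,
    pub age: Duration,
}

/// Single-replica instances are always eligible; otherwise the replica
/// must be at least PRIMARY_MIN_AGE old.
pub fn eligible_for_primary(replica: &Replica, replica_count: usize) -> bool {
    replica_count <= 1 || replica.age >= PRIMARY_MIN_AGE
}

/// Assumption for this sketch: among eligible replicas, pick the oldest.
/// Returns None when no replica passes the gate.
pub fn pick_primary(replicas: &[Replica]) -> Option<&Replica> {
    replicas
        .iter()
        .filter(|r| eligible_for_primary(r, replicas.len()))
        .max_by_key(|r| r.age)
}
```

With two replicas aged 30 seconds and one hour, this sketch selects the one-hour replica; with a single 30-second replica, that replica is still selected.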

The new e2e regression runs Kura with a deliberately small tmp staging budget, posts through the public cache API, and asserts Kura returns 503 backpressure while the process remains healthy with no restart.
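The behavior the regression checks, a rejected upload that leaves the process running, can be approximated in-process with a bounded streaming copy. This is a minimal sketch under stated assumptions: stage_with_budget is a hypothetical helper, the 200 success status is assumed for illustration, and only the 503 backpressure status comes from the description.

```rust
use std::io::{self, Read, Write};

pub const STATUS_OK: u16 = 200; // assumed success status for this sketch
pub const STATUS_SERVICE_UNAVAILABLE: u16 = 503; // backpressure, per the description

/// Hypothetical helper: copies `body` into `staging` chunk by chunk and
/// stops with 503 before writing any chunk that would exceed `budget_bytes`.
/// I/O errors are returned to the caller instead of panicking.
pub fn stage_with_budget<R: Read, W: Write>(
    mut body: R,
    mut staging: W,
    budget_bytes: u64,
) -> io::Result<u16> {
    let mut chunk = [0u8; 8192];
    let mut staged: u64 = 0;
    loop {
        let n = body.read(&mut chunk)?;
        if n == 0 {
            // End of body: everything fit inside the budget.
            return Ok(STATUS_OK);
        }
        if staged + n as u64 > budget_bytes {
            return Ok(STATUS_SERVICE_UNAVAILABLE);
        }
        staging.write_all(&chunk[..n])?;
        staged += n as u64;
    }
}
```

Because the limit is checked before each write, the staged data never grows past the budget, which is the property that keeps an oversized upload from turning into a pod eviction.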

How to test locally

  • mise exec -- cargo check (run from kura)
  • mise exec -- cargo test from_lookup (run from kura)
  • mise exec -- cargo test read_request_to_temp (run from kura)
  • mise run test-e2e -- spec/e2e/tmp_budget_spec.sh (run from kura)
  • mise exec go -- go test ./controllers (run from infra/kura-controller)
  • mise exec helm -- helm template kura kura/ops/helm/kura
  • git diff --check
Comments
tuist-atlas[bot] Jun 10, 2026

The fix to keep fresh replicas out of primary routing is now available in kura@0.8.4. Update to that version to apply this change.