fix(kura): stabilize peer discovery and primary routing

Metadata

Source

tuist/tuist #11151

Updated

Jun 24, 2026

Domains

Kura

Details

Resolves N/A

This fixes Kura cache convergence after pod rollouts by making peer discovery and public primary selection depend on runtime mesh health, not only Kubernetes readiness.

The root cause was that HTTPS peer discovery could probe the headless Service as a single target. In mTLS mode that kept TLS verification correct, but it also let DNS or connection reuse keep returning a single backend, including the caller itself. A pod could then mark initial discovery complete, enter the serving state, and remain isolated with a ring of one member. The controller would still keep or select that pod as the public primary because it checked only the Pod Ready condition.

What changed:

Kura now resolves the headless discovery DNS name dynamically on each membership pass and probes every returned socket address separately while preserving the DNS hostname for TLS/SNI.
Peer TLS client construction is factored into a reusable factory so discovery can build per-address clients without rereading certificate material.
The Kura controller now checks /status/rollout before keeping or selecting a public primary, requiring serving state, writer lock ownership, and enough ring visibility for the replica count.
Primary routing documentation and tests now cover runtime-isolated but Kubernetes-ready pods.
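The controller-side check described above can be sketched as a small eligibility gate. This is an illustrative Go sketch under stated assumptions: the `rolloutStatus` field names and JSON keys are hypothetical and do not reflect the actual `/status/rollout` schema, and "enough ring visibility" is modeled here as seeing at least as many ring members as there are replicas, which may differ from the real threshold.

```go
package main

// rolloutStatus is a hypothetical shape for the /status/rollout response.
// Field names and JSON keys are assumptions made for this sketch.
type rolloutStatus struct {
	State           string `json:"state"`
	HoldsWriterLock bool   `json:"holds_writer_lock"`
	RingMembers     int    `json:"ring_members"`
}

// eligibleAsPrimary reports whether a pod may be kept or selected as the
// public primary. Kubernetes readiness is necessary but not sufficient:
// the pod must also be serving, own the writer lock, and see enough of
// the ring for the configured replica count.
func eligibleAsPrimary(podReady bool, s rolloutStatus, replicas int) bool {
	if !podReady {
		return false
	}
	if s.State != "serving" {
		return false
	}
	if !s.HoldsWriterLock {
		return false
	}
	// Assumed threshold: the pod must see every replica in its ring.
	return s.RingMembers >= replicas
}
```

With this gate, a pod that is Kubernetes-ready but reports a ring of one member in a three-replica deployment is rejected, which is the runtime-isolated case the new tests are said to cover.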

How to test locally

mise exec go@1.25.0 -- go test ./... (run from infra/kura-controller)
mise exec rust@latest -- cargo test --manifest-path kura/Cargo.toml replication::tests
mise exec rust@latest -- cargo clippy --manifest-path kura/Cargo.toml --all-targets -- -D warnings

Comments

tuist-atlas[bot] Jun 9, 2026

The changes from this pull request are now available in Kura 0.7.4. Update to ghcr.io/tuist/kura:0.7.4 to use the stabilized peer discovery and primary routing.