Hive
fix(kura): stabilize peer discovery and primary routing
GitHub issue · Closed
Source: tuist/tuist #11151
Updated: Jun 24, 2026
Domains: Kura
Resolves N/A
This fixes Kura cache convergence after pod rollouts by making peer discovery and public primary selection depend on runtime mesh health, not only Kubernetes readiness.
The root cause was that HTTPS peer discovery could probe the headless Service as a single target. In mTLS mode that kept TLS verification correct, but it also let DNS or connection reuse keep returning a single backend, including the caller itself. A pod could then mark initial discovery complete, enter serving, and remain isolated with one ring member. The controller would still keep or select that pod as the public primary because it only checked Pod Ready.
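The failure mode above comes from treating the headless Service name as one dial target. A minimal sketch of the alternative, using only the Rust standard library, is shown here: resolve the name on every membership pass and turn each returned socket address into its own probe target, while carrying the DNS hostname along for TLS verification and SNI. The names `ProbeTarget`, `expand_targets`, and `resolve_targets` are hypothetical, and skipping the caller's own address is an assumption made for illustration, not something the PR description confirms.

```rust
use std::collections::BTreeSet;
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// One probe target (hypothetical type): the concrete address to dial plus
/// the DNS name to present for TLS verification and SNI.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProbeTarget {
    pub addr: SocketAddr,
    pub tls_server_name: String,
}

/// Expands resolved addresses into one target per distinct address.
/// Assumption: the caller's own address, if known, is skipped so a pod
/// never counts a probe of itself as a discovered peer.
pub fn expand_targets(
    host: &str,
    resolved: impl IntoIterator<Item = SocketAddr>,
    self_addr: Option<SocketAddr>,
) -> Vec<ProbeTarget> {
    // Deduplicate and sort so repeated DNS answers yield a stable target list.
    let unique: BTreeSet<SocketAddr> = resolved.into_iter().collect();
    unique
        .into_iter()
        .filter(|addr| Some(*addr) != self_addr)
        .map(|addr| ProbeTarget {
            addr,
            tls_server_name: host.to_string(),
        })
        .collect()
}

/// Resolves `host:port` afresh. Intended to be called once per membership
/// pass so that pods added or removed during a rollout are picked up.
pub fn resolve_targets(
    host: &str,
    port: u16,
    self_addr: Option<SocketAddr>,
) -> io::Result<Vec<ProbeTarget>> {
    let resolved = (host, port).to_socket_addrs()?;
    Ok(expand_targets(host, resolved, self_addr))
}
```

Keeping the expansion step separate from the DNS lookup means the per-address logic can be tested without a network, and each target can be dialed on its own connection so connection reuse cannot collapse the peer set to a single backend.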
What changed:
- Kura now resolves the headless discovery DNS name dynamically on each membership pass and probes every returned socket address separately while preserving the DNS hostname for TLS/SNI.
- Peer TLS client construction is factored into a reusable factory so discovery can build per-address clients without rereading certificate material.
- The Kura controller now checks `/status/rollout` before keeping or selecting a public primary, requiring serving state, writer lock ownership, and enough ring visibility for the replica count.
- Primary routing documentation and tests now cover runtime-isolated but Kubernetes-ready pods.
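The controller-side rule in the list above can be sketched as a small predicate. This is an illustration rather than the controller's actual code (which lives in `infra/kura-controller` and is written in Go): the `RolloutStatus` struct, its field names, and the `eligible_primary` function are hypothetical, and the ring threshold is passed in by the caller because the description does not say how "enough ring visibility" is derived from the replica count.

```rust
/// Hypothetical view of what a pod reports at /status/rollout.
/// Field names are assumptions, not the real response schema.
#[derive(Debug, Clone, Copy)]
pub struct RolloutStatus {
    pub serving: bool,
    pub holds_writer_lock: bool,
    pub ring_members: usize,
}

/// Returns true only if the pod is Kubernetes-ready AND healthy at runtime.
/// `status` is `None` when the rollout endpoint could not be read.
/// `min_ring_members` is the caller-chosen threshold for the replica count.
pub fn eligible_primary(
    pod_ready: bool,
    status: Option<RolloutStatus>,
    min_ring_members: usize,
) -> bool {
    match status {
        Some(s) => {
            pod_ready
                && s.serving
                && s.holds_writer_lock
                && s.ring_members >= min_ring_members
        }
        // No runtime status means the pod cannot be trusted as primary.
        None => false,
    }
}
```

With a threshold of 3, a pod that is Ready, serving, and holding the writer lock but sees only itself in the ring (`ring_members: 1`) is rejected, which is the runtime-isolated case that a Pod Ready check alone let through.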
How to test locally
- `mise exec go@1.25.0 -- go test ./...` from `infra/kura-controller`
- `mise exec rust@latest -- cargo test --manifest-path kura/Cargo.toml replication::tests`
- `mise exec rust@latest -- cargo clippy --manifest-path kura/Cargo.toml --all-targets -- -D warnings`