feat(server): project Kura server status from observed cluster state
GitHub issue · Closed
Makes kura_servers.status a projection of observed cluster state instead of an independently-mutated state machine, which removes the stranded-:failed bug class by construction, and bundles a related infra fix.
What changed
- Projection model. New reconciler-written columns `observed_image_tag`/`last_observed_at`. Each tick the reconciler projects every present-intent server the deployment loop did not handle: observe the `KuraInstance`, record the observation, and re-derive `status` from `(latest deployment intent, observed image, endpoint readiness)`. Infra recovery is reflected with no out-of-band retry, so `:failed` is no longer a sticky terminal sink.
- Heal-forward is correct for drift. The desired image is the latest deployment’s image, so a rollout the controller eventually applied heals forward even though it differs from what the server used to serve (the previously-stranded previously-active case).
- No flapping. A serving `:active` server is not flipped on transient observation gaps; a previously-serving server whose newer rollout failed stays `:failed` while its old endpoint keeps serving until the cluster converges. `Kura.fail_server/1` is kept only as a same-tick UI fast path.
- Cloudflare DNS-only LB. Endpoint readiness is end-to-end: a server is not projected `:active` until the regional endpoint and the Cloudflare-fronted global endpoint answer `/up`, so the recently added DNS-only proximity-steered LB is respected without the projection needing to know about Cloudflare. Pinned by a test.
- UI. Settings Version column shows the observed image, so rollouts in flight and drift are visible.
- Infra fix (folded from #10852). The Server Kura Regional Deployment workflow failed on `main` with `mise ERROR no task k8s:deploy-kura-regionals found`; that file task is only discovered when `infra/mise.toml` is trusted, which `jdx/mise-action` does not do for nested configs. The step now trusts it. Reproduced/verified locally.
- Simplification pass. Dropped dead `observed_ready_at`; eliminated a recurring per-tick no-op write+broadcast for healthy servers; batched a per-server query into one `DISTINCT ON`; deduped changeset validation; trimmed comments.
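
To make the projection idea concrete, here is a minimal sketch of a pure status derivation. It is not the PR's code: the module name `ProjectionSketch`, the function `project/4`, and the argument shapes are all assumptions made for illustration. It also leaves out how a server first becomes `:failed`, which the deployment loop and `Kura.fail_server/1` own.

```elixir
defmodule ProjectionSketch do
  @moduledoc """
  Illustrative only. The module name, function name, and argument shapes
  are assumptions; they are not taken from the PR's code.
  """

  # Converged and reachable: the observed image equals the latest
  # deployment's image and the endpoints answered. In this sketch it is
  # the only way to become :active, and it also heals a :failed server.
  def project(_previous, image, image, true) when is_binary(image), do: :active

  # Everything else (observation gap, rollout still in flight, endpoints
  # not answering yet) leaves the status untouched, so nothing flaps.
  def project(previous, _desired_image, _observed_image, _ready?), do: previous
end
```

Under these assumptions, `ProjectionSketch.project(:failed, "v2", "v2", true)` returns `:active` (heal-forward once the cluster converges), while `ProjectionSketch.project(:active, "v2", nil, false)` stays `:active` (no flip on an observation gap).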
Deliberately not in scope
A kura-controller Ready condition in the Go controller. status.phase: "Ready" there attests only workload readiness, not public reachability; consuming it would mark servers :active before DNS/TLS/Cloudflare-LB propagation (a regression). The end-to-end probe in the projection is the correct readiness authority and already reconciles the Cloudflare DNS-only LB. A proper controller-attested readiness would have to incorporate Cloudflare LB pool health and is a separate, testable Go change.
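
The end-to-end readiness rule described above could look roughly like the sketch below. The module `ReadinessSketch`, the function `ready?/2`, and the injected `probe` function are hypothetical; the real reconciler's HTTP client and endpoint lookup are not shown in this description.

```elixir
defmodule ReadinessSketch do
  # `probe` is a hypothetical one-argument function that returns true
  # when the given URL answers successfully. It stands in for whatever
  # HTTP client the real reconciler uses.

  # With no endpoints to check, nothing is publicly reachable.
  def ready?([], _probe), do: false

  # Ready only when every endpoint (regional and global) answers /up.
  def ready?(base_urls, probe) when is_list(base_urls) do
    Enum.all?(base_urls, fn base -> probe.(base <> "/up") end)
  end
end
```

Because the check only asks whether each public URL answers, it needs no knowledge of what sits in front of the global endpoint, which is the property the PR relies on for the Cloudflare DNS-only LB.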
Verification caveat
The server test DB is not bootstrapped in this environment, so the suite was not run locally. Everything compiles cleanly under MIX_ENV=test mix compile --warnings-as-errors, and the change is structured so the deployment loop still owns loop-touched servers (the projection skips them), keeping existing reconciler tests’ single-call expectations intact. CI is the gate.
How to test locally
- Server:
cd server && mix ecto.migrate && mix test test/tuist/kura/reconciler_test.exs test/tuist/kura_test.exs
- Infra fix:
WS="$(git rev-parse --show-toplevel)"
mise trust --untrust "$WS/infra/mise.toml"
mise -C "$WS/infra" tasks ls | grep k8s:deploy-kura-regionals # not found
mise trust "$WS/infra/mise.toml"
mise -C "$WS/infra" tasks ls | grep k8s:deploy-kura-regionals # found