feat(server): project Kura server status from observed cluster state

GitHub issue · Closed

Source: tuist/tuist #10851
Updated: Jun 24, 2026
Domains: Kura

Makes kura_servers.status a projection of observed cluster state instead of an independently-mutated state machine, which removes the stranded-:failed bug class by construction, and bundles a related infra fix.

What changed

  • Projection model. New reconciler-written columns observed_image_tag / last_observed_at. Each tick the reconciler projects every present-intent server the deployment loop did not handle: observe the KuraInstance, record the observation, and re-derive status from (latest deployment intent, observed image, endpoint readiness). Infra recovery is reflected with no out-of-band retry, so :failed is no longer a sticky terminal sink.
  • Heal-forward is correct for drift. The desired image is the latest deployment’s image, so a rollout that the controller eventually applies heals forward even though it differs from what the server used to serve (the case that previously left a formerly :active server stranded).
  • No flapping. A serving :active server is not flipped on transient observation gaps; a previously-serving server whose newer rollout failed stays :failed while its old endpoint keeps serving until the cluster converges. Kura.fail_server/1 is kept only as a same-tick UI fast path.
  • Cloudflare DNS-only LB. Endpoint readiness is end-to-end: a server is not projected :active until the regional endpoint and the Cloudflare-fronted global endpoint answer /up, so the recently added DNS-only proximity-steered LB is respected without the projection needing to know about Cloudflare. Pinned by a test.
  • UI. Settings Version column shows the observed image, so rollouts in flight and drift are visible.
  • Infra fix (folded from #10852). The Server Kura Regional Deployment workflow failed on main with mise ERROR no task k8s:deploy-kura-regionals found; that file task is only discovered when infra/mise.toml is trusted, which jdx/mise-action does not do for nested configs. The step now trusts it. Reproduced/verified locally.
  • Simplification pass. Dropped dead observed_ready_at; eliminated a recurring per-tick no-op write+broadcast for healthy servers; batched a per-server query into one DISTINCT ON; deduped changeset validation; trimmed comments.
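
The projection rule in the first three bullets can be sketched as a pure function. This is an illustration only, written in Python rather than the server's Elixir; the names `Observation`, `project_status`, and the plain string statuses are hypothetical and not taken from the PR. It assumes endpoint readiness has already been reduced to one boolean, and it treats a missing observation as a transient gap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """What one reconciler tick saw for a server (hypothetical shape)."""
    image_tag: Optional[str]  # image the cluster is actually running
    endpoints_ready: bool     # regional and global endpoints both answered /up


def project_status(current: str, desired_image: str,
                   observation: Optional[Observation]) -> str:
    """Re-derive a server's status from deployment intent plus observation."""
    # Transient observation gap: keep the current status, so a serving
    # "active" server does not flap.
    if observation is None:
        return current

    converged = (observation.image_tag == desired_image
                 and observation.endpoints_ready)

    # Heal forward: once the cluster runs the latest deployment's image and
    # is reachable end to end, the server is active, even if it was "failed".
    if converged:
        return "active"

    # Not converged yet: a failed rollout stays "failed", and an active
    # server stays "active" while its old endpoint keeps serving.
    return current
```

Because the result depends only on the inputs, a server marked failed recovers on the first tick that observes convergence, with no separate retry path.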

Deliberately not in scope

A kura-controller Ready condition in the Go controller. status.phase: "Ready" there attests only workload readiness, not public reachability; consuming it would mark servers :active before DNS/TLS/Cloudflare-LB propagation (a regression). The end-to-end probe in the projection is the correct readiness authority and already reconciles the Cloudflare DNS-only LB. A proper controller-attested readiness would have to incorporate Cloudflare LB pool health and is a separate, testable Go change.
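
A minimal sketch of what an end-to-end readiness check of this kind could look like, again in Python for illustration rather than the server's Elixir. The function names, the base-URL parameters, and the 2xx success criterion are assumptions; the PR only states that both the regional endpoint and the Cloudflare-fronted global endpoint must answer /up.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True when a GET on url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # DNS failure, TLS error, timeout, non-2xx response, or a bad URL
        # all count as "not ready".
        return False


def endpoints_ready(regional_base: str, global_base: str,
                    probe: Callable[[str], bool] = http_ok) -> bool:
    """Ready only when both public endpoints answer /up (assumed rule)."""
    urls = [base.rstrip("/") + "/up" for base in (regional_base, global_base)]
    return all(probe(url) for url in urls)
```

The probe is injectable so the rule can be tested without network access. Probing through the public hostnames is what lets DNS, TLS, and load-balancer propagation count toward readiness without the caller knowing anything about Cloudflare.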

Verification caveat

The server test DB is not bootstrapped in this environment, so the suite was not run locally. Everything compiles cleanly under MIX_ENV=test mix compile --warnings-as-errors, and the change is structured so the deployment loop still owns loop-touched servers (the projection skips them), keeping existing reconciler tests’ single-call expectations intact. CI is the gate.

How to test locally

  • Server: cd server && mix ecto.migrate && mix test test/tuist/kura/reconciler_test.exs test/tuist/kura_test.exs
  • Infra fix:
      WS="$(git rev-parse --show-toplevel)"
      mise trust --untrust "$WS/infra/mise.toml"
      mise -C "$WS/infra" tasks ls | grep k8s:deploy-kura-regionals # not found
      mise trust "$WS/infra/mise.toml"
      mise -C "$WS/infra" tasks ls | grep k8s:deploy-kura-regionals # found