fix(kura): scope managed-region public hosts per environment
GitHub issue · Closed
What changed
Managed Kura regions now mint environment-scoped public hostnames. Non-production deployments insert an env suffix on the leftmost label (-staging / -canary); production is unchanged:
| Env | Host |
|---|---|
| production | acme-eu-central-1.kura.tuist.dev |
| staging | acme-eu-central-1-staging.kura.tuist.dev |
| canary | acme-eu-central-1-canary.kura.tuist.dev |
The suffix is derived from Tuist.Environment.env() and woven into the host templates in Tuist.Kura.Regions. Hosts stay inside the existing *.kura.tuist.dev Cloudflare zone, so no new DNS hierarchy is introduced.
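A minimal sketch of how such a template could be filled, assuming a simple string-token substitution. The module name, the `{account_handle}` and `{cluster_id}` token names, and the fallback to an empty suffix for any environment other than staging/canary are illustrative assumptions; only the `{env_suffix}` token and the resulting hostnames come from this PR.

```elixir
defmodule KuraHostSketch do
  # Hypothetical template; the real templates live in Tuist.Kura.Regions.
  @public_host_template "{account_handle}-{cluster_id}{env_suffix}.kura.tuist.dev"

  # Staging and canary get a suffix on the leftmost label; production gets none.
  # Mapping every other environment to "" is an assumption of this sketch.
  def env_suffix(:staging), do: "-staging"
  def env_suffix(:canary), do: "-canary"
  def env_suffix(_env), do: ""

  def public_host(account_handle, cluster_id, env) do
    @public_host_template
    |> String.replace("{account_handle}", account_handle)
    |> String.replace("{cluster_id}", cluster_id)
    |> String.replace("{env_suffix}", env_suffix(env))
  end
end

IO.puts(KuraHostSketch.public_host("acme", "eu-central-1", :production))
IO.puts(KuraHostSketch.public_host("acme", "eu-central-1", :staging))
IO.puts(KuraHostSketch.public_host("acme", "eu-central-1", :canary))
```

Because the suffix attaches to the leftmost label, every rendered host is still a single label under `kura.tuist.dev`, which is why the existing zone keeps covering it.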
Three coordinated changes:
- `regions.ex`: `{env_suffix}` token in the managed-region public/gRPC host templates, filled per environment.
- `kubernetes_controller.ex`: bump `@manifest_revision` so the reconciler re-applies already-provisioned `KuraInstance`s with the new host.
- Migration `20260613120000_prune_stale_kura_cache_endpoints`: env-gated (staging/canary only) one-shot prune of the stale `account_cache_endpoints` rows the host change leaves behind.
Why / root cause
The host templates were hardcoded to kura.tuist.dev, and the only interpolated variables — account handle and cluster_id — are identical across environments. The managed eu-central region is exposed in staging, canary, and production (TUIST_KURA_AVAILABLE_REGIONS), and its cluster_id is eu-central-1 everywhere. So every environment produced the same hostname for a given account, e.g. tuist-eu-central-1.kura.tuist.dev.
external-dns syncs each cluster’s Ingress hosts into the shared tuist.dev Cloudflare zone, with per-cluster TXT ownership (txtOwnerId=<cluster>-platform, policy: sync). With the same record name desired by multiple clusters, the TXT registry makes the outcome deterministic: whichever cluster registered the record first owns it; the others see a foreign owner and back off. So the name resolved to a single cluster — production, which long predates staging exposing eu-central — and the staging/canary node was shadowed: its UI displayed the host, but the name resolved to production (a cross-environment data path). It did not round-robin or flap.
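The ownership outcome can be illustrated with a toy model. This is not external-dns code: the module, the claim tuples, and the owner ids (`production-platform`, `staging-platform`, following the `<cluster>-platform` pattern) are assumptions made for the sketch. It only shows why the first registrant keeps the record and later claimants back off.

```elixir
defmodule TxtRegistrySketch do
  # Each claim is {owner_id, record_name}; claims are processed in arrival order.
  def resolve(claims) do
    Enum.reduce(claims, %{}, fn {owner_id, record_name}, registry ->
      # Map.put_new/3 keeps an existing owner, modelling "first registrant wins".
      Map.put_new(registry, record_name, owner_id)
    end)
  end
end

claims = [
  {"production-platform", "tuist-eu-central-1.kura.tuist.dev"},
  {"staging-platform", "tuist-eu-central-1.kura.tuist.dev"}
]

IO.inspect(TxtRegistrySketch.resolve(claims))
```

In this model the staging claim is silently dropped, which matches the shadowing described above: one stable owner, no flapping.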
Why a template change alone wasn’t enough
- Reconciler drift detection keys on `@manifest_revision`, not on a host diff. For an already-active node, the rendered revision still matched the live annotation, so the reconciler took the no-op `converge` path and never re-applied the new host. Bumping the revision forces `apply_current_manifest`, which re-applies with the new `publicHost` / `grpcPublicHost`.
- Stale cache endpoints. `Tuist.Kura.ensure_cache_endpoint/2` upserts with `conflict_target: [:account_id, :technology, :url]`: the URL is part of the key and the superseded row is never deleted. After the host flips, the account keeps two `:kura` endpoints: the new suffixed one and the stale `…-eu-central-1.kura.tuist.dev` one (which the CLI can still resolve → production). The migration prunes the unsuffixed rows on staging/canary.
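The first point can be sketched as a revision-keyed decision. The module, the map shapes, and the `"v1"` / `"v2"` revision strings are hypothetical; the sketch only assumes what is stated above, namely that the comparison looks at the revision and never at the host.

```elixir
defmodule DriftSketch do
  # Revision-keyed drift check: the host is not consulted at all.
  def action(%{revision: live}, %{revision: rendered}) when live == rendered, do: :converge
  def action(_live, _rendered), do: :apply_current_manifest
end

live = %{revision: "v1", public_host: "tuist-eu-central-1.kura.tuist.dev"}
new_host = "tuist-eu-central-1-staging.kura.tuist.dev"

# Template change only: the host differs, but the revision still matches.
IO.inspect(DriftSketch.action(live, %{revision: "v1", public_host: new_host}))

# Revision bumped alongside the template change.
IO.inspect(DriftSketch.action(live, %{revision: "v2", public_host: new_host}))
```

The first call takes the no-op path even though the host changed; only the bumped revision triggers a re-apply.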
User / developer impact
- Staging/canary Kura nodes get their own resolvable hostnames and stop colliding with production.
- Production behavior is unchanged (no suffix; its existing record stays correct).
- After deploy, the reconciler re-applies each node; external-dns publishes the new `-staging` / `-canary` record to that cluster’s regional LB and cert-manager issues the per-host cert. Expect a brief not-serving window on the new host while DNS and the cert propagate; `activate_server`’s `/up` probe gates `:active`, so it self-heals.
Validation
- `mix test test/tuist/kura/regions_test.exs test/tuist/kura/provisioner/kubernetes_controller_test.exs test/tuist/kura/reconciler_test.exs` → 66 passing (added assertions for the `-staging` / `-canary` suffix and the production no-suffix case; the manifest-revision assertion goes through the function, so the bump is safe).
- `mix format --check-formatted` and `mix credo` clean on all changed files.
- Migration runs clean (no-op in non-staging/canary envs). The prune predicate was validated against synthetic URLs, including the edge case of an account handle that itself contains `-staging`:

| URL | Staging action |
|---|---|
| https://tuist-eu-central-1.kura.tuist.dev | DELETE |
| https://tuist-eu-central-1-staging.kura.tuist.dev | keep |
| https://my-staging-eu-central-1.kura.tuist.dev | DELETE |
| https://my-staging-eu-central-1-staging.kura.tuist.dev | keep |
| https://acme-us-east-1.kura.tuist.dev | DELETE |
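One predicate that reproduces the decisions listed for the synthetic URLs above is sketched here. It is an assumption, not the migration's actual code (which is not shown in this description): it treats a URL as stale when the leftmost host label does not end with the environment's suffix.

```elixir
defmodule PrunePredicateSketch do
  # Hypothetical predicate: stale when the leftmost host label lacks the env suffix.
  def stale?(url, env_suffix) do
    %URI{host: host} = URI.parse(url)
    [leftmost | _rest] = String.split(host, ".")
    not String.ends_with?(leftmost, env_suffix)
  end
end

urls = [
  "https://tuist-eu-central-1.kura.tuist.dev",
  "https://tuist-eu-central-1-staging.kura.tuist.dev",
  "https://my-staging-eu-central-1.kura.tuist.dev",
  "https://my-staging-eu-central-1-staging.kura.tuist.dev",
  "https://acme-us-east-1.kura.tuist.dev"
]

for url <- urls do
  action = if PrunePredicateSketch.stale?(url, "-staging"), do: "DELETE", else: "keep"
  IO.puts("#{url} #{action}")
end
```

Anchoring the check to the end of the leftmost label, rather than searching for `-staging` anywhere in the host, is what keeps a handle such as `my-staging` from being mistaken for an already-suffixed host.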
Rollout & cleanup runbook
⚠️ The DB prune runs only on staging/canary (env-gated in the migration). Production legitimately uses `…-eu-central-1.kura.tuist.dev`; never prune it there.
- Deploy (cascades to staging → canary → production). The migration prunes stale rows automatically; the reconciler re-applies each node within a tick or two.
- Confirm the live instances took the new host:

      kubectl --context tuist-staging -n kura get kurainstances \
        -o custom-columns=NAME:.metadata.name,HOST:.spec.publicHost,REV:'.metadata.annotations.tuist\.dev/kura-manifest-revision'
      # HOST = *-eu-central-1-staging.kura.tuist.dev, REV = 2026-06-13-env-scoped-public-host-v1
- Confirm DNS / endpoint cutover:

      dig +short tuist-eu-central-1-staging.kura.tuist.dev   # -> staging kura-eu-central LB IP
      curl -sS -o /dev/null -w '%{http_code}\n' https://tuist-eu-central-1-staging.kura.tuist.dev/up   # 200
      dig +short tuist-eu-central-1.kura.tuist.dev           # unchanged -> PRODUCTION LB (leave it)
- Audit Cloudflare for stragglers. external-dns (`policy: sync`) auto-deletes the old record for any staging-only handle once its Ingress host flips, and leaves production’s record alone (it never owned it). Manually delete only records that are neither production’s legitimate one nor auto-reclaimed; check the `external-dns/owner=` TXT before deleting. Staging external-dns logs show `DELETE` for records it cleaned and `owner id does not match` for production-owned records it deliberately left.
- Post-checks: each affected account has exactly one `:kura` endpoint (the suffixed one); `kura_servers.url` shows the suffixed host (self-healed on re-activation); the CLI resolves the staging cache to the staging node.
🤖 Generated with Claude Code
Update: endpoint regeneration + staging heal (commit de6b785)
The host rename changes a server’s public_url without changing its image, but the reconciler only re-derived kura_servers.url / the account_cache_endpoints mirror on an image change — so already-active nodes kept the old host, and once the prune migration ran they were left with no :kura endpoint. Two fixes close that gap:
- URL-aware convergence (`reconciler.ex`): a server whose stored URL differs from the rendered `public_url` is routed back through the endpoint-gated `activate_server`, which re-probes the new host and rewrites the mirror, then settles (no per-tick churn once they match).
- Superseded-endpoint prune on activation (`kura.ex`): scoped to the server’s previous URL, so the swap replaces the row instead of accumulating a second, and an account’s endpoints for other regions are untouched.
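Both fixes can be sketched together under stated assumptions. The module, function names, and in-memory endpoint maps below are hypothetical stand-ins for the reconciler and the database rows; the sketch shows only the decision (URL mismatch routes back to activation) and the scoping (only the row at the previous URL is replaced).

```elixir
defmodule UrlDriftSketch do
  # A stored URL that matches the rendered public_url converges with no work;
  # any mismatch is routed back through activation.
  def next_step(%{url: stored_url}, rendered_public_url) when stored_url == rendered_public_url,
    do: :converge

  def next_step(_server, _rendered_public_url), do: :activate_server

  # Endpoint swap scoped to the server's previous URL: only that :kura row is replaced.
  def swap_endpoint(endpoints, previous_url, new_url) do
    endpoints
    |> Enum.reject(&(&1.technology == :kura and &1.url == previous_url))
    |> Kernel.++([%{technology: :kura, url: new_url}])
  end
end

old_url = "https://tuist-eu-central-1.kura.tuist.dev"
new_url = "https://tuist-eu-central-1-staging.kura.tuist.dev"

endpoints = [
  %{technology: :kura, url: old_url},
  %{technology: :kura, url: "https://tuist-us-east-1-staging.kura.tuist.dev"},
  %{technology: :default, url: "https://cache.example.com"}
]

IO.inspect(UrlDriftSketch.next_step(%{url: old_url}, new_url))
IO.inspect(UrlDriftSketch.next_step(%{url: new_url}, new_url))
IO.inspect(UrlDriftSketch.swap_endpoint(endpoints, old_url, new_url))
```

After the swap the account still has its other-region `:kura` endpoint and its `:default` endpoint, and the second `next_step` call shows why the reconciler stops re-activating once the stored URL matches.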
Staging note: staging was deployed with the first commit only, so it is currently half-applied (node on the new host, but the :kura endpoint pruned and not recreated). Deploying this commit heals it on the next reconciler tick — it re-activates, writes the suffixed endpoint, and updates kura_servers.url — provided the new …-staging host is serving. This commit must land before canary/production cascade, since canary hits the identical gap (production is safe: its host is unsuffixed, so public_url is unchanged and the migration is env-gated).
Validation: 146 Kura tests pass — added a reconciler URL-drift re-activation test and two kura.ex prune tests (superseded :kura row replaced; other-region :kura and user :default endpoints left intact); mix format / mix credo clean.