fix(kura): scope managed-region public hosts per environment

GitHub issue · Closed

Source: tuist/tuist #11270
Updated: Jun 24, 2026
Domains: Kura

What changed

Managed Kura regions now mint environment-scoped public hostnames. Non-production deployments insert an env suffix on the leftmost label (-staging / -canary); production is unchanged:

Env         Host
production  acme-eu-central-1.kura.tuist.dev
staging     acme-eu-central-1-staging.kura.tuist.dev
canary      acme-eu-central-1-canary.kura.tuist.dev

The suffix is derived from Tuist.Environment.env() and woven into the host templates in Tuist.Kura.Regions. Hosts stay inside the existing *.kura.tuist.dev Cloudflare zone, so no new DNS hierarchy is introduced.
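
The host scheme above can be sketched in a few lines. This is an illustration in Python, not the Elixir code in Tuist.Kura.Regions: the template string, the `ENV_SUFFIXES` mapping, and the `public_host` function are hypothetical names, and only the `{env_suffix}` token, the suffix values, and the resulting hostnames come from this description.

```python
# Illustrative sketch only; the real templates live in Tuist.Kura.Regions (Elixir).
# Suffix values are taken from the table above; the mapping name is hypothetical.
ENV_SUFFIXES = {"production": "", "staging": "-staging", "canary": "-canary"}

# Hypothetical template shape. Only the {env_suffix} token name comes from the PR.
PUBLIC_HOST_TEMPLATE = "{account_handle}-{cluster_id}{env_suffix}.kura.tuist.dev"


def public_host(account_handle: str, cluster_id: str, env: str) -> str:
    """Render the public host for one account, cluster, and environment.

    Unknown environments raise KeyError in this sketch; how the real code
    treats them is not stated in the PR.
    """
    return PUBLIC_HOST_TEMPLATE.format(
        account_handle=account_handle,
        cluster_id=cluster_id,
        env_suffix=ENV_SUFFIXES[env],
    )


if __name__ == "__main__":
    for env in ("production", "staging", "canary"):
        print(env, public_host("acme", "eu-central-1", env))
```

Because the suffix is appended to the leftmost label, every rendered host remains a single label under kura.tuist.dev, which is why the existing wildcard zone still covers it.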

Three coordinated changes:

  1. regions.ex: add an {env_suffix} token to the managed-region public/gRPC host templates, filled per environment.
  2. kubernetes_controller.ex: bump @manifest_revision so the reconciler re-applies already-provisioned KuraInstances with the new host.
  3. Migration 20260613120000_prune_stale_kura_cache_endpoints: an env-gated (staging/canary only), one-shot prune of the stale account_cache_endpoints rows the host change leaves behind.

Why / root cause

The host templates were hardcoded to kura.tuist.dev, and the only interpolated variables — account handle and cluster_id — are identical across environments. The managed eu-central region is exposed in staging, canary, and production (TUIST_KURA_AVAILABLE_REGIONS), and its cluster_id is eu-central-1 everywhere. So every environment produced the same hostname for a given account, e.g. tuist-eu-central-1.kura.tuist.dev.

external-dns syncs each cluster’s Ingress hosts into the shared tuist.dev Cloudflare zone, with per-cluster TXT ownership (txtOwnerId=<cluster>-platform, policy: sync). With the same record name desired by multiple clusters, the TXT registry makes the outcome deterministic: whichever cluster registered the record first owns it; the others see a foreign owner and back off. So the name resolved to a single cluster — production, which long predates staging exposing eu-central — and the staging/canary node was shadowed: its UI displayed the host, but the name resolved to production (a cross-environment data path). It did not round-robin or flap.
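
The ownership behavior described above can be modeled with a toy registry. This is a simplified sketch of first-owner-wins semantics, not external-dns's implementation; the `TxtRegistry` class, the owner ids `production-platform` and `staging-platform`, and the load-balancer names are assumptions made for illustration.

```python
class TxtRegistry:
    """Toy model of a shared DNS zone with per-record TXT ownership."""

    def __init__(self):
        # record name -> (owner_id, target)
        self.records = {}

    def sync(self, owner_id, desired):
        """Apply one cluster's desired records and return the names it skipped.

        Mirrors the described `policy: sync` behavior in spirit: a cluster
        creates or updates records it owns, deletes records it owns but no
        longer wants, and backs off from records with a foreign owner.
        """
        # Delete records this owner holds but no longer desires.
        for name, (owner, _target) in list(self.records.items()):
            if owner == owner_id and name not in desired:
                del self.records[name]

        skipped = []
        for name, target in desired.items():
            current = self.records.get(name)
            if current is None or current[0] == owner_id:
                self.records[name] = (owner_id, target)
            else:
                skipped.append(name)  # foreign owner: leave the record alone
        return skipped


if __name__ == "__main__":
    zone = TxtRegistry()
    name = "tuist-eu-central-1.kura.tuist.dev"
    zone.sync("production-platform", {name: "production-lb"})
    skipped = zone.sync("staging-platform", {name: "staging-lb"})
    print("staging skipped:", skipped)
    print("record resolves to:", zone.records[name][1])
```

In this model the second cluster never overwrites the record, which matches the observed outcome: a stable resolution to production rather than flapping.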

Why a template change alone wasn’t enough

  • Reconciler drift detection keys on @manifest_revision, not on a host diff. For an already-active node, the rendered revision still matched the live annotation, so the reconciler took the no-op converge path and never re-applied the new host. Bumping the revision forces apply_current_manifest → re-apply with the new publicHost/grpcPublicHost.
  • Stale cache endpoints. Tuist.Kura.ensure_cache_endpoint/2 upserts with conflict_target: [:account_id, :technology, :url] — the URL is part of the key and the superseded row is never deleted. After the host flips, the account keeps two :kura endpoints: the new suffixed one and the stale …-eu-central-1.kura.tuist.dev one (which the CLI can still resolve → production). The migration prunes the unsuffixed rows on staging/canary.
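
Both gaps can be illustrated with a small model. This is a sketch under stated assumptions, written in Python rather than the project's Elixir: `needs_reapply` and the set-based `ensure_cache_endpoint` are hypothetical stand-ins that mirror only the two properties described above (drift keyed on the revision, and an upsert key that includes the URL).

```python
def needs_reapply(live_revision: str, rendered_revision: str) -> bool:
    """Drift check keyed on the manifest revision only, as described above.

    The rendered host is deliberately not an input: a host change with an
    unchanged revision is invisible to this check.
    """
    return live_revision != rendered_revision


def ensure_cache_endpoint(rows: set, account_id: int, technology: str, url: str) -> None:
    """Upsert keyed on (account_id, technology, url).

    Because the URL is part of the key, a changed URL inserts a second row
    instead of replacing the first.
    """
    rows.add((account_id, technology, url))


if __name__ == "__main__":
    # Same revision before and after the template change: nothing is re-applied.
    print(needs_reapply("old-revision", "old-revision"))
    # After the bump, the revisions differ and the manifest is re-applied.
    print(needs_reapply("old-revision", "2026-06-13-env-scoped-public-host-v1"))

    rows = set()
    ensure_cache_endpoint(rows, 1, "kura", "https://tuist-eu-central-1.kura.tuist.dev")
    ensure_cache_endpoint(rows, 1, "kura", "https://tuist-eu-central-1-staging.kura.tuist.dev")
    print(len(rows), "kura endpoints for account 1")
```

Here `old-revision` is a placeholder, since the previous revision string is not given in this description.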

User / developer impact

  • Staging/canary Kura nodes get their own resolvable hostnames and stop colliding with production.
  • Production behavior is unchanged (no suffix; its existing record stays correct).
  • After deploy, the reconciler re-applies each node; external-dns publishes the new -staging/-canary record to that cluster’s regional LB and cert-manager issues the per-host cert. Expect a brief window in which the new host is not serving while DNS and the certificate propagate; activate_server’s /up probe gates :active, so the node self-heals.

Validation

  • mix test test/tuist/kura/regions_test.exs test/tuist/kura/provisioner/kubernetes_controller_test.exs test/tuist/kura/reconciler_test.exs: 66 passing (added assertions for the -staging/-canary suffix and the production no-suffix case; the manifest-revision assertion goes through the function, so the bump is safe).

  • mix format --check-formatted and mix credo clean on all changed files.

  • Migration runs clean (no-op in non-staging/canary envs). The prune predicate was validated against synthetic URLs, including the edge case of an account handle that itself contains -staging:

    URL                                                      staging action
    https://tuist-eu-central-1.kura.tuist.dev                DELETE
    https://tuist-eu-central-1-staging.kura.tuist.dev        keep
    https://my-staging-eu-central-1.kura.tuist.dev           DELETE
    https://my-staging-eu-central-1-staging.kura.tuist.dev   keep
    https://acme-us-east-1.kura.tuist.dev                    DELETE
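
One predicate that reproduces the table above is sketched below. The actual migration's predicate is not shown in this description, so this is an assumption about its shape: `is_stale` and `MANAGED_ZONE` are hypothetical names, and the rule tested is "the leftmost label does not end with the environment suffix" rather than "the host contains the suffix somewhere".

```python
from urllib.parse import urlparse

MANAGED_ZONE = ".kura.tuist.dev"
PRUNED_ENVS = ("staging", "canary")  # the prune is env-gated to these two


def is_stale(url: str, env: str) -> bool:
    """Return True if `url` is an unsuffixed managed host that `env` should prune."""
    if env not in PRUNED_ENVS:
        return False  # production and other environments never prune
    host = urlparse(url).hostname or ""
    if not host.endswith(MANAGED_ZONE):
        return False  # not a managed Kura host: leave it alone
    leftmost = host[: -len(MANAGED_ZONE)]
    # Anchor on the END of the label so a handle such as "my-staging"
    # is not mistaken for an already-suffixed host.
    return not leftmost.endswith(f"-{env}")


if __name__ == "__main__":
    for url in (
        "https://tuist-eu-central-1.kura.tuist.dev",
        "https://tuist-eu-central-1-staging.kura.tuist.dev",
        "https://my-staging-eu-central-1.kura.tuist.dev",
        "https://my-staging-eu-central-1-staging.kura.tuist.dev",
        "https://acme-us-east-1.kura.tuist.dev",
    ):
        print("DELETE" if is_stale(url, "staging") else "keep", url)
```

A substring check for `-staging` would wrongly keep `my-staging-eu-central-1.kura.tuist.dev`, which is the edge case the table calls out.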

Rollout & cleanup runbook

⚠️ The DB prune runs only on staging/canary (env-gated in the migration). Production legitimately uses …-eu-central-1.kura.tuist.dev — never prune it there.

  1. Deploy (cascades to staging → canary → production). The migration prunes stale rows automatically; the reconciler re-applies each node within a tick or two.
  2. Confirm the live instances took the new host:
    kubectl --context tuist-staging -n kura get kurainstances \
    -o custom-columns=NAME:.metadata.name,HOST:.spec.publicHost,REV:'.metadata.annotations.tuist\.dev/kura-manifest-revision'
    # HOST = *-eu-central-1-staging.kura.tuist.dev, REV = 2026-06-13-env-scoped-public-host-v1
  3. Confirm DNS / endpoint cutover:
    dig +short tuist-eu-central-1-staging.kura.tuist.dev # -> staging kura-eu-central LB IP
    curl -sS -o /dev/null -w '%{http_code}\n' https://tuist-eu-central-1-staging.kura.tuist.dev/up # 200
    dig +short tuist-eu-central-1.kura.tuist.dev # unchanged -> PRODUCTION LB (leave it)
  4. Audit Cloudflare for stragglers. external-dns (policy: sync) auto-deletes the old record for any staging-only handle once its Ingress host flips, and leaves production’s record alone (it never owned it). Manually delete only records that are neither production’s legitimate one nor auto-reclaimed, and check the external-dns/owner= TXT before deleting. Staging external-dns logs show "DELETE" for records it cleaned up and "owner id does not match" for production-owned records it deliberately left in place.
  5. Post-checks: each affected account has exactly one :kura endpoint (the suffixed one); kura_servers.url shows the suffixed host (self-healed on re-activation); the CLI resolves the staging cache to the staging node.

🤖 Generated with Claude Code


Update: endpoint regeneration + staging heal (commit de6b785)

The host rename changes a server’s public_url without changing its image, but the reconciler only re-derived kura_servers.url / the account_cache_endpoints mirror on an image change — so already-active nodes kept the old host, and once the prune migration ran they were left with no :kura endpoint. Two fixes close that gap:

  • URL-aware convergence (reconciler.ex): a server whose stored URL differs from the rendered public_url is routed back through the endpoint-gated activate_server, which re-probes the new host and rewrites the mirror, then settles (no per-tick churn once they match).
  • Superseded-endpoint prune on activation (kura.ex): scoped to the server’s previous URL, so the swap replaces the row instead of accumulating a second, and an account’s endpoints for other regions are untouched.
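
The two fixes can be sketched together as follows. This is a simplified Python model, not the code in reconciler.ex or kura.ex: the `Server` dataclass, `needs_reactivation`, and `activate` are hypothetical names, endpoints are modeled as a set of (account_id, technology, url) tuples, and the /up probe that gates the real activate_server is omitted.

```python
from dataclasses import dataclass


@dataclass
class Server:
    account_id: int
    url: str    # stored kura_servers.url
    image: str  # currently deployed image


def needs_reactivation(server: Server, rendered_url: str, rendered_image: str) -> bool:
    """URL-aware convergence: re-activate on an image change OR a URL change."""
    return server.image != rendered_image or server.url != rendered_url


def activate(server: Server, rendered_url: str, endpoints: set) -> None:
    """Swap the server's mirrored :kura endpoint to the rendered URL.

    The prune is scoped to the server's previous URL, so endpoints for other
    regions and non-kura endpoints on the same account are left untouched.
    """
    previous_url = server.url
    if previous_url != rendered_url:
        endpoints.discard((server.account_id, "kura", previous_url))
    endpoints.add((server.account_id, "kura", rendered_url))
    server.url = rendered_url


if __name__ == "__main__":
    old = "https://tuist-eu-central-1.kura.tuist.dev"
    new = "https://tuist-eu-central-1-staging.kura.tuist.dev"
    server = Server(account_id=1, url=old, image="kura:example")
    endpoints = {(1, "kura", old), (1, "default", "https://cache.example.com")}

    if needs_reactivation(server, new, "kura:example"):
        activate(server, new, endpoints)
    print(sorted(endpoints))
    print("settled:", not needs_reactivation(server, new, "kura:example"))
```

Once the stored URL matches the rendered one, `needs_reactivation` returns False, which corresponds to the "no per-tick churn" property noted above.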

Staging note: staging was deployed with the first commit only, so it is currently half-applied (node on the new host, but the :kura endpoint pruned and not recreated). Deploying this commit heals it on the next reconciler tick — it re-activates, writes the suffixed endpoint, and updates kura_servers.url — provided the new …-staging host is serving. This commit must land before canary/production cascade, since canary hits the identical gap (production is safe: its host is unsuffixed, so public_url is unchanged and the migration is env-gated).

Validation: 146 Kura tests pass — added a reconciler URL-drift re-activation test and two kura.ex prune tests (superseded :kura row replaced; other-region :kura and user :default endpoints left intact); mix format / mix credo clean.

Comments
tuist-atlas[bot] Jun 16, 2026

The fix for scoping managed-region public hosts per environment is now available in xcresult-processor-image@0.21.1. Update to this version to apply the changes.