feat(kura): consolidate regional deployment

GitHub issue · Closed

Source: tuist/tuist #11192

Updated: Jun 24, 2026

Domains: Kura

Details

What Changed

Move managed Kura regions into the main Tuist workload clusters as dedicated regional node pools instead of standalone regional clusters.
Replace per-customer Hetzner LoadBalancer Services with shared regional Kura ingress gateways while preserving customer-specific Kura hostnames such as kura-cache-us-east.tuist.dev.
Add dedicated regional Kura ingress-nginx controllers:
- production: kura-eu-central in fsn1, kura-us-east in ash, and kura-us-west in hil.
- staging/canary: kura-eu-central in fsn1.
Route Kura instances through KuraInstance.spec.ingressClassName instead of the main Tuist web ingress.
Add a server-driven KuraGateway CRD for dedicated enterprise capacity. The controller reconciles the dedicated ingress-nginx Deployment, Hetzner LoadBalancer Service, IngressClass, and cleanup lifecycle.
Keep dedicated gateway names opaque and stable, for example kgw-<account-hash>-us-east, so customer names do not land in Helm values.
Promote hosted active Enterprise accounts to dedicated gateways through Entitlements.allows?(account, :dedicated_kura_gateway), with TUIST_KURA_DEDICATED_GATEWAY_ACCOUNTS as a manual override.
Move Kura public HTTPS/gRPC TLS termination to ingress-nginx and keep Kura runtime pods on in-cluster HTTP/gRPC plus peer mTLS.
Use an internal account-scoped peer Service for cross-region Kura discovery and remove public peer gateway/LB fields.
Source the Kura OAuth control-plane client id and secret from ESO in managed environments, using only the canonical KURA_CONTROL_PLANE_CLIENT_ID / KURA_CONTROL_PLANE_CLIENT_SECRET pair in Kubernetes manifests.
Remove regional Kura deployment automation, regional CAPI cluster manifests, regional kubeconfig sync, and regional platform overlays.
Validate staging with a temporary smoke workflow that provisions Kura through the live server, validates shared and dedicated gateway assignment, and performs authenticated public cache upload/download roundtrips.

Why

Kura regional clusters and one LB per customer add operational surface area before Kura serves any customer workflows. Since Kura regions can run as regional node pools under the main Tuist workload clusters, the simpler default is one umbrella cluster with regional Kura node pools and regional Kura gateways.

The important design correction is that Kura artifact bytes should not share the main Tuist web ingress dataplane. If all tenants go through the same web ingress-nginx controller, aggregate cache bandwidth is capped by that controller’s node NIC/CPU path before it reaches Kura. This PR keeps public LBs shared by default, but gives Kura its own regional ingress-nginx dataplane.

Bandwidth Model

The default traffic path becomes:

client -> regional Hetzner Kura LB -> regional Kura ingress-nginx -> selected Kura pod

It is no longer:

client -> main Tuist web ingress-nginx -> Kura pod

and it is not one default public LB per customer.

The regional Kura gateway is still in the byte path. The regional capacity ceiling is roughly:

min(
  Hetzner regional LB capacity,
  NIC/CPU of nodes running the regional Kura ingress pods,
  nginx worker capacity,
  CNI forwarding capacity,
  selected Kura pod/node capacity
)
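
A minimal Python sketch of the ceiling above, assuming each stage can be summarized as a single aggregate Gbps figure. The function name, the stage keys, and every number are illustrative placeholders, not measurements from this PR.

```python
def regional_ceiling_gbps(stage_capacities_gbps: dict[str, float]) -> tuple[str, float]:
    """Return (bottleneck stage, ceiling in Gbps) for a serial byte path.

    Every byte passes through every stage, so the slowest stage sets the ceiling.
    """
    if not stage_capacities_gbps:
        raise ValueError("at least one stage is required")
    stage = min(stage_capacities_gbps, key=stage_capacities_gbps.get)
    return stage, stage_capacities_gbps[stage]


# Placeholder capacities for illustration only (assumed, not benchmarked).
example = {
    "hetzner_regional_lb": 10.0,
    "ingress_node_nic_cpu": 8.0,
    "nginx_workers": 6.0,
    "cni_forwarding": 9.0,
    "kura_pod_node": 7.0,
}

bottleneck, ceiling = regional_ceiling_gbps(example)
print(f"bottleneck={bottleneck} ceiling={ceiling} Gbps")
```

With real benchmark numbers in place of the placeholders, the same shape shows which stage to scale first.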

The Kura ingress annotations disable request and response buffering (proxy-request-buffering: off, proxy-buffering: off, proxy-body-size: 0) so large artifacts stream through nginx instead of being spooled by the gateway.

Hetzner does not publish a guaranteed per-LB bandwidth number. Their LB docs say bandwidth is not explicitly limited, but LBs scheduled on the same node share a 2x10Gbit interface. For planning, treat one shared regional LB plus ingress-nginx shard as a single-digit-Gbps capacity unit until it is benchmarked, with low tens of Gbps as an optimistic ceiling rather than an SLO.

The latest observed regional peak is around 600 MB/s:

600 MB/s ~= 4,800 Mbps ~= 4.8 Gbps

That is already multi-Gbps traffic. It does not justify one public LB per customer by default, but it does mean regional gateway metrics and sharding need to exist before Kura becomes customer-critical:

1 Gbps effective regional shard   ~= 0.2x observed peak
5 Gbps effective regional shard   ~= 1.0x observed peak
10 Gbps effective regional shard  ~= 2.1x observed peak
20 Gbps theoretical LB interface  ~= 4.2x observed peak
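
The conversion and shard ratios above can be reproduced with a short Python sketch. It assumes decimal units (1 MB = 10^6 bytes, 1 Gbps = 10^9 bits per second); the helper name is illustrative.

```python
def mb_per_s_to_gbps(mb_per_s: float) -> float:
    """Convert decimal megabytes per second to gigabits per second."""
    return mb_per_s * 8 / 1000


# Observed regional peak quoted in the text.
observed_peak_gbps = mb_per_s_to_gbps(600)

# Effective shard capacities from the table, expressed as multiples of the peak.
for shard_gbps in (1, 5, 10, 20):
    ratio = shard_gbps / observed_peak_gbps
    print(f"{shard_gbps:>2} Gbps shard ~= {ratio:.1f}x observed peak")
```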

Large enterprise cache bursts can exceed that quickly:

50 CI jobs * 500 MB over 2 minutes ~= 1.7 Gbps
100 CI jobs * 1 GB over 5 minutes  ~= 2.7 Gbps
500 CI jobs * 2 GB over 2 minutes  ~= 67 Gbps

Those bursts should drive gateway scaling, sharding, and dedicated capacity decisions. They should not make one-LB-per-customer the default topology.
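
The burst figures can be checked with a small Python sketch. It assumes decimal gigabytes and traffic spread evenly over the window, so it gives an average rate; real bursts would peak higher. The function name is illustrative.

```python
def burst_gbps(jobs: int, gb_per_job: float, window_seconds: float) -> float:
    """Average Gbps when `jobs` each transfer `gb_per_job` GB evenly over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return jobs * gb_per_job * 8 / window_seconds


# The three scenarios from the text: (jobs, GB per job, window in seconds).
scenarios = [(50, 0.5, 120), (100, 1.0, 300), (500, 2.0, 120)]
for jobs, gb, window in scenarios:
    print(f"{jobs} jobs * {gb} GB over {window}s ~= {burst_gbps(jobs, gb, window):.1f} Gbps")
```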

Scaling Path

The intended scaling ladder is:

Shared regional Hetzner Kura LB plus a shared, Kura-only regional ingress-nginx class.
More regional Kura ingress-nginx replicas and larger or dedicated gateway nodes.
Additional regional Kura gateway shards when one regional LB/controller pair becomes the bottleneck.
Dedicated customer ingress class plus Hetzner LB when one customer regularly consumes a meaningful fraction of a shard or needs contractual isolation.

Multiple ingress-nginx replicas can sit behind the same regional LB and spread gateway work across more nodes. The cluster autoscaler can add Kura-pool nodes when extra gateway replicas cannot schedule. A future HPA/KEDA policy can scale gateway replicas from CPU or throughput metrics, but the first production version keeps the topology explicit and observable.

If one regional LB itself becomes the bottleneck, the next step is another regional shard: a second ingress class, another ingress-nginx controller, and another regional Hetzner LB. A dedicated customer endpoint is the same mechanism scoped to one account instead of a shard of accounts.

Dedicated Enterprise Capacity

This PR implements the shared default and the dedicated escape hatch.

Shared accounts get the regional ingress class:

ingressClassName: kura-us-east

Dedicated accounts get a server-created gateway resource with an opaque account-region id:

KuraGateway -> ingress-nginx Deployment + LoadBalancer Service + IngressClass
KuraInstance.spec.ingressClassName -> dedicated IngressClass

The platform chart installs only shared regional gateways and the generic controller/RBAC required to reconcile dynamic gateways. Customer-specific exposure is driven by server policy:

account + region -> shared gateway | dedicated gateway

The assignment logic is:

dedicated gateway =
  account handle override
  OR (Tuist-hosted AND Entitlements.allows?(account, :dedicated_kura_gateway))

Today, hosted active Enterprise plans receive the feature automatically. Non-Enterprise plans fail closed for unknown features.
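
The assignment rule can be sketched in Python as follows. This is not the server's Elixir implementation: the `Account` type, the `gateway_kind` and `parse_overrides` helpers, and the comma-separated format assumed for TUIST_KURA_DEDICATED_GATEWAY_ACCOUNTS are all hypothetical, chosen only to illustrate the override-or-entitlement logic and the fail-closed default.

```python
from dataclasses import dataclass

DEDICATED_FEATURE = "dedicated_kura_gateway"


@dataclass(frozen=True)
class Account:
    handle: str
    # Features the account's plan allows; empty means nothing is allowed.
    entitlements: frozenset[str] = frozenset()


def parse_overrides(raw: str) -> frozenset[str]:
    """Parse an override list. A comma-separated format is an assumption."""
    return frozenset(h.strip() for h in raw.split(",") if h.strip())


def gateway_kind(account: Account, *, tuist_hosted: bool, overrides: frozenset[str]) -> str:
    """Return "dedicated" or "shared" for one account in one region."""
    if account.handle in overrides:
        return "dedicated"  # manual override wins
    if tuist_hosted and DEDICATED_FEATURE in account.entitlements:
        return "dedicated"  # entitlement-driven promotion
    return "shared"  # fail closed: unknown or missing feature stays shared


enterprise = Account("acme", frozenset({DEDICATED_FEATURE}))
print(gateway_kind(enterprise, tuist_hosted=True, overrides=frozenset()))
```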

TLS And Secret Simplification

With nginx in front of Kura, public TLS termination moves out of the Kura runtime:

public HTTPS/gRPC TLS: cert-manager -> ingress-nginx
Kura runtime: plain HTTP/gRPC inside the cluster
peer replication: Kura internal mTLS

Kura pods no longer mount public/gRPC TLS Secrets or receive KURA_PUBLIC_TLS_* / KURA_GRPC_TLS_* env vars. The runtime keeps only peer mTLS material for pod-to-pod replication.

The Kura OAuth introspection/control-plane client is now canonicalized around:

KURA_CONTROL_PLANE_CLIENT_ID
KURA_CONTROL_PLANE_CLIENT_SECRET

Managed deployments sync those fields from the kura-introspection 1Password item into kura-shared-secrets in both the Kura namespace and the server namespace. Helm no longer generates a fallback secret, no longer reads existing Secrets with lookup, and no longer stamps the legacy KURA_EXTENSION_TUIST_INTROSPECT_* or TUIST_KURA_INTROSPECTION_* aliases into managed Kubernetes manifests.

For self-hosted/local chart users, kuraController.sharedSecrets.kuraIntrospection.clientId and clientSecret can still be supplied inline. If introspection is enabled without inline values and without sharedSecrets.externalSecret.enabled, Helm fails fast.
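
The fail-fast rule for chart values can be modeled with a short Python sketch. The real check lives in the Helm templates; the function below and its parameter names are hypothetical, and the precedence of the external secret over inline values when both are set is an assumption.

```python
from typing import Optional


def resolve_introspection_source(
    enabled: bool,
    client_id: Optional[str],
    client_secret: Optional[str],
    external_secret_enabled: bool,
) -> str:
    """Decide where the Kura introspection credentials come from, or fail fast."""
    if not enabled:
        return "disabled"
    if external_secret_enabled:
        return "external-secret"  # assumed to take precedence over inline values
    if client_id and client_secret:
        return "inline"
    # Neither source is configured: refuse to render instead of generating a fallback.
    raise ValueError(
        "introspection is enabled but neither inline clientId/clientSecret "
        "nor sharedSecrets.externalSecret.enabled is set"
    )


print(resolve_introspection_source(True, None, None, True))
```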

Managed regions also stop setting a shared spec.peerTLSSecretName by default. That avoids mounting a missing shared secret and avoids the deeper correctness issue where a shared certificate would not match the current per-instance KURA_NODE_URL pod DNS names.

Latency Impact

Compared with the current per-customer LB shape, this PR adds one in-cluster nginx hop:

Before: client -> customer Hetzner LB -> Kura pod
After:  client -> regional Hetzner LB -> regional Kura ingress-nginx -> Kura pod

Because the Kura ingress controllers are pinned to the same regional Kura node pools, that extra hop stays inside the regional cluster path. At healthy utilization, latency impact should be small relative to client-to-region RTT and sustained transfer time.

The bigger latency risk is saturation. If a shared regional gateway approaches its NIC/CPU/nginx limit, p95/p99 latency and tail transfer time degrade for every tenant on that shard. That is why Kura gets a dedicated regional ingress dataplane instead of sharing the main web ingress, and why the scaling path moves from replicas, to shards, to dedicated customer endpoints.

How Similar Kubernetes Services Usually Solve This

Bandwidth-heavy multi-tenant Kubernetes services usually centralize public ingress by region rather than giving every tenant a cloud LB:

A small number of regional Envoy/nginx/HAProxy/Gateway API dataplanes own public IPs.
Tenant identity routes by host, SNI, path, or auth metadata.
Workloads stay behind ClusterIP/headless Services and are selected by the gateway/controller.
Gateway capacity scales through replicas, node pool size, gateway shards, or separate ingress classes.
Dedicated public endpoints are reserved for isolation, extreme traffic, or enterprise contract cases.

Object-storage-like systems often push large transfers directly to storage with signed URLs. Kura is different because the node participates in cache lookup, peer replication, and serving. The Kubernetes-native default for us is therefore a regional Kura gateway in front of Kura pods, with streaming enabled and buffering disabled. A future direct-transfer design could remove the gateway from the artifact byte path if benchmarks show that hop is the bottleneck.

Control Plane Tradeoff

Flattening Kura regions into tuist-k8s-production also flattens the control-plane blast radius. The regional data plane remains regional through node pools, Services, LBs, and ingress classes, but the Kubernetes API/control plane for reconciliation is now shared. If the production cluster control plane or its primary region is unavailable, Kura rollouts, new gateway assignments, endpoint updates, and manual operational changes can be blocked across all Kura regions even if already-running pods continue serving.

That is an intentional tradeoff for this phase because Kura is not yet customer-critical and the standalone regional clusters were adding more operational cost than resilience. Once Kura is on customer workflows, we should revisit whether the production cluster control plane needs stronger multi-region guarantees or whether Kura should be split again along a smaller number of regional failure domains.

The platform install path no longer derives the platform ingress LB region from Nodes. Managed platform overlays pin the main platform ingress LB explicitly to fsn1, matching the general worker pools, while Kura regional LBs are pinned separately to their regional pools.

Operational Rollout

Kura is not currently used for customer workflows, so this can ship as one deploy instead of a gradual customer migration.

The PR removes the regional cluster manifests and automation, but it does not directly delete already-created infrastructure. After production deploy, the manual ops checklist is:

Verify production created the regional Kura ingress controllers and LoadBalancers in the main production cluster.
Verify Kura instances exist in the main production cluster on the new Kura node pools.
Verify customer hostnames resolve through the matching regional Kura ingress and /ready responds.
Delete old Kura resources and LB Services in the old regional clusters so Hetzner LBs and external-dns records are cleaned up.
Delete the old regional CAPI clusters from management.
Verify no orphaned Hetzner LBs, volumes, servers, or Cloudflare DNS/TXT records remain.

Existing deployed Kura nodes are refreshed by a Kura manifest revision bump. That forces the reconciler to reapply the desired KuraInstance spec under the new umbrella-cluster topology and picks up the new shared introspection secret wiring.

Managed deploys now also require the kura-introspection external secret item to exist with fields KURA_CONTROL_PLANE_CLIENT_ID and KURA_CONTROL_PLANE_CLIENT_SECRET before Helm waits on ESO.

Staging Issue Found And Fixed Earlier In The PR

The first public Kura smoke reached both regional and dedicated gateways, but cache requests returned 503 Authentication backend unavailable. The gateway and Kura pods were reachable; /ready was green.

The root cause was the default Kura OAuth client id. The chart used kura-control-plane, while Boruta validates OAuth client_id values as UUIDs. Tuist therefore returned 400 invalid_request on /oauth2/introspect, which Kura correctly treated as an unavailable authentication backend.

The earlier fix made the client id UUID-shaped and explicit. The latest review follow-up removes the remaining sentinel/default behavior from managed values entirely: the id and secret now come from ESO rather than a Helm default or generated Secret.
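
The root cause is easy to reproduce in isolation. The Python sketch below uses the standard `uuid` module as a stand-in validator; Boruta's actual validation rules may be stricter (Python also accepts UUIDs without hyphens or wrapped in braces), so this only illustrates why a descriptive string such as kura-control-plane is rejected.

```python
import uuid


def is_uuid(value: str) -> bool:
    """Return True if `value` parses as a UUID according to Python's uuid module."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


print(is_uuid("kura-control-plane"))  # the old chart default
print(is_uuid(str(uuid.uuid4())))     # a UUID-shaped client id
```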

Latest Review Follow-up

Addressed the latest inline review by:

removing platform ingress LB region derivation entirely; managed platform overlays now pin the main ingress location explicitly to fsn1, matching the general worker pools.
documenting the shared control-plane blast radius in this PR description.
replacing the Kura introspection lookup/randAlphaNum fallback with explicit inline values or ESO-sourced values.
reducing managed Kura OAuth env wiring to the canonical KURA_CONTROL_PLANE_CLIENT_ID / KURA_CONTROL_PLANE_CLIENT_SECRET pair.
collapsing the managed region constructors into a data table and adding a drift test that cross-checks the server catalog against platform ingress values and production node-pool labels.
persisting KuraGateway.status.ingressClassName immediately after the cluster-scoped IngressClass is reconciled.
making legacy gRPC/peer Service cleanup read-before-delete instead of issuing repeated expected DELETE 404s.
collapsing duplicate gateway workload name helpers into one gatewayWorkloadName helper.
simplifying Kubernetes client wrapper calls now that Tuist.Kubernetes.Client defaults opts internally.

Impact

Production deploys carry Kura controller/runtime/platform changes through the main server deployment path.
Regional Kura nodes scale as main-cluster regional node pools.
Kura traffic goes through shared regional Hetzner LBs and Kura-only regional ingress-nginx controllers, with buffering disabled for artifact streaming.
Hosted Enterprise accounts can get server-driven dedicated Kura ingress-nginx plus Hetzner LB infrastructure immediately, without customer-specific Helm values.
Server pods no longer need regional Kura kubeconfig secrets.
Existing Kura server rows with succeeded intent can be recreated in the main cluster if their backing KuraInstance is missing.

Validation

Latest review follow-up:

Ran gofmt on the touched Kura controller files.
Ran go test ./... in infra/kura-controller.
Ran mix format for touched Elixir files.
Ran mix test test/tuist/kura/provisioner/kubernetes_controller_test.exs test/tuist/kura/regions_test.exs successfully: 40 tests, 0 failures.
Rendered the managed staging Tuist chart with dummy render-only runner/image overrides.
Rendered templates/kura-controller.yaml and verified the managed kura-shared-secrets ExternalSecret exists in both kura and server namespaces and syncs only canonical Kura control-plane keys.
Rendered templates/server-deployment.yaml and verified the server reads only KURA_CONTROL_PLANE_CLIENT_ID and KURA_CONTROL_PLANE_CLIENT_SECRET from kura-shared-secrets.
Ran git diff --check successfully.
Ran bash -n infra/mise/tasks/k8s/install-platform.sh successfully after removing node-derived LB region logic.
Ran helm lint infra/helm/platform with the production, staging, and canary platform overlays.
Rendered the main platform ingress Service for production, staging, and canary and verified it keeps load-balancer.hetzner.cloud/location: fsn1 plus the deterministic cluster LB name.

Earlier validation in this PR:

Ran focused server tests successfully:
- mix test test/tuist/oauth/clients_test.exs test/tuist_web/controllers/internal/kura_usage_controller_test.exs test/tuist_web/controllers/oauth/introspect_controller_test.exs test/tuist/kura/provisioner/kubernetes_controller_test.exs test/tuist/kura/reconciler_test.exs test/tuist/billing/entitlements_test.exs
- Result: 65 tests, 0 failures.
Ran Helm lint successfully for staging, canary, and production Tuist managed overlays.
Ran the Kura staging smoke as a temporary tuist-ops.yml workflow on throwaway runner branches so the PR does not add a permanent manual workflow.
Deployed staging successfully after the generated shared-secret change: https://github.com/tuist/tuist/actions/runs/27245069807
Deployed staging successfully after the manifest refresh change: https://github.com/tuist/tuist/actions/runs/27246779457
Deployed staging successfully after the UUID OAuth client id fix: https://github.com/tuist/tuist/actions/runs/27247729105
Verified staging /oauth2/introspect accepted the UUID-shaped client id and reached credential validation (401 invalid_client with a bogus secret, instead of 400 invalid_request).
Ran the Kura staging public smoke workflow successfully for both matrix paths: https://github.com/tuist/tuist/actions/runs/27248316373
- Public Kura cache roundtrip (regional): success.
- Public Kura cache roundtrip (dedicated): success.
Verified outside-in bogus-token probes returned 401 Invalid or expired token from both regional and dedicated Kura endpoints instead of 503 Authentication backend unavailable.

Comments

tuist-atlas[bot] Jun 11, 2026

The changes from this pull request are now available in version xcresult-processor-image@0.17.0. Update to this version to use these changes.