feat(server): self-hosted Kura nodes with per-account mesh CA enrollment

GitHub issue · Closed

Metadata
Source: tuist/tuist #11294
Updated: Jun 24, 2026
Domains: Kura

Details

Lets a customer run their own Kura cache nodes, authorized by Tuist’s control plane, and bridge them into the same mutually-authenticated mesh as their Tuist-managed nodes. Typically the nodes sit next to a customer’s runners, so writes against the self-hosted nodes replicate into the shared managed cache, and a node warms from that cache when it joins. A customer never has to choose between owning the whole mesh and running a single node; spinning one up takes only a credential and a URL.

Why

Customers want Kura cache nodes in their own infrastructure (locality, data residency). The control plane should own authentication, certificate issuance, and discovery so the customer supplies only a credential and a URL.

Two load-bearing constraints shaped the design:

  • Kura’s hot-path auth first tries to verify a Tuist Guardian JWT locally using a symmetric HS512 secret. That secret can mint tokens for any tenant, so it cannot be handed to a customer. Self-hosted nodes run without the verifier and authorize uncached requests via the control plane’s introspection endpoint; making that introspection tenant-scoped is the security boundary.
  • Kura’s peer mTLS verifier trusts any client cert that chains to the configured CA, with no identity check. The CA is the trust boundary, so isolation comes from a per-account CA.
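
The first constraint follows from how HMAC-based signatures work: the key that verifies a token is the same key that signs one. A minimal sketch using only the Python standard library, assuming a bare HS512 JWT layout; the claim name and the secret value are hypothetical and not taken from Tuist's Guardian configuration:

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs512(claims: dict, secret: bytes) -> str:
    """Build a compact HS512 JWT from the given claims."""
    header = {"alg": "HS512", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":"), sort_keys=True).encode())
        for part in (header, claims)
    )
    mac = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha512).digest()
    return f"{signing_input}.{b64url(mac)}"


def verify_hs512(token: str, secret: bytes) -> bool:
    """Recompute the MAC with the shared secret and compare in constant time."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha512).digest()
    return hmac.compare_digest(b64url(expected), signature)


# Hypothetical secret: anyone who can verify with it can also sign with it.
SHARED_SECRET = b"illustrative-secret"

# A holder of the verifier secret can mint a token for a tenant it does not own.
forged = sign_hs512({"sub": "account:some-other-tenant"}, SHARED_SECRET)
print(verify_hs512(forged, SHARED_SECRET))
```

Because verification and minting are the same capability here, withholding the secret from self-hosted nodes and routing them through introspection is the only way to keep them tenant-scoped.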

What changed

Control-plane authorization

  • Tenant-scoped credential: per-account kura_self_hosted_clients (Bcrypt at rest with the standard pepper, plaintext shown once, a suffix-masked hint thereafter), issued and revoked from the account’s Cache page.
  • Per-account introspection and usage constraint: both endpoints distinguish the unconstrained Tuist control-plane client from a tenant-scoped client, so a self-hosted node can only introspect its own tenant’s tokens and report its own usage.
  • Lease-based registration: self-hosted nodes self-register their client-facing URL via heartbeats, and that URL is the endpoint the CLI routes to (resolved under the :kura technology). A node ages out of rotation when heartbeats stop. There is no manually-managed endpoint list; the registered node is the endpoint.
  • Cache-API authentication: the Tuist auth hook ships in the Kura image, so a self-hosted node enables it with KURA_EXTENSION_* to require a valid Tuist token on every cache request, authorized via tenant-scoped introspection (no symmetric verifier secret, fail-closed). Covered by introspection-only tests against the real hook; managed pods keep their ConfigMap-mounted hook, which shadows the bundled file.
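
The lease-based registration above can be pictured as a table of last-seen timestamps with a time-to-live. The following is an illustrative sketch only: the class name, the TTL value, and the method names are assumptions, and the real server-side implementation is not shown in this issue.

```python
class LeaseRegistry:
    """Tracks self-registered node URLs per account; entries expire without heartbeats."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._last_seen = {}  # (account, node_url) -> timestamp of last heartbeat

    def heartbeat(self, account: str, node_url: str, now: float) -> None:
        # Registering and renewing are the same operation: record the latest heartbeat.
        self._last_seen[(account, node_url)] = now

    def endpoints(self, account: str, now: float) -> list:
        # Only nodes whose lease has not expired are eligible for routing.
        return sorted(
            url
            for (owner, url), seen in self._last_seen.items()
            if owner == account and now - seen <= self.ttl_seconds
        )


registry = LeaseRegistry(ttl_seconds=30)  # hypothetical TTL
registry.heartbeat("acme", "https://kura.acme.example", now=0)
print(registry.endpoints("acme", now=10))  # lease still valid
print(registry.endpoints("acme", now=31))  # lease expired, node aged out
```

Under this model there is nothing to clean up by hand: a node that stops sending heartbeats simply stops appearing in the list of endpoints.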

Enrollment (controller-managed per-account CA)

  • The Kura controller owns the per-account peer CA (the kura-<handle>-peer-ca secret). Mesh.enroll_node and sign_node_certificate read that secret and sign the node’s CSR from it with an issuer-controlled SAN, so the node’s leaf is trusted by the managed pods (same CA) and vice versa. The enroll endpoint returns 503 ca_unavailable when the account has no managed mesh yet.
  • Kura enroll-on-boot generates the keypair locally (the private key never leaves the node), enrolls, writes the cert material to the KURA_INTERNAL_TLS_* paths, and adopts the returned tenant and peers.

Two-way bridge

Managed-mesh discovery advertises each pod’s in-cluster *.svc.cluster.local peer URL, which an off-cluster node cannot reach. The bridge closes that gap:

  • Public peer plane (controller): new KuraInstance fields meshPublicPeerHost, meshExternalPeers, and meshPublicPeerLoadBalancerAnnotations. reconcileAccountPublicPeerService provisions an account-level L4 LoadBalancer on the peer port (TLS-passthrough, externalTrafficPolicy: Local, hcloud location/node-selector plus external-dns annotations); the peer-cert SAN covers the public host; the NetworkPolicy admits the peer port from 0.0.0.0/0 when public peering is on (the mutual-TLS client cert is the auth boundary).
  • Gateway advertisement (Kura): managed pods set KURA_PEER_GATEWAY_URL, and internal_status returns the gateway URL to any request that arrived via the gateway host, read from the HTTP/2 :authority and falling back to the h1 Host header. A node skips a discovered peer whose node_url is its own gateway, so same-region managed pods do not hairpin through the public LoadBalancer.
  • Server: the provisioner sets these fields per mesh region, and Mesh.mesh_peers seeds the gateway URL into the enrollment peer list (Regions.peer_public_url).
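
The gateway-advertisement rule and the hairpin guard can be sketched as two small functions. This is a simplified model, not Kura's code: the function names, the example hosts, and the port are hypothetical, and only the decision logic described above is represented.

```python
from typing import Optional
from urllib.parse import urlsplit


def request_host(authority: Optional[str], host_header: Optional[str]) -> Optional[str]:
    """Prefer the HTTP/2 :authority pseudo-header, fall back to the HTTP/1.1 Host header."""
    raw = authority or host_header
    if not raw:
        return None
    # Prefixing "//" lets urlsplit parse "host:port" as a network location.
    return urlsplit(f"//{raw}").hostname


def advertised_url(authority, host_header, gateway_url, in_cluster_url):
    """Return the gateway URL only to requests that arrived via the gateway host."""
    gateway_host = urlsplit(gateway_url).hostname
    if request_host(authority, host_header) == gateway_host:
        return gateway_url
    return in_cluster_url


def usable_peers(discovered, own_gateway_url):
    """Drop a discovered peer that is this node's own gateway, to avoid hairpinning."""
    own = own_gateway_url.rstrip("/") if own_gateway_url else None
    return [url for url in discovered if url.rstrip("/") != own]


GATEWAY = "https://peers.example.com:7443"  # hypothetical public peer host
IN_CLUSTER = "https://kura-0.kura.svc.cluster.local:7443"  # hypothetical pod URL

print(advertised_url("peers.example.com:7443", None, GATEWAY, IN_CLUSTER))
print(advertised_url(None, "kura-0.kura.svc.cluster.local:7443", GATEWAY, IN_CLUSTER))
print(usable_peers([GATEWAY, IN_CLUSTER], GATEWAY))
```

An off-cluster node that reached the pod through the public host is told to keep using that host, while in-cluster callers continue to see the in-cluster URL.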

Net: a self-hosted node enrolls, receives the gateway as its peer, pushes its writes through it into the managed mesh (which re-replicates internally), and warms from that mesh at join, all without ever accepting an inbound connection.

Replication model and scope

Replication is point-to-point push of each node’s own client writes: on a write, a node enqueues and pushes the artifact to its configured peers, while artifacts it receives are applied locally and not forwarded (there is no gossip). Self-hosted nodes typically sit next to a customer’s runners, which do most of the writing, so the directions that matter are covered:

  • Self-hosted to self-hosted (continuous). Each node pushes its writes to its sibling peers, so a customer’s own mesh stays consistent. The nodes are configured to peer with each other.
  • Self-hosted to Tuist-managed (continuous). Each enrolled node pushes its writes through the gateway into the managed mesh, which re-replicates internally. Because the nodes are outbound-only, the managed mesh cannot pull from them; the data flows only by the nodes pushing. Each write-ingesting node must therefore peer with the gateway itself: with no gossip, a write on a non-bridged node does not reach the managed mesh transitively.
  • Tuist-managed to self-hosted (join-time snapshot). A node warms from the managed mesh when it joins. Continuously propagating new managed-side writes to an already-joined node is a deferred follow-up; it is not required for the runner-adjacent use case where the self-hosted nodes are the write origin.
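
The no-gossip rule and its consequence for non-bridged nodes can be shown with a toy model. This is a sketch under simplifying assumptions: the class and node names are hypothetical, pushes are synchronous rather than enqueued, and the managed mesh is collapsed into a single gateway node.

```python
class Node:
    """Toy cache node: pushes its own client writes to peers, never forwards received ones."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.store = {}

    def client_write(self, key, value):
        # A client write is stored locally and pushed to every configured peer.
        self.store[key] = value
        for peer in self.peers:
            peer.receive_from_peer(key, value)

    def receive_from_peer(self, key, value):
        # A replicated artifact is applied locally and NOT forwarded (no gossip).
        self.store[key] = value


managed = Node("managed-gateway")
bridged = Node("self-hosted-a")
unbridged = Node("self-hosted-b")

bridged.peers = [unbridged, managed]  # peers with its sibling and the gateway
unbridged.peers = [bridged]           # peers only with its sibling

bridged.client_write("k1", b"one")
unbridged.client_write("k2", b"two")

print(sorted(managed.store))    # only the bridged node's write arrived
print(sorted(bridged.store))    # both writes
print(sorted(unbridged.store))  # both writes
```

The write to the non-bridged node reaches its sibling but stops there, which is why every node that ingests writes needs the gateway in its own peer list.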

Security model

Isolation is the per-account CA described above: a node’s leaf chains only to its account’s CA, so it can join only that account’s mesh and cannot present a cert any other account’s nodes would accept. Properties that hold in both topologies:

  • The node’s private key is generated on the node and never leaves it (enrollment sends only a CSR).
  • Self-hosted nodes never receive the symmetric Guardian verifier secret (which could mint tokens for any tenant); cache-API auth is tenant-scoped introspection, fail-closed.
  • Nodes are outbound-only and accept no inbound connections.
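
The tenant-scoped, fail-closed introspection property can be sketched as follows. The token table, the "account" response field, and both function names are hypothetical; only the "active" field follows the standard OAuth token introspection response shape. The real endpoint and hook are not reproduced here.

```python
# Hypothetical token table: token -> owning account.
TOKENS = {"tok-acme": "acme", "tok-globex": "globex"}


def introspect(token, client_account):
    """Control-plane side. client_account is None for the unconstrained Tuist client."""
    owner = TOKENS.get(token)
    if owner is None:
        return {"active": False}
    if client_account is not None and owner != client_account:
        # A tenant-scoped client learns nothing about other tenants' tokens.
        return {"active": False}
    return {"active": True, "account": owner}


def node_allows(token, node_account, call=introspect):
    """Node side. Any error or malformed answer denies the request (fail-closed)."""
    try:
        result = call(token, node_account)
    except Exception:
        return False
    return isinstance(result, dict) and result.get("active") is True


def unreachable(token, client_account):
    raise ConnectionError("control plane unreachable")


print(node_allows("tok-acme", "acme"))                    # own tenant's token
print(node_allows("tok-globex", "acme"))                  # another tenant's token
print(node_allows("tok-acme", "acme", call=unreachable))  # control plane down
```

The design choice worth noting is that the scoping is enforced by the control plane, not by the node: a misconfigured or malicious node still cannot get a positive answer for another tenant's token.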

What crosses into Tuist depends on the topology, and customers should pick deliberately:

  • Standalone (own CA, own mesh, no managed region): artifact bytes never leave the customer’s infrastructure. Tuist receives only the control-plane metadata it needs to function (registration heartbeats, usage events, token introspection). This is the zero-trust path.
  • Bridged (Tuist-managed CA + node-to-managed push): by design, artifacts a node ingests are pushed into the managed mesh, so their bytes and manifests (keys, namespaces, sizes) reside in Tuist’s infrastructure. The per-account CA is also Tuist-managed (the server reads it to sign CSRs). Because nodes are outbound-only, a compromise of that CA key cannot pull a node’s local cache; its worst case is integrity (a malicious gateway serving poisoned artifacts to a node that pulls from it), not exfiltration. A customer who requires that the vendor be cryptographically unable to touch their cache should run standalone, or wait for a future customer-held-CA option for the bridged topology.

End-to-end validation (staging)

A locally-run node was deployed against the real staging managed mesh. It enrolled, got an account-CA-signed leaf (issuer is the account’s peer CA), joined, and reached state: serving.

  • Warm-up pull (managed to self-hosted, at join): the node pulled the account’s full mesh as its bootstrap snapshot; its data directory grew to ~6.8 GB.
  • Push (self-hosted to managed, continuous): artifacts written through the node were pushed to the gateway and accepted by the managed mesh (upsert_artifact result=ok, with the replication counter incrementing across writes).

Caveats and follow-ups

  • Continuous managed-to-node propagation is a deferred follow-up. As described under Replication model and scope, a self-hosted node warms from the managed mesh at join (validated above) but does not yet continuously pull new managed-side writes. The node-to-node and node-to-managed directions, which carry the runner-adjacent use case, work via push and are the supported scope here.

Docs

server/priv/docs/en/guides/cache/self-host.md covers Kura mesh self-hosting (superseding the Elixir cache instructions) for both topologies: bridged (managed mesh plus self-hosted nodes) and standalone (self-hosted only), including what enrollment auto-provisions (the peer TLS) versus what the customer provides.

Validation

  • Server: mix test test/tuist/kura/ test/tuist/oauth/introspection_test.exs test/tuist_web/controllers/internal/kura_mesh_controller_test.exs test/tuist_web/controllers/internal/kura_usage_controller_test.exs test/tuist_web/live/cache_live_test.exs.
  • End-to-end on staging as described above (enroll to serving, join-time warm-up pull from the managed mesh, and continuous push into it).