feat(server): self-hosted Kura nodes with per-account mesh CA enrollment

GitHub issue · Closed

Metadata

Source: tuist/tuist #11294
Updated: Jun 24, 2026
Domains: Kura

Details

Lets a customer run their own Kura cache nodes, authorized by Tuist’s control plane, and bridge them into the same mutually-authenticated mesh as their Tuist-managed nodes. Typically the nodes sit next to a customer’s runners, so writes against the self-hosted nodes replicate into the shared managed cache, and a node warms from that cache when it joins. A customer never has to choose between owning the whole mesh and running a single node; spinning one up takes only a credential and a URL.

Why

Customers want Kura cache nodes in their own infrastructure (locality, data residency). The control plane should own authentication, certificate issuance, and discovery so the customer supplies only a credential and a URL.

Two load-bearing constraints shaped the design:

Kura’s hot-path auth first tries to verify a Tuist Guardian JWT locally using a symmetric HS512 secret. That secret can mint tokens for any tenant, so it cannot be handed to a customer. Self-hosted nodes run without the verifier and authorize uncached requests via the control plane’s introspection endpoint; making that introspection tenant-scoped is the security boundary.
Kura’s peer mTLS verifier trusts any client cert that chains to the configured CA, with no identity check. The CA is the trust boundary, so isolation comes from a per-account CA.
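
The first constraint can be illustrated with a minimal sketch, assuming a toy in-memory token store. The names `TOKENS`, `introspect`, and `authorize` are hypothetical and do not reflect the actual server or Kura code; the sketch only shows why a tenant-scoped caller learns nothing about other tenants' tokens and why any failure denies the request.

```python
from typing import Callable, Optional

# Hypothetical token store: token -> owning account handle.
TOKENS = {"tok-acme": "acme", "tok-globex": "globex"}


def introspect(token: str, client_account: Optional[str]) -> dict:
    """Control-plane side (illustrative).

    client_account=None models the unconstrained Tuist-managed client;
    an account handle models a tenant-scoped self-hosted client.
    """
    owner = TOKENS.get(token)
    if owner is None:
        return {"active": False}
    if client_account is not None and owner != client_account:
        # A tenant-scoped client gets no information about other tenants' tokens.
        return {"active": False}
    return {"active": True, "account": owner}


def authorize(token: str, node_account: str,
              call: Callable[[str, Optional[str]], dict] = introspect) -> bool:
    """Node side: allow only on a positive answer; any failure denies."""
    try:
        result = call(token, node_account)
    except Exception:
        return False  # fail-closed when the control plane is unreachable
    return result.get("active") is True
```

In this model a node enrolled for `acme` accepts `tok-acme`, rejects `tok-globex`, and rejects everything when the introspection call raises.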

What changed

Control-plane authorization

Tenant-scoped credential: per-account kura_self_hosted_clients (Bcrypt at rest with the standard pepper, plaintext shown once, a suffix-masked hint thereafter), issued and revoked from the account’s Cache page.
Per-account introspection and usage constraint: both endpoints distinguish the unconstrained Tuist control-plane client from a tenant-scoped client, so a self-hosted node can only introspect its own tenant’s tokens and report its own usage.
Lease-based registration: self-hosted nodes self-register their client-facing URL via heartbeats, and that URL is the endpoint the CLI routes to (resolved under the :kura technology). A node ages out of rotation when heartbeats stop. There is no manually-managed endpoint list; the registered node is the endpoint.
Cache-API authentication: the Tuist auth hook ships in the Kura image, so a self-hosted node enables it with KURA_EXTENSION_* to require a valid Tuist token on every cache request, authorized via tenant-scoped introspection (no symmetric verifier secret, fail-closed). Covered by introspection-only tests against the real hook; managed pods keep their ConfigMap-mounted hook, which shadows the bundled file.
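
As a rough sketch of the lease-based registration described above, assuming a simple time-to-live model: `LeaseRegistry`, `ttl_seconds`, and the explicit `now` argument are hypothetical names chosen for illustration, and the real lease length and storage are not stated here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LeaseRegistry:
    """Tracks the last heartbeat per node URL (illustrative only)."""
    ttl_seconds: float  # hypothetical lease length
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node_url: str, now: float) -> None:
        # Each heartbeat registers the node or renews its lease.
        self.last_seen[node_url] = now

    def active_endpoints(self, now: float) -> List[str]:
        # A node ages out once its last heartbeat is older than the lease.
        return sorted(
            url for url, seen in self.last_seen.items()
            if now - seen <= self.ttl_seconds
        )
```

With a 30-second lease, a node that last heartbeated at t=0 is still listed at t=25 and is gone at t=40, with no separate endpoint list to maintain.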

Enrollment (controller-managed per-account CA)

The Kura controller owns the per-account peer CA (the kura-<handle>-peer-ca secret). Mesh.enroll_node and sign_node_certificate read that secret and sign the node’s CSR from it with an issuer-controlled SAN, so the node’s leaf is trusted by the managed pods (same CA) and vice versa. The enroll endpoint returns 503 ca_unavailable when the account has no managed mesh yet.
Kura enroll-on-boot generates the keypair locally (the private key never leaves the node), enrolls, writes the cert material to the KURA_INTERNAL_TLS_* paths, and adopts the returned tenant and peers.
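
The enroll decision can be modeled roughly as follows. This is a sketch with signing stubbed out: `enroll_node` here is a hypothetical Python stand-in, not the server's Mesh.enroll_node, and the request and response shapes are assumptions. It shows only the three properties stated above: the CA is looked up per account, a missing CA yields 503 ca_unavailable, and the SAN comes from the issuer rather than from the CSR.

```python
from typing import Dict, Tuple


def enroll_node(ca_secrets: Dict[str, str], handle: str, csr: dict,
                issuer_san: str) -> Tuple[int, dict]:
    """Model of the enroll decision (no real cryptography).

    ca_secrets maps a secret name to an opaque CA identifier.
    csr carries only the node's public key; the private key stays on the node.
    """
    ca = ca_secrets.get(f"kura-{handle}-peer-ca")
    if ca is None:
        # The account has no managed mesh yet, so there is no CA to sign with.
        return 503, {"error": "ca_unavailable"}
    leaf = {
        "issuer": ca,
        "public_key": csr["public_key"],
        "san": issuer_san,  # issuer-controlled; a SAN requested in the CSR is ignored
    }
    return 200, {"certificate": leaf}
```

Because every leaf for an account is issued under that account's CA, managed pods and self-hosted nodes of the same account trust each other, while a leaf from another account's CA fails verification.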

Two-way bridge

Managed-mesh discovery advertises each pod’s in-cluster *.svc.cluster.local peer URL, which an off-cluster node cannot reach. The bridge closes that gap:

Public peer plane (controller): new KuraInstance fields meshPublicPeerHost, meshExternalPeers, and meshPublicPeerLoadBalancerAnnotations. reconcileAccountPublicPeerService provisions an account-level L4 LoadBalancer on the peer port (TLS-passthrough, externalTrafficPolicy: Local, hcloud location/node-selector plus external-dns annotations); the peer-cert SAN covers the public host; the NetworkPolicy admits the peer port from 0.0.0.0/0 when public peering is on (the mutual-TLS client cert is the auth boundary).
Gateway advertisement (Kura): managed pods set KURA_PEER_GATEWAY_URL, and internal_status returns the gateway URL to any request that arrived via the gateway host, read from the HTTP/2 :authority and falling back to the h1 Host header. A node skips a discovered peer whose node_url is its own gateway, so same-region managed pods do not hairpin through the public LoadBalancer.
Server: the provisioner sets these fields per mesh region, and Mesh.mesh_peers seeds the gateway URL into the enrollment peer list (Regions.peer_public_url).
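
The gateway advertisement and the hairpin check can be sketched as two small functions. The function names, URLs, and port below are hypothetical and the real internal_status handler is not shown here; the sketch assumes the host comparison ignores the port.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlsplit


def _hostname(host: str) -> Optional[str]:
    # Parse "host" or "host:port" and return the lowercased host part.
    return urlsplit("//" + host).hostname


def advertised_url(pod_url: str, gateway_url: Optional[str],
                   authority: Optional[str],
                   host_header: Optional[str]) -> str:
    """URL a managed pod reports for itself in its status response."""
    request_host = authority or host_header  # HTTP/2 :authority, else h1 Host
    if gateway_url and request_host:
        if _hostname(request_host) == urlsplit(gateway_url).hostname:
            return gateway_url  # request arrived via the gateway host
    return pod_url  # in-cluster callers keep seeing the in-cluster URL


def select_peers(discovered: Iterable[str],
                 own_gateway_url: Optional[str]) -> List[str]:
    """Drop a discovered peer that is this node's own gateway (no hairpin)."""
    return [url for url in discovered if url != own_gateway_url]
```

An off-cluster node that reaches a pod through the public host is told the gateway URL, while a managed pod that discovers its own gateway among its peers skips it and keeps talking to siblings in-cluster.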

Net: a self-hosted node enrolls, receives the gateway as its peer, pushes its writes through it into the managed mesh (which re-replicates internally), and warms from that mesh at join, all without ever accepting an inbound connection.

Replication model and scope

Replication is point-to-point push of each node’s own client writes: on a write, a node enqueues and pushes the artifact to its configured peers, while artifacts it receives are applied locally and not forwarded (there is no gossip). Self-hosted nodes typically sit next to a customer’s runners, which do most of the writing, so the directions that matter are covered:

Self-hosted to self-hosted (continuous). Each node pushes its writes to its sibling peers, so a customer’s own mesh stays consistent. The nodes are configured to peer with each other.
Self-hosted to Tuist-managed (continuous). Each enrolled node pushes its writes through the gateway into the managed mesh, which re-replicates internally. Because the nodes are outbound-only, managed cannot pull from them; the data flows by the nodes pushing. Each write-ingesting node peers with the gateway (with no gossip, a write on a non-bridged node does not reach managed transitively).
Tuist-managed to self-hosted (join-time snapshot). A node warms from the managed mesh when it joins. Continuously propagating new managed-side writes to an already-joined node is a deferred follow-up; it is not required for the runner-adjacent use case where the self-hosted nodes are the write origin.
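
A toy simulation makes the no-gossip consequence concrete. The `Node` class below is hypothetical, and the managed mesh is collapsed into a single node standing in for the gateway (its internal re-replication is not modeled).

```python
from typing import Dict, List


class Node:
    """Toy cache node: pushes its own client writes, never forwards."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.peers: List["Node"] = []
        self.store: Dict[str, bytes] = {}

    def client_write(self, key: str, data: bytes) -> None:
        # A write from a client is stored and pushed to every configured peer.
        self.store[key] = data
        for peer in self.peers:
            peer.receive(key, data)

    def receive(self, key: str, data: bytes) -> None:
        # A replicated artifact is applied locally and not forwarded (no gossip).
        self.store[key] = data


# Two self-hosted siblings; only node_a also peers with the managed gateway.
node_a, node_b, managed = Node("a"), Node("b"), Node("managed")
node_a.peers = [node_b, managed]
node_b.peers = [node_a]

node_a.client_write("k1", b"from-a")  # reaches node_b and managed
node_b.client_write("k2", b"from-b")  # reaches node_a only
```

After both writes, `k1` is present on all three nodes, while `k2` exists on `node_a` and `node_b` but not on `managed`, which is why each write-ingesting node needs the gateway as a peer.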

Security model

Isolation is the per-account CA described above: a node’s leaf chains only to its account’s CA, so it can join only that account’s mesh and cannot present a cert any other account’s nodes would accept. Properties that hold in both topologies:

The node’s private key is generated on the node and never leaves it (enrollment sends only a CSR).
Self-hosted nodes never receive the symmetric Guardian verifier secret (which could mint tokens for any tenant); cache-API auth is tenant-scoped introspection, fail-closed.
Nodes are outbound-only and accept no inbound connections.

What crosses into Tuist depends on the topology, and customers should pick deliberately:

Standalone (own CA, own mesh, no managed region): artifact bytes never leave the customer’s infrastructure. Tuist receives only the control-plane metadata it needs to function (registration heartbeats, usage events, token introspection). This is the zero-trust path.
Bridged (Tuist-managed CA + node-to-managed push): by design, artifacts a node ingests are pushed into the managed mesh, so their bytes and manifests (keys, namespaces, sizes) reside in Tuist’s infrastructure. The per-account CA is also Tuist-managed (the server reads it to sign CSRs). Because nodes are outbound-only, a compromise of that CA key cannot pull a node’s local cache; its worst case is integrity (a malicious gateway serving poisoned artifacts to a node that pulls from it), not exfiltration. A customer who requires that the vendor be cryptographically unable to touch their cache should run standalone, or wait for a future customer-held-CA option for bridged.

End-to-end validation (staging)

A locally-run node was deployed against the real staging managed mesh. It enrolled, got an account-CA-signed leaf (issuer is the account’s peer CA), joined, and reached state: serving.

Warm-up pull (managed to self-hosted, at join): the node pulled the account’s full mesh as its bootstrap snapshot; its data directory grew to ~6.8 GB.
Push (self-hosted to managed, continuous): artifacts written through the node were pushed to the gateway and accepted by the managed mesh (upsert_artifact result=ok, with the replication counter incrementing across writes).

Caveats and follow-ups

Continuous managed-to-node propagation is a deferred follow-up. As described under Replication model and scope, a self-hosted node warms from the managed mesh at join (validated above) but does not yet continuously pull new managed-side writes. The node-to-node and node-to-managed directions, which carry the runner-adjacent use case, work via push and are the supported scope here.

Docs

server/priv/docs/en/guides/cache/self-host.md covers Kura mesh self-hosting (superseding the Elixir cache instructions) for both topologies: bridged (managed mesh plus self-hosted nodes) and standalone (self-hosted only), including what enrollment auto-provisions (the peer TLS material) versus what the customer provides.

Validation

Server: mix test test/tuist/kura/ test/tuist/oauth/introspection_test.exs test/tuist_web/controllers/internal/kura_mesh_controller_test.exs test/tuist_web/controllers/internal/kura_usage_controller_test.exs test/tuist_web/live/cache_live_test.exs.
End-to-end on staging as described above (enroll to serving, two-way replication).
