feat(server): self-hosted Kura nodes with per-account mesh CA enrollment
GitHub issue · Closed
Lets a customer run their own Kura cache nodes, authorized by Tuist’s control plane, and bridge them into the same mutually-authenticated mesh as their Tuist-managed nodes. Typically the nodes sit next to a customer’s runners, so writes against the self-hosted nodes replicate into the shared managed cache, and a node warms from that cache when it joins. A customer never has to choose between owning the whole mesh and running a single node; spinning one up takes only a credential and a URL.
Why
Customers want Kura cache nodes in their own infrastructure (locality, data residency). The control plane should own authentication, certificate issuance, and discovery so the customer supplies only a credential and a URL.
Two load-bearing constraints shaped the design:
- Kura’s hot-path auth first tries to verify a Tuist Guardian JWT locally using a symmetric HS512 secret. That secret can mint tokens for any tenant, so it cannot be handed to a customer. Self-hosted nodes run without the verifier and authorize uncached requests via the control plane’s introspection endpoint; making that introspection tenant-scoped is the security boundary.
- Kura’s peer mTLS verifier trusts any client cert that chains to the configured CA, with no identity check. The CA is the trust boundary, so isolation comes from a per-account CA.
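The first constraint can be illustrated with a minimal sketch of HS512 signing. It is not Kura's or Guardian's code; the `tenant` claim name and the secret value are assumptions made for illustration. The point is only that signing and verifying use the same key, so whoever can verify can also mint.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs512(claims: dict, secret: bytes) -> str:
    """Build a compact JWT signed with HMAC-SHA512."""
    header = b64url(json.dumps({"alg": "HS512", "typ": "JWT"}, separators=(",", ":")).encode())
    payload = b64url(json.dumps(claims, separators=(",", ":")).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha512).digest()
    return f"{header}.{payload}.{b64url(signature)}"


def verify_hs512(token: str, secret: bytes) -> Optional[dict]:
    """Return the claims if the signature matches, otherwise None."""
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        return None
    expected = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha512).digest()
    if not hmac.compare_digest(b64url(expected), signature):
        return None
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Hypothetical values: a node holding the verifier secret could forge a
# token for a tenant it does not belong to, and it would verify.
shared_secret = b"example-verifier-secret"
forged = sign_hs512({"tenant": "some-other-account"}, shared_secret)
claims = verify_hs512(forged, shared_secret)
```

With an asymmetric scheme the verifier would hold only a public key; with HS512 there is no such split, which is why self-hosted nodes go through introspection instead.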
What changed
Control-plane authorization
- Tenant-scoped credential: per-account `kura_self_hosted_clients` (Bcrypt at rest with the standard pepper, plaintext shown once, a suffix-masked hint thereafter), issued and revoked from the account’s Cache page.
- Per-account introspection and usage constraint: both endpoints branch on whether the caller is the unconstrained Tuist control-plane client or a tenant-scoped client, so a self-hosted node can only introspect its own tenant’s tokens and report its own usage.
- Lease-based registration: self-hosted nodes self-register their client-facing URL via heartbeats, and that URL is the endpoint the CLI routes to (resolved under the `:kura` technology). A node ages out of rotation when heartbeats stop. There is no manually-managed endpoint list; the registered node is the endpoint.
- Cache-API authentication: the Tuist auth hook ships in the Kura image, so a self-hosted node enables it with `KURA_EXTENSION_*` to require a valid Tuist token on every cache request, authorized via tenant-scoped introspection (no symmetric verifier secret, fail-closed). Covered by introspection-only tests against the real hook; managed pods keep their ConfigMap-mounted hook, which shadows the bundled file.
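The lease-based registration above can be sketched as a small in-memory registry. This is an illustration of the idea rather than the server's implementation: the `LeaseRegistry` name, the TTL value, and the explicit `now` argument are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseRegistry:
    """Tracks the last heartbeat per (account, node URL) pair."""

    ttl_seconds: float
    _last_seen: dict = field(default_factory=dict)

    def heartbeat(self, account: str, node_url: str, now: float) -> None:
        # A heartbeat both registers a node and renews its lease.
        self._last_seen[(account, node_url)] = now

    def live_endpoints(self, account: str, now: float) -> list:
        # A node is in rotation only while its lease is unexpired.
        return sorted(
            url
            for (acct, url), seen in self._last_seen.items()
            if acct == account and now - seen <= self.ttl_seconds
        )


# Hypothetical account handle, URLs, and a 30-second lease.
registry = LeaseRegistry(ttl_seconds=30)
registry.heartbeat("acme", "https://kura-a.acme.example", now=0)
registry.heartbeat("acme", "https://kura-b.acme.example", now=20)
at_25 = registry.live_endpoints("acme", now=25)  # both leases valid
at_40 = registry.live_endpoints("acme", now=40)  # kura-a has aged out
```

Passing time in explicitly keeps the sketch deterministic; a real registry would read a clock and persist leases.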
Enrollment (controller-managed per-account CA)
- The Kura controller owns the per-account peer CA (the `kura-<handle>-peer-ca` secret). `Mesh.enroll_node` and `sign_node_certificate` read that secret and sign the node’s CSR from it with an issuer-controlled SAN, so the node’s leaf is trusted by the managed pods (same CA) and vice versa. The enroll endpoint returns `503 ca_unavailable` when the account has no managed mesh yet.
- Kura enroll-on-boot generates the keypair locally (the private key never leaves the node), enrolls, writes the cert material to the `KURA_INTERNAL_TLS_*` paths, and adopts the returned tenant and peers.
Two-way bridge
Managed-mesh discovery advertises each pod’s in-cluster *.svc.cluster.local peer URL, which an off-cluster node cannot reach. The bridge closes that gap:
- Public peer plane (controller): new `KuraInstance` fields `meshPublicPeerHost`, `meshExternalPeers`, and `meshPublicPeerLoadBalancerAnnotations`. `reconcileAccountPublicPeerService` provisions an account-level L4 LoadBalancer on the peer port (TLS-passthrough, `externalTrafficPolicy: Local`, hcloud `location`/`node-selector` plus external-dns annotations); the peer-cert SAN covers the public host; the NetworkPolicy admits the peer port from `0.0.0.0/0` when public peering is on (the mutual-TLS client cert is the auth boundary).
- Gateway advertisement (Kura): managed pods set `KURA_PEER_GATEWAY_URL`, and `internal_status` returns the gateway URL to any request that arrived via the gateway host, read from the HTTP/2 `:authority` and falling back to the h1 `Host` header. A node skips a discovered peer whose `node_url` is its own gateway, so same-region managed pods do not hairpin through the public LoadBalancer.
- Server: the provisioner sets these fields per mesh region, and `Mesh.mesh_peers` seeds the gateway URL into the enrollment peer list (`Regions.peer_public_url`).
Net: a self-hosted node enrolls, receives the gateway as its peer, pushes its writes through it into the managed mesh (which re-replicates internally), and warms from that mesh at join, all without ever accepting an inbound connection.
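The gateway advertisement rule can be sketched as two small functions. This is a hedged illustration, not Kura's code: the function names, the hostname-only comparison (ignoring port and case), and the example URLs are assumptions.

```python
from typing import List, Optional
from urllib.parse import urlsplit


def advertised_url(
    authority: Optional[str],
    host_header: Optional[str],
    gateway_url: Optional[str],
    in_cluster_url: str,
) -> str:
    """Pick the peer URL a managed pod reports for this request."""
    if not gateway_url:
        return in_cluster_url
    # Prefer the HTTP/2 :authority, fall back to the HTTP/1.1 Host header.
    request_host = authority or host_header
    if not request_host:
        return in_cluster_url
    arrived_via = urlsplit(f"//{request_host}").hostname
    if arrived_via is not None and arrived_via == urlsplit(gateway_url).hostname:
        # The caller reached us through the public gateway, so the
        # in-cluster URL would be unreachable for it.
        return gateway_url
    return in_cluster_url


def peers_to_dial(discovered: List[str], own_gateway_url: Optional[str]) -> List[str]:
    """Drop a discovered peer that is this node's own gateway."""
    return [url for url in discovered if url != own_gateway_url]


# Hypothetical URLs for illustration.
gateway = "https://peers.acme.example:7443"
in_cluster = "https://kura-0.kura.svc.cluster.local:7443"
seen_by_external_node = advertised_url("peers.acme.example:7443", None, gateway, in_cluster)
seen_by_cluster_peer = advertised_url(None, "kura-0.kura.svc.cluster.local:7443", gateway, in_cluster)
```

The second function is what keeps same-region managed pods from routing to each other through the public LoadBalancer.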
Replication model and scope
Replication is point-to-point push of each node’s own client writes: on a write, a node enqueues and pushes the artifact to its configured peers, while artifacts it receives are applied locally and not forwarded (there is no gossip). Self-hosted nodes typically sit next to a customer’s runners, which do most of the writing, so the directions that matter are covered:
- Self-hosted to self-hosted (continuous). Each node pushes its writes to its sibling peers, so a customer’s own mesh stays consistent. The nodes are configured to peer with each other.
- Self-hosted to Tuist-managed (continuous). Each enrolled node pushes its writes through the gateway into the managed mesh, which re-replicates internally. Because the nodes are outbound-only, managed cannot pull from them; the data flows by the nodes pushing. Each write-ingesting node peers with the gateway (with no gossip, a write on a non-bridged node does not reach managed transitively).
- Tuist-managed to self-hosted (join-time snapshot). A node warms from the managed mesh when it joins. Continuously propagating new managed-side writes to an already-joined node is a deferred follow-up; it is not required for the runner-adjacent use case where the self-hosted nodes are the write origin.
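A toy model of the push-without-forwarding rule makes the bridging requirement concrete. It is a sketch only: the `Node` class is hypothetical, the managed mesh is collapsed into a single node, and queuing and retries are omitted.

```python
class Node:
    """A cache node that pushes its own client writes to configured peers."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.peers = []

    def write(self, key, value):
        # Client write: apply locally, then push to each configured peer.
        self.store[key] = value
        for peer in self.peers:
            peer.receive(key, value)

    def receive(self, key, value):
        # Replicated write: applied locally and never forwarded (no gossip).
        self.store[key] = value


node_a = Node("self-hosted-a")
node_b = Node("self-hosted-b")
managed = Node("managed-gateway")  # stands in for the whole managed mesh

node_a.peers = [node_b, managed]  # bridged: peers with its sibling and the gateway
node_b.peers = [node_a]           # not bridged: peers with its sibling only

node_a.write("artifact-1", b"from-a")
node_b.write("artifact-2", b"from-b")
```

After these two writes, `artifact-1` is on all three nodes, while `artifact-2` reaches `node_a` but not `managed`, because `node_a` does not forward what it receives. That is why each write-ingesting node needs the gateway as a peer.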
Security model
Isolation is the per-account CA described above: a node’s leaf chains only to its account’s CA, so it can join only that account’s mesh and cannot present a cert any other account’s nodes would accept. Properties that hold in both topologies:
- The node’s private key is generated on the node and never leaves it (enrollment sends only a CSR).
- Self-hosted nodes never receive the symmetric Guardian verifier secret (which could mint tokens for any tenant); cache-API auth is tenant-scoped introspection, fail-closed.
- Nodes are outbound-only and accept no inbound connections.
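The tenant-scoped, fail-closed introspection property can be sketched from both sides. The names here (`Caller`, `introspect`, `authorize`) and the `tenant` field are assumptions for illustration; only the `active` field follows the usual token-introspection response shape.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Caller:
    """Who is calling introspection. tenant=None models the unconstrained client."""

    tenant: Optional[str]


def introspect(caller: Caller, claims: Optional[dict]) -> dict:
    """Control-plane side: a tenant-scoped caller only sees its own tokens."""
    if claims is None:
        return {"active": False}
    if caller.tenant is not None and claims.get("tenant") != caller.tenant:
        # Another tenant's token looks the same as an invalid one.
        return {"active": False}
    return {"active": True, "tenant": claims.get("tenant")}


def authorize(node_tenant: str, token: str, call_introspection: Callable[[str], dict]) -> bool:
    """Node side: any error or unexpected answer denies the request."""
    try:
        response = call_introspection(token)
    except Exception:
        return False  # fail closed when the control plane is unreachable
    return (
        isinstance(response, dict)
        and response.get("active") is True
        and response.get("tenant") == node_tenant
    )


# Hypothetical token table standing in for the control plane's token store.
tokens = {"tok-acme": {"tenant": "acme"}, "tok-other": {"tenant": "other"}}
acme_node = Caller(tenant="acme")


def lookup(token: str) -> dict:
    return introspect(acme_node, tokens.get(token))


def unreachable(token: str) -> dict:
    raise ConnectionError("control plane unreachable")
```

The scoping decision lives on the control plane, so a node that misbehaves still cannot learn anything about another tenant's tokens; the node-side check only adds the fail-closed default.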
What crosses into Tuist depends on the topology, and customers should pick deliberately:
- Standalone (own CA, own mesh, no managed region): artifact bytes never leave the customer’s infrastructure. Tuist receives only the control-plane metadata it needs to function (registration heartbeats, usage events, token introspection). This is the zero-trust path.
- Bridged (Tuist-managed CA + node-to-managed push): by design, artifacts a node ingests are pushed into the managed mesh, so their bytes and manifests (keys, namespaces, sizes) reside in Tuist’s infrastructure. The per-account CA is also Tuist-managed (the server reads it to sign CSRs). Because nodes are outbound-only, a compromise of that CA key cannot pull a node’s local cache; its worst case is integrity (a malicious gateway serving poisoned artifacts to a node that pulls from it), not exfiltration. A customer who requires that the vendor be cryptographically unable to touch their cache should run standalone, or wait for a future customer-held-CA option for the bridged topology.
End-to-end validation (staging)
A locally-run node was deployed against the real staging managed mesh. It enrolled, got an account-CA-signed leaf (issuer is the account’s peer CA), joined, and reached state: serving.
- Warm-up pull (managed to self-hosted, at join): the node pulled the account’s full mesh as its bootstrap snapshot; its data directory grew to ~6.8 GB.
- Push (self-hosted to managed, continuous): artifacts written through the node were pushed to the gateway and accepted by the managed mesh (`upsert_artifact result=ok`, with the replication counter incrementing across writes).
Caveats and follow-ups
- Continuous managed-to-node propagation is a deferred follow-up. As described under Replication model and scope, a self-hosted node warms from the managed mesh at join (validated above) but does not yet continuously pull new managed-side writes. The node-to-node and node-to-managed directions, which carry the runner-adjacent use case, work via push and are the supported scope here.
Docs
server/priv/docs/en/guides/cache/self-host.md covers Kura mesh self-hosting (superseding the Elixir cache instructions) for both topologies: bridged (managed mesh plus self-hosted nodes) and standalone (self-hosted only), including what enrollment auto-provisions (the peer TLS) versus what the customer provides.
Validation
- Server: `mix test test/tuist/kura/ test/tuist/oauth/introspection_test.exs test/tuist_web/controllers/internal/kura_mesh_controller_test.exs test/tuist_web/controllers/internal/kura_usage_controller_test.exs test/tuist_web/live/cache_live_test.exs`.
- End-to-end on staging as described above (enroll to serving, two-way replication).