feat(kura): co-locate runner-cache Kura nodes with the runner fleet

GitHub issue · Closed

Metadata
Source: tuist/tuist #10982
Updated: Jun 24, 2026
Domains: Kura
Details

Summary

Adds the private-region concept on top of the umbrella-cluster topology (now on main) so a single-replica Kura node can be co-located with a customer’s runners-as-a-service pool inside the cluster. The runner dispatch chain hands the runner a cache_endpoint_url, which the runner exports as TUIST_CACHE_ENDPOINT. The Gradle plugin and CLI honor that variable as a cache-endpoint override, so build cache traffic stays on the cluster’s internal Service DNS instead of going through the public Kura ingress.

Validated end-to-end in staging with real Gradle build-cache traffic on a kata runner (see “Staging E2E results” below).

What’s in this PR

Regions (Tuist.Kura.Regions)

  • Regions.private?/1 + Regions.selectable/0. Private regions stay in available/0 (so create_server/1 accepts them) but are hidden from the customer picker — they are control-plane-provisioned, not customer-picked.
  • Two private runner-cache regions in a @private_region_specs catalog (mirroring main’s spec-driven @managed_region_specs):
    • scw-fr-par-runners — production target, pinned to the kura-scw-fr-par node pool, scw-bssd.
    • hetzner-staging-runners — staging-only, pinned to the umbrella cluster’s kura node pool. Enabled in values-managed-staging.yaml.
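
The visibility split described above (private regions are accepted by the control plane but hidden from the picker) can be sketched in Python. This is an illustration only, not the Elixir implementation: the `RegionSpec` type and the `managed-example` region id are hypothetical, and the function names merely mirror `Regions.available/0`, `Regions.selectable/0`, and `Regions.private?/1`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionSpec:
    id: str
    private: bool = False


# Hypothetical catalog: one customer-pickable placeholder region plus the
# two private runner-cache regions named in this PR.
CATALOG = (
    RegionSpec("managed-example"),  # placeholder id, not a real region
    RegionSpec("scw-fr-par-runners", private=True),
    RegionSpec("hetzner-staging-runners", private=True),
)


def available() -> list:
    """Every region a server may be created in, private ones included."""
    return [r.id for r in CATALOG]


def is_private(region_id: str) -> bool:
    """True when the region is control-plane-provisioned only."""
    return any(r.id == region_id and r.private for r in CATALOG)


def selectable() -> list:
    """Regions shown in the customer picker: available minus private."""
    return [r.id for r in CATALOG if not r.private]
```

Keeping private regions in `available()` means the creation path needs no special case; only the picker filters them out.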

KuraInstance CRD

  • New spec.private boolean — the explicit marker for a no-public-endpoint instance. The controller’s topology already does the right thing when public hosts are absent (ClusterIP Service, no Ingress, no Certificate). The hand-maintained CRD yaml in infra/helm/tuist/crds/ now carries the field; Helm never upgrades crds/, so each environment needs a one-time kubectl apply of the CRD (done on staging).

Activation (Tuist.Kura.activate_server/2)

  • Private regions skip the public DNS + HTTPS /up probe and the account_cache_endpoints mirror (CLI-facing; developer machines can’t reach the in-cluster endpoint).
  • Provisioner.public_url/3 interpolates private_url_template — e.g. http://kura-tuist-staging.kura.svc.cluster.local:4000.
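
As a rough sketch of the activation behavior above: the placeholder syntax of `private_url_template` is not shown in this description, so the template string, the `private_url` helper, and the step names below are all assumptions chosen to reproduce the example URL.

```python
from string import Template

# Hypothetical template; the real private_url_template and its placeholder
# syntax are not shown in this PR description.
PRIVATE_URL_TEMPLATE = Template(
    "http://${service}.${namespace}.svc.cluster.local:${port}"
)


def private_url(service: str, namespace: str = "kura", port: int = 4000) -> str:
    """Build the in-cluster Service URL for a private Kura node."""
    return PRIVATE_URL_TEMPLATE.substitute(
        service=service, namespace=namespace, port=port
    )


def activation_steps(private: bool) -> list:
    """List the steps activation runs (step names are illustrative only)."""
    steps = ["wait_for_deployment"]
    if not private:
        # Public-only work that private regions skip.
        steps += ["public_dns", "https_up_probe", "mirror_account_cache_endpoints"]
    steps.append("mark_active")
    return steps
```

For example, `private_url("kura-tuist-staging")` yields the staging URL quoted above.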

Lifecycle (Tuist.Kura.RunnerCache, runs in the reconciler tick)

  • Identity rule: an account with ≥1 Runner Profile and the :runners feature flag enabled has exactly one non-destroyed node in the active private region; accounts without profiles (or flag off) have none. (Rebased onto main’s profile+flag model after accounts.runner_max_concurrent was dropped.)
  • Self-heals: failed nodes that never had a successful deployment are retried each tick — a transient apiserver error or missing CRD field can’t strand an account.
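
The identity rule and the self-heal behavior can be read as a per-tick decision function. The sketch below is a simplified Python model, not the `Tuist.Kura.RunnerCache` code: the `Node` and `Account` types, the state names, and the returned action strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    state: str  # hypothetical states: "provisioning", "active", "failed", "destroyed"
    ever_deployed: bool = False


@dataclass
class Account:
    runner_profiles: int
    runners_flag: bool
    nodes: list = field(default_factory=list)  # nodes in the active private region


def wants_node(account: Account) -> bool:
    """Identity rule: at least one Runner Profile and the :runners flag on."""
    return account.runner_profiles >= 1 and account.runners_flag


def reconcile(account: Account) -> str:
    """Return the action one reconciler tick would take for this account."""
    live = [n for n in account.nodes if n.state != "destroyed"]
    if not wants_node(account):
        return "destroy" if live else "noop"
    if not live:
        return "provision"
    # Self-heal: a failed node that never deployed successfully is retried.
    if any(n.state == "failed" and not n.ever_deployed for n in live):
        return "retry"
    return "noop"
```

Because the function depends only on current state, running it every tick converges on exactly one non-destroyed node per eligible account.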

Runner routing

  • Tuist.Kura.runner_cache_endpoint_url/1: in-cluster URL for the account’s active private node, or nil.
  • RunnersController.dispatch returns cache_endpoint_url in the claim response.
  • Linux runner image: split-container shape stages the URL at <JIT_PATH>.cache-endpoint; run-job.sh exports TUIST_CACHE_ENDPOINT before exec. Rollout-bridge + macOS images export in place.
  • Cache routing is a soft optimization — a staging failure logs and falls back to the CLI’s default resolution.
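
The "soft optimization" rule can be illustrated with a small Python model of the export step. The real logic lives in the runner image's run-job.sh; the function name and the log message here are assumptions, and only the `<JIT_PATH>.cache-endpoint` file convention and the `TUIST_CACHE_ENDPOINT` variable come from the description above.

```python
import sys
from pathlib import Path


def export_cache_endpoint(jit_path: str, env: dict) -> dict:
    """Return a copy of env with TUIST_CACHE_ENDPOINT set when a URL was staged.

    Any failure to read the staged file is logged and ignored, so the job
    still runs and the CLI falls back to its default cache resolution.
    """
    staged = Path(jit_path + ".cache-endpoint")
    try:
        url = staged.read_text().strip()
    except OSError as exc:
        print(f"cache endpoint not staged ({exc}); using default resolution",
              file=sys.stderr)
        return dict(env)
    if not url:
        return dict(env)
    return {**env, "TUIST_CACHE_ENDPOINT": url}
```

The important property is that a missing or unreadable file never fails the job; it only loses the routing optimization.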

NetworkPolicy

  • Runner-namespace egress carve-out to the kura namespace on 4000/50051.
  • Fixes a dispatch-egress livelock. The policy’s idle-only scoping (runner-pool-owner DoesNotExist) revoked dispatch egress mid-claim: Cilium recomputes the identity when the server stamps the owner label, and when revocation won the race the claim response was lost, leaving the poller stuck retrying against POLICY_DENIED forever. This was observed via Hubble: both staging warm pollers were zombies, so every claim fell through to other capacity. Dispatch egress now selects all runner Pods; the audience-scoped SA token (poller-only) is the access control.

Staging E2E results

name: Kura Runner Cache Routing Smoke
# A/B measurement of the runner-local Kura build cache using REAL
# Gradle build-cache traffic, exercising this PR's product flow
# end-to-end (fire: runner image pinned, node active):
#
#   1. The RunnerCache reconciler provisions a private (in-cluster
#      only) Kura node for every account with Runner Profiles.
#   2. Runner dispatch includes `cache_endpoint_url` in the claim
#      response once that node is active.
#   3. The runner image (`run-job.sh`) exports it as
#      `TUIST_CACHE_ENDPOINT`, which the Tuist Gradle plugin honours
#      (TuistConfigReader.kt) for every remote build-cache request.
#
# Phases:
#   * routed   - the build uses the dispatch-provided endpoint as-is
#                (the PR's actual routing; in-cluster Service DNS).
#   * baseline - `TUIST_CACHE_ENDPOINT` is unset, so the plugin falls
#                back to its default cache resolution through
#                staging.tuist.dev (the status quo this PR improves).
#
# Each phase runs `ITERATIONS` cold (cache-miss: all local caches
# purged, artifacts pulled from the remote) and warm (cache-hit)
# builds of the Tuist Gradle plugin and reports wall-clock stats.
#
# Auth: `tuist auth login` in CI uses GitHub OIDC; staging maps the
# repo claim (tuist/tuist) to the staging `tuist/tuist` project via
# its `vcs_connections` row. Requires `id-token: write`.
on:
  workflow_dispatch:
    inputs:
      project_handle:
        description: Staging project handle the build is attributed to
        required: false
        default: tuist/tuist
        type: string
      cache_endpoint_override:
        description: Optional explicit cache endpoint for the routed phase (defaults to the dispatch-provided TUIST_CACHE_ENDPOINT)
        required: false
        type: string
      iterations:
        description: Repetitions per phase (each is a clean build that round-trips Gradle's cache)
        required: false
        default: '3'
        type: string
  push:
    branches:
      - feat/kura-scaleway-runner-cache
    paths:
      - .github/workflows/kura-runner-cache-routing-smoke.yml
permissions:
  contents: read
  # `tuist auth login` detects CI and mints a GitHub OIDC id-token to
  # exchange with staging.tuist.dev for an access token.
  id-token: write
env:
  MISE_VERSION: "2026.4.18"
  MISE_GITHUB_TOKEN: ${{ github.token }}
  MISE_TASK_RUN_AUTO_INSTALL: 0
  MISE_GITHUB_ATTESTATIONS: 0
  # Force the CLI + Gradle plugin to talk to staging, not the local
  # acceptance-test server some runner shapes bake in via
  # TUIST_SERVER_URL.
  TUIST_URL: https://staging.tuist.dev
  PROJECT_HANDLE: ${{ inputs.project_handle || 'tuist/tuist' }}
  CACHE_ENDPOINT_OVERRIDE: ${{ inputs.cache_endpoint_override || '' }}
  ITERATIONS: ${{ inputs.iterations || '3' }}
jobs:
  routing-smoke:
    # MUST be the staging-scoped profile label, not a `shape-*` label.
    # Both production and staging receive the tuist/tuist webhooks and
    # advertise identical `shape-linux-*` dispatch labels (same chart),
    # so a shape-labeled job races both environments' pollers and prod's
    # warm fleet wins; the job then runs in the production cluster
    # where staging's in-cluster Kura Service doesn't exist (NXDOMAIN /
    # dead ClusterIPs). `tuist-staging-<profile>` labels resolve only on
    # staging, pinning the job to staging kata runners.
    runs-on: tuist-staging-linux
    timeout-minutes: 30
    steps:
      - name: Resolve cache endpoint
        shell: bash {0}
        run: |
          # `TUIST_CACHE_ENDPOINT` is exported by the runner image's
          # run-job.sh when the dispatch response carried
          # `cache_endpoint_url`, i.e. when this account's private
          # runner-cache Kura node is active. That inherited value IS
          # the system under test.
          echo "pod(hostname)=$(hostname) at $(date -u +%FT%TZ)"
          endpoint="${CACHE_ENDPOINT_OVERRIDE:-${TUIST_CACHE_ENDPOINT:-}}"
          if [ -z "$endpoint" ]; then
            echo "::error::no cache endpoint: dispatch did not include cache_endpoint_url (private Kura node not active yet?) and no override input was provided"
            exit 1
          fi
          echo "routed endpoint: $endpoint"
          echo "ROUTED_ENDPOINT=$endpoint" >> "$GITHUB_ENV"
          code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 8 "$endpoint/up" 2>/dev/null) || code=000
          if [ "$code" = "000" ]; then
            echo "::error::routed endpoint unreachable: $endpoint/up"
            exit 1
          fi
          echo "reachability: $endpoint/up -> HTTP $code"
      - uses: actions/checkout@v4
      - uses: jdx/mise-action@v4.0.1
        with:
          version: ${{ env.MISE_VERSION }}
          install_args: "java gradle tuist"
          cache: "true"
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '21'
      - name: Show environment
        run: |
          set -euo pipefail
          echo "Hostname: $(hostname)"
          echo "Kernel: $(uname -a)"
          echo "Project: $PROJECT_HANDLE"
          echo "Server: $TUIST_URL"
          echo "Routed URL: $ROUTED_ENDPOINT"
          echo "Iterations: $ITERATIONS"
      - name: Authenticate with staging Tuist
        run: |
          # CI mode → GitHub OIDC: the CLI requests an id-token from
          # the workflow and exchanges it with staging.tuist.dev for an
          # access token scoped to the projects connected to this repo
          # (staging project tuist/tuist). The token lands in the local
          # credential store keyed by TUIST_URL, where the Gradle
          # plugin's TokenProvider picks it up. Clear the acceptance
          # shape's localhost server env first so the token is keyed
          # against staging.
          unset TUIST_SERVER_URL TUIST_HOSTED
          if ! tuist auth login; then
            echo "::group::tuist session logs"
            cat "$HOME"/.local/state/tuist/sessions/*/logs.txt 2>/dev/null | tail -80
            echo "::endgroup::"
            exit 1
          fi
      - name: Configure Gradle project handle
        run: |
          set -euo pipefail
          # Attribute the build to the staging project the OIDC token
          # covers.
          cat > gradle/tuist.toml <<TOML
          project = "${PROJECT_HANDLE}"
          TOML
      - name: Run A/B benchmark
        shell: bash {0}
        run: |
          IT="$ITERATIONS"
          # Clears every layer Gradle would hit before talking to the
          # remote, so each "miss" build round-trips both the local
          # build-cache and the remote Tuist build-cache.
          purge_gradle_caches() {
            rm -rf gradle/build gradle/.gradle ~/.gradle/caches/build-cache-1 || true
          }
          # Runs `./gradlew assemble --build-cache` and prints total
          # wall-clock seconds. `assemble` not `build`: keeps the run
          # cache-bound rather than test-bound. For the routed phase
          # the dispatch-provided endpoint is exported; for baseline
          # the variable is removed so the plugin uses its default
          # cache resolution.
          # Build failures must fail the smoke run: a timing for a
          # build that errored (or never reached the cache) is worse
          # than no timing, because the job would go green while the
          # routed path is broken. measure_one runs in a command
          # substitution (subshell), so failures are recorded in a
          # sentinel file rather than a shell variable, and the step
          # exits non-zero after the timing table is printed.
          rm -f /tmp/gradle-build-failures
          measure_one() {
            local path=$1
            local kind=$2 # "miss" (purge first) or "hit" (warm cache)
            if [ "$kind" = "miss" ]; then purge_gradle_caches; fi
            local started ended status=0
            started=$(date +%s.%N)
            if [ "$path" = routed ]; then
              ( cd gradle && TUIST_CACHE_ENDPOINT="$ROUTED_ENDPOINT" \
                  ./gradlew --no-daemon -Dorg.gradle.caching=true \
                  --build-cache assemble \
                  >>/tmp/gradle-${path}-${kind}.log 2>&1 ) || status=$?
            else
              ( cd gradle && env -u TUIST_CACHE_ENDPOINT \
                  ./gradlew --no-daemon -Dorg.gradle.caching=true \
                  --build-cache assemble \
                  >>/tmp/gradle-${path}-${kind}.log 2>&1 ) || status=$?
            fi
            ended=$(date +%s.%N)
            if [ "$status" -ne 0 ]; then
              echo "${path}/${kind} exit=${status}" >> /tmp/gradle-build-failures
              echo "BUILD FAILED (${path}/${kind}, exit ${status}); tail of log:" >&2
              tail -20 "/tmp/gradle-${path}-${kind}.log" >&2
            fi
            python3 -c "import sys; print(f'{float(sys.argv[1]) - float(sys.argv[2]):.3f}')" "$ended" "$started"
          }
          declare -A results
          for kind in miss hit; do
            for path in routed baseline; do
              for i in $(seq 1 "$IT"); do
                echo "::group::$path / $kind / iteration $i"
                t=$(measure_one "$path" "$kind")
                results["$path|$kind|$i"]=$t
                echo "wall-clock: ${t}s"
                tail -30 "/tmp/gradle-${path}-${kind}.log"
                echo "::endgroup::"
              done
            done
          done
          {
            echo "path,kind,iter,seconds"
            for key in "${!results[@]}"; do
              IFS='|' read -r path kind iter <<<"$key"
              echo "$path,$kind,$iter,${results[$key]}"
            done
          } | tee /tmp/results.csv
          if [ -s /tmp/gradle-build-failures ]; then
            echo "::error::Gradle build(s) failed; timings above are not trustworthy"
            cat /tmp/gradle-build-failures
            exit 1
          fi
      - name: Summarize
        run: |
          set -euo pipefail
          python3 - <<'PY'
          import csv, statistics
          rows = []
          with open('/tmp/results.csv') as f:
              for r in csv.DictReader(f):
                  rows.append({"path": r["path"], "kind": r["kind"], "iter": int(r["iter"]), "seconds": float(r["seconds"])})
          def col(kind, path):
              return [r["seconds"] for r in rows if r["kind"] == kind and r["path"] == path]
          def stats(vs):
              if not vs:
                  return None
              sv = sorted(vs)
              return {
                  "n": len(sv),
                  "min": sv[0],
                  "max": sv[-1],
                  "mean": statistics.mean(sv),
                  "median": statistics.median(sv),
              }
          print()
          print(f"{'phase':<10} {'path':<10} {'n':>3} {'min':>8} {'median':>8} {'mean':>8} {'max':>8} (seconds wall-clock)")
          print("-" * 66)
          for kind in ["miss", "hit"]:
              for path in ["baseline", "routed"]:
                  s = stats(col(kind, path))
                  if s is None:
                      print(f"{kind:<10} {path:<10} {'-':>3}")
                      continue
                  print(f"{kind:<10} {path:<10} {s['n']:>3} {s['min']:>8.2f} {s['median']:>8.2f} {s['mean']:>8.2f} {s['max']:>8.2f}")
          print()
          for kind in ["miss", "hit"]:
              base = stats(col(kind, "baseline"))
              routed = stats(col(kind, "routed"))
              if base and routed and routed["median"] > 0:
                  saved = base["median"] - routed["median"]
                  ratio = base["median"] / routed["median"]
                  print(f"{kind}: routed median is {ratio:.2f}x the speed of baseline ({saved:+.2f}s saved per build)")
          PY
      - name: Upload Gradle logs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gradle-logs
          path: /tmp/gradle-*.log
          if-no-files-found: ignore
          retention-days: 7

Real ./gradlew assemble --build-cache runs of the Gradle plugin on a staging kata runner (3 iterations per cell), claim → JIT → TUIST_CACHE_ENDPOINT exported by the runner image from the dispatch response:

| phase | path | median wall-clock |
|---|---|---|
| cache-miss (cold) | no remote cache | 31.7 s |
| cache-miss (cold) | routed (in-cluster Kura) | 7.45 s |
| cache-hit (warm) | no remote cache | 11.5 s |
| cache-hit (warm) | routed | 6.0 s |

Cold builds restore from the runner-local node 4.3× faster than recompiling (24s saved per build). Caveat: the “baseline” degraded to no-remote-cache because staging’s legacy public cache endpoints (cache-*-staging.tuist.dev) are down (separate issue — kamal app hung behind a healthy proxy/TLS). The apples-to-apples same-node comparison was completed afterwards — see the next section.

Same-node path A/B: in-cluster vs public ingress (completed)

Measured from a pod on a staging bare-metal runner node under the runner fleet’s exact NetworkPolicies, against the same Kura node over both paths (public leg via a temporary ingress, removed after the run).

Request latency (/up, 40 requests per cell):

| | public (LB + TLS + nginx) | private (in-cluster svc) | delta |
|---|---|---|---|
| cold connection p50 | 21.0 ms | 5.0 ms | ~4× |
| cold connection p90 | 24.6 ms | 6.0 ms | |
| warm (reused) | 1.8 ms | 0.5 ms | ~3.5× |

~13.7 ms of the public cold path is TLS handshake alone. Cache clients open fresh connections per job step, so the cold row is representative.

Bulk throughput (200 MB artifact, module multipart flow, 3 iterations per cell):

| | public | private |
|---|---|---|
| upload (20 × 10 MB parts) | 40–42 MB/s | 88–99 MB/s |
| download (single GET) | 107 MB/s | 107–108 MB/s |

Interpretation:

  • Uploads pay ~2.3× through the gateway — with request buffering off, every byte streams synchronously through the nginx workers (TLS decrypt + proxy), which are CPU-bound well before the link is.
  • Downloads are path-identical same-DC (~1 Gbit ceiling on both), so the gateway is not a download bottleneck when there is no WAN in the path.
  • The WAN is the real divider: this A/B has no WAN leg (runner and LB share a DC). External runners add ~20 ms RTT, where single-stream TCP becomes window-limited (roughly 12–50 MB/s at common window sizes) plus slow start on every fresh connection — that penalty applies to downloads too. Co-located runners keep the full ~1 Gbit/s with sub-percent variance.
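
The window-limited figures in the last bullet follow from the usual bound of one TCP window per round trip. The short calculation below reproduces the order of magnitude; the window sizes (250 kB and 1 MB) are assumptions picked to match the quoted 12–50 MB/s range, not values taken from the measurement.

```python
def window_limited_throughput(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput in MB/s: one window per RTT."""
    return window_bytes / rtt_seconds / 1e6


def window_to_fill(link_bits_per_s: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product in bytes: the window needed to saturate the link."""
    return link_bits_per_s / 8 * rtt_seconds


RTT = 0.020  # ~20 ms WAN round trip, as quoted for external runners

for window in (250_000, 1_000_000):  # assumed window sizes in bytes
    mbps = window_limited_throughput(window, RTT)
    print(f"{window / 1e3:.0f} kB window -> {mbps:.1f} MB/s")

print(f"window needed for 1 Gbit/s: {window_to_fill(1e9, RTT) / 1e6:.1f} MB")
```

At 20 ms RTT these windows cap a single stream at roughly 12.5 and 50 MB/s, and filling a 1 Gbit/s link would need a window of about 2.5 MB, which is why co-located runners with sub-millisecond RTT are not window-limited.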

Benchmark tooling for future runs (e.g. a true WAN measurement from a GitHub-hosted runner): tuist/cache-benchmark#2.

Findings & follow-ups surfaced by the E2E (separate issues to file)

  • Cross-env claim leak: production and staging both receive tuist/tuist webhooks and advertise identical shape-linux-* dispatch labels, so a shape-labeled job races both environments — all early smoke runs silently executed on production runners. Staging-targeting jobs must use tuist-staging-* profile labels.
  • Legacy staging cache down: cache-eu-central-staging.tuist.dev / cache-us-east-staging.tuist.dev accept TCP/TLS but the app behind kamal-proxy hangs; staging users currently have no default remote cache.
  • Staging ClickHouse schema drift: runner_jobs.log_archived_at was missing despite schema_migrations claiming the migration ran (applied manually).
  • Cilium enable-endpoint-routes=true set on staging (documented kata-compat setting; to be persisted in the cilium chart values).

Phase 3 (follow-up PR)

  • Retry backoff for self-healed nodes.
  • Cluster networking for macOS Tart VMs, so the Scaleway Apple-Silicon fleet can consume its local nodes (dispatch already gates on Catalog.fleet_on_cluster_network?/1).
  • Cluster-locality check in dispatch: cache_endpoint_url must only be handed to runners in the same cluster as the node. Svc DNS doesn’t cross clusters; staging hides this because runners and Kura share one cluster, but prod Linux runners live in Hetzner while scw-fr-par-runners nodes would live in Scaleway. This is a hard precondition for activating the Scaleway region.
  • RunnerCache test coverage.
  • Residency gating.
  • Observability.
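
The cluster-locality precondition described above amounts to a guard in the dispatch path. The sketch below shows one possible shape for it in Python; the function name and the cluster identifiers (`hetzner-prod`, `scaleway-prod`) are hypothetical, and the follow-up PR may implement the check differently.

```python
from typing import Optional


def cache_endpoint_for_claim(
    runner_cluster: str,
    node_cluster: Optional[str],
    node_url: Optional[str],
) -> Optional[str]:
    """Return the in-cluster cache URL only when the runner can resolve it.

    Service DNS does not cross clusters, so a runner in a different cluster
    gets None and falls back to the default cache resolution.
    """
    if node_url is None or node_cluster is None:
        return None  # no active private node for this account
    if runner_cluster != node_cluster:
        return None  # URL would be unresolvable from the runner
    return node_url
```

With this guard, a Hetzner-hosted runner claiming a job for an account whose node lives in Scaleway would receive no `cache_endpoint_url` rather than an unreachable one.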

Comments
tuist-atlas[bot] Jun 13, 2026

The co-located runner-cache Kura nodes feature is now available in runner image runner-image@0.6.0. Update to this version to use the in-cluster cache endpoint routing for your runners.

tuist-atlas[bot] Jun 13, 2026

This change is now available in version linux-runner-image@0.6.0. Update to ghcr.io/tuist/tuist-linux-runner:0.6.0 to use it.