feat(kura): co-locate runner-cache Kura nodes with the runner fleet
Summary
Adds the private-region concept on top of the umbrella-cluster topology (now on main) so a single-replica Kura node can be co-located with a customer’s runners-as-a-service pool inside the cluster. The runner dispatch chain hands the runner a cache_endpoint_url it exports as TUIST_CACHE_ENDPOINT; the Gradle plugin / CLI honor that as a cache-endpoint override, so build cache traffic stays on the cluster’s internal Service DNS instead of going through the public Kura ingress.
Validated end-to-end in staging with real Gradle build-cache traffic on a kata runner (see “Staging E2E results” below).
What’s in this PR
Regions (Tuist.Kura.Regions)
- `Regions.private?/1` + `Regions.selectable/0`. Private regions stay in `available/0` (so `create_server/1` accepts them) but are hidden from the customer picker: they are control-plane-provisioned, not customer-picked.
- Two private runner-cache regions in a `@private_region_specs` catalog (mirroring main’s spec-driven `@managed_region_specs`):
  - `scw-fr-par-runners`: production target, pinned to the `kura-scw-fr-par` node pool, `scw-bssd`.
  - `hetzner-staging-runners`: staging-only, pinned to the umbrella cluster’s `kura` node pool. Enabled in `values-managed-staging.yaml`.
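The relationship between `available/0`, `selectable/0`, and `private?/1` can be illustrated with a small sketch. This is Python rather than the project's Elixir, and it is not the actual module: the `RegionSpec` type and the `eu-public` region id are hypothetical, and only the two private region ids come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionSpec:
    id: str
    private: bool = False


# Hypothetical catalog: "eu-public" is invented for illustration;
# the two private ids are the ones named in this PR.
CATALOG = (
    RegionSpec("eu-public"),
    RegionSpec("scw-fr-par-runners", private=True),
    RegionSpec("hetzner-staging-runners", private=True),
)


def available(catalog=CATALOG):
    """Every region a server may be created in, private ones included."""
    return [r.id for r in catalog]


def is_private(region_id, catalog=CATALOG):
    """True only for regions marked as control-plane-provisioned."""
    return any(r.id == region_id and r.private for r in catalog)


def selectable(catalog=CATALOG):
    """Regions shown in the customer picker: available minus private."""
    return [r.id for r in catalog if not r.private]
```

The point of the split is that server creation validates against the full list while the picker only ever sees the filtered one.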
KuraInstance CRD
- New `spec.private` boolean: the explicit marker for a no-public-endpoint instance. The controller’s topology already does the right thing when public hosts are absent (`ClusterIP` Service, no Ingress, no Certificate). The hand-maintained CRD YAML in `infra/helm/tuist/crds/` now carries the field; Helm never upgrades `crds/`, so each environment needs a one-time `kubectl apply` of the CRD (done on staging).
Activation (Tuist.Kura.activate_server/2)
- Private regions skip the public DNS + HTTPS `/up` probe and the `account_cache_endpoints` mirror (CLI-facing; developer machines can’t reach the in-cluster endpoint).
- `Provisioner.public_url/3` interpolates `private_url_template`, e.g. `http://kura-tuist-staging.kura.svc.cluster.local:4000`.
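As a rough illustration of the template interpolation, the sketch below fills a URL template with per-instance values. The placeholder names (`account`, `env`, `namespace`, `port`) and the template syntax are assumptions made for this example; the PR only shows the resulting URL, not the template itself.

```python
def private_url(template: str, **values: str) -> str:
    """Fill a private URL template with per-instance values."""
    return template.format(**values)


# Hypothetical template shape, chosen so it reproduces the example URL
# quoted in the PR description. The real template may look different.
TEMPLATE = "http://kura-{account}-{env}.{namespace}.svc.cluster.local:{port}"

url = private_url(
    TEMPLATE, account="tuist", env="staging", namespace="kura", port="4000"
)
print(url)
```

Because the host is a cluster-internal Service name, the result is only resolvable from Pods inside the same cluster, which is why the public DNS and HTTPS probes are skipped for it.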
Lifecycle (Tuist.Kura.RunnerCache, runs in the reconciler tick)
- Identity rule: an account with ≥1 Runner Profile and the `:runners` feature flag enabled has exactly one non-destroyed node in the active private region; accounts without profiles (or flag off) have none. (Rebased onto main’s profile+flag model after `accounts.runner_max_concurrent` was dropped.)
- Self-heals: failed nodes that never had a successful deployment are retried each tick, so a transient apiserver error or missing CRD field can’t strand an account.
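The identity rule and the self-heal rule can be read as a single per-account decision made on each reconciler tick. The following Python sketch is one possible reading, not the `RunnerCache` implementation: the `Action` names and status strings are hypothetical, and the choices to destroy a node once an account no longer qualifies and to leave previously deployed failed nodes alone are assumptions the description does not spell out.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    PROVISION = "provision"
    RETRY = "retry"
    DESTROY = "destroy"


def desired_action(profile_count, runners_flag, node_status, ever_deployed=False):
    """Decide what one reconciler tick should do for one account.

    node_status is None when the account has no non-destroyed node,
    otherwise a hypothetical status string such as "active" or "failed".
    """
    wants_node = profile_count >= 1 and runners_flag
    if not wants_node:
        # Assumption: an account that no longer qualifies loses its node.
        return Action.DESTROY if node_status is not None else Action.NONE
    if node_status is None:
        return Action.PROVISION
    if node_status == "failed" and not ever_deployed:
        # Self-heal: a node that never deployed is retried every tick.
        return Action.RETRY
    # Assumption: anything else is left for other handling.
    return Action.NONE
```

Expressing the rule as a pure function of account state keeps the tick idempotent: running it twice on unchanged state yields the same action.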
Runner routing
- `Tuist.Kura.runner_cache_endpoint_url/1`: in-cluster URL for the account’s active private node, or `nil`.
- `RunnersController.dispatch` returns `cache_endpoint_url` in the claim response.
- Linux runner image: split-container shape stages the URL at `<JIT_PATH>.cache-endpoint`; `run-job.sh` exports `TUIST_CACHE_ENDPOINT` before `exec`. Rollout-bridge + macOS images export in place.
- Cache routing is a soft optimization: a staging failure logs and falls back to the CLI’s default resolution.
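The "soft optimization" behaviour can be sketched as follows. This is a Python illustration of the idea only; the real logic lives in the runner image's shell scripts. It assumes the staged value is a plain-text file whose path is the JIT path plus a `.cache-endpoint` suffix, and the function names are hypothetical.

```python
import logging
from pathlib import Path

ENV_VAR = "TUIST_CACHE_ENDPOINT"


def read_staged_endpoint(jit_path):
    """Return the staged URL, or None if it is missing or unreadable."""
    staged = Path(f"{jit_path}.cache-endpoint")  # assumed file layout
    try:
        url = staged.read_text().strip()
    except OSError as exc:
        # Soft failure: log and let the CLI use its default resolution.
        logging.warning("cache endpoint not staged (%s); using default", exc)
        return None
    return url or None


def job_env(base_env, jit_path):
    """Build the job environment, adding the override only when staged."""
    env = dict(base_env)
    url = read_staged_endpoint(jit_path)
    if url:
        env[ENV_VAR] = url
    return env
```

A missing or empty file leaves the variable unset rather than failing the job, which matches the fallback described in the last bullet.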
NetworkPolicy
- Runner-namespace egress carve-out to the `kura` namespace on 4000/50051.
- Fixes a dispatch-egress livelock: the policy’s idle-only scoping (`runner-pool-owner DoesNotExist`) revoked dispatch egress mid-claim. Cilium recomputes the identity when the server stamps the owner label, and when revocation won the race the claim response was lost, leaving the poller stuck retrying against `POLICY_DENIED` forever (observed via Hubble; both staging warm pollers were zombies, so every claim fell through to other capacity). Dispatch egress now selects all runner Pods; the audience-scoped SA token (poller-only) is the access control.
Staging E2E results
name: Kura Runner Cache Routing Smoke

# A/B measurement of the runner-local Kura build cache using REAL
# Gradle build-cache traffic, exercising this PR's product flow
# end-to-end (fire: runner image pinned, node active):
#
#   1. The RunnerCache reconciler provisions a private (in-cluster
#      only) Kura node for every account with Runner Profiles.
#   2. Runner dispatch includes `cache_endpoint_url` in the claim
#      response once that node is active.
#   3. The runner image (`run-job.sh`) exports it as
#      `TUIST_CACHE_ENDPOINT`, which the Tuist Gradle plugin honours
#      (TuistConfigReader.kt) for every remote build-cache request.
#
# Phases:
#   * routed   — the build uses the dispatch-provided endpoint as-is
#                (the PR's actual routing; in-cluster Service DNS).
#   * baseline — `TUIST_CACHE_ENDPOINT` is unset, so the plugin falls
#                back to its default cache resolution through
#                staging.tuist.dev (the status quo this PR improves).
#
# Each phase runs `ITERATIONS` cold (cache-miss: all local caches
# purged, artifacts pulled from the remote) and warm (cache-hit)
# builds of the Tuist Gradle plugin and reports wall-clock stats.
#
# Auth: `tuist auth login` in CI uses GitHub OIDC; staging maps the
# repo claim (tuist/tuist) to the staging `tuist/tuist` project via
# its `vcs_connections` row. Requires `id-token: write`.
on:
  workflow_dispatch:
    inputs:
      project_handle:
        description: Staging project handle the build is attributed to
        required: false
        default: tuist/tuist
        type: string
      cache_endpoint_override:
        description: Optional explicit cache endpoint for the routed phase (defaults to the dispatch-provided TUIST_CACHE_ENDPOINT)
        required: false
        type: string
      iterations:
        description: Repetitions per phase (each is a clean build that round-trips Gradle's cache)
        required: false
        default: '3'
        type: string
  push:
    branches:
      - feat/kura-scaleway-runner-cache
    paths:
      - .github/workflows/kura-runner-cache-routing-smoke.yml
permissions:
  contents: read
  # `tuist auth login` detects CI and mints a GitHub OIDC id-token to
  # exchange with staging.tuist.dev for an access token.
  id-token: write
env:
  MISE_VERSION: "2026.4.18"
  MISE_GITHUB_TOKEN: ${{ github.token }}
  MISE_TASK_RUN_AUTO_INSTALL: 0
  MISE_GITHUB_ATTESTATIONS: 0
  # Force the CLI + Gradle plugin to talk to staging, not the local
  # acceptance-test server some runner shapes bake in via
  # TUIST_SERVER_URL.
  TUIST_URL: https://staging.tuist.dev
  PROJECT_HANDLE: ${{ inputs.project_handle || 'tuist/tuist' }}
  CACHE_ENDPOINT_OVERRIDE: ${{ inputs.cache_endpoint_override || '' }}
  ITERATIONS: ${{ inputs.iterations || '3' }}
jobs:
  routing-smoke:
    # MUST be the staging-scoped profile label, not a `shape-*` label.
    # Both production and staging receive the tuist/tuist webhooks and
    # advertise identical `shape-linux-*` dispatch labels (same chart),
    # so a shape-labeled job races both environments' pollers and prod's
    # warm fleet wins — the job then runs in the production cluster
    # where staging's in-cluster Kura Service doesn't exist (NXDOMAIN /
    # dead ClusterIPs). `tuist-staging-<profile>` labels resolve only on
    # staging, pinning the job to staging kata runners.
    runs-on: tuist-staging-linux
    timeout-minutes: 30
    steps:
      - name: Resolve cache endpoint
        shell: bash {0}
        run: |
          # `TUIST_CACHE_ENDPOINT` is exported by the runner image's
          # run-job.sh when the dispatch response carried
          # `cache_endpoint_url` — i.e. when this account's private
          # runner-cache Kura node is active. That inherited value IS
          # the system under test.
          echo "pod(hostname)=$(hostname) at $(date -u +%FT%TZ)"
          endpoint="${CACHE_ENDPOINT_OVERRIDE:-${TUIST_CACHE_ENDPOINT:-}}"
          if [ -z "$endpoint" ]; then
            echo "::error::no cache endpoint: dispatch did not include cache_endpoint_url (private Kura node not active yet?) and no override input was provided"
            exit 1
          fi
          echo "routed endpoint: $endpoint"
          echo "ROUTED_ENDPOINT=$endpoint" >> "$GITHUB_ENV"
          code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 8 "$endpoint/up" 2>/dev/null) || code=000
          if [ "$code" = "000" ]; then
            echo "::error::routed endpoint unreachable: $endpoint/up"
            exit 1
          fi
          echo "reachability: $endpoint/up -> HTTP $code"
      - uses: actions/checkout@v4
      - uses: jdx/mise-action@v4.0.1
        with:
          version: ${{ env.MISE_VERSION }}
          install_args: "java gradle tuist"
          cache: "true"
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '21'
      - name: Show environment
        run: |
          set -euo pipefail
          echo "Hostname: $(hostname)"
          echo "Kernel: $(uname -a)"
          echo "Project: $PROJECT_HANDLE"
          echo "Server: $TUIST_URL"
          echo "Routed URL: $ROUTED_ENDPOINT"
          echo "Iterations: $ITERATIONS"
      - name: Authenticate with staging Tuist
        run: |
          # CI mode → GitHub OIDC: the CLI requests an id-token from
          # the workflow and exchanges it with staging.tuist.dev for an
          # access token scoped to the projects connected to this repo
          # (staging project tuist/tuist). The token lands in the local
          # credential store keyed by TUIST_URL, where the Gradle
          # plugin's TokenProvider picks it up. Clear the acceptance
          # shape's localhost server env first so the token is keyed
          # against staging.
          unset TUIST_SERVER_URL TUIST_HOSTED
          if ! tuist auth login; then
            echo "::group::tuist session logs"
            cat "$HOME"/.local/state/tuist/sessions/*/logs.txt 2>/dev/null | tail -80
            echo "::endgroup::"
            exit 1
          fi
      - name: Configure Gradle project handle
        run: |
          set -euo pipefail
          # Attribute the build to the staging project the OIDC token
          # covers.
          cat > gradle/tuist.toml <<TOML
          project = "${PROJECT_HANDLE}"
          TOML
      - name: Run A/B benchmark
        shell: bash {0}
        run: |
          IT="$ITERATIONS"
          # Clears every layer Gradle would hit before talking to the
          # remote, so each "miss" build round-trips both the local
          # build-cache and the remote Tuist build-cache.
          purge_gradle_caches() {
            rm -rf gradle/build gradle/.gradle ~/.gradle/caches/build-cache-1 || true
          }
          # Runs `./gradlew assemble --build-cache` and prints total
          # wall-clock seconds. `assemble` not `build`: keeps the run
          # cache-bound rather than test-bound. For the routed phase
          # the dispatch-provided endpoint is exported; for baseline
          # the variable is removed so the plugin uses its default
          # cache resolution.
          # Build failures must fail the smoke run — a timing for a
          # build that errored (or never reached the cache) is worse
          # than no timing, because the job would go green while the
          # routed path is broken. measure_one runs in a command
          # substitution (subshell), so failures are recorded in a
          # sentinel file rather than a shell variable, and the step
          # exits non-zero after the timing table is printed.
          rm -f /tmp/gradle-build-failures
          measure_one() {
            local path=$1
            local kind=$2 # "miss" (purge first) or "hit" (warm cache)
            if [ "$kind" = "miss" ]; then purge_gradle_caches; fi
            local started ended status=0
            started=$(date +%s.%N)
            if [ "$path" = routed ]; then
              ( cd gradle && TUIST_CACHE_ENDPOINT="$ROUTED_ENDPOINT" \
                  ./gradlew --no-daemon -Dorg.gradle.caching=true \
                  --build-cache assemble \
                  >>/tmp/gradle-${path}-${kind}.log 2>&1 ) || status=$?
            else
              ( cd gradle && env -u TUIST_CACHE_ENDPOINT \
                  ./gradlew --no-daemon -Dorg.gradle.caching=true \
                  --build-cache assemble \
                  >>/tmp/gradle-${path}-${kind}.log 2>&1 ) || status=$?
            fi
            ended=$(date +%s.%N)
            if [ "$status" -ne 0 ]; then
              echo "${path}/${kind} exit=${status}" >> /tmp/gradle-build-failures
              echo "BUILD FAILED (${path}/${kind}, exit ${status}) — tail of log:" >&2
              tail -20 "/tmp/gradle-${path}-${kind}.log" >&2
            fi
            python3 -c "import sys; print(f'{float(sys.argv[1]) - float(sys.argv[2]):.3f}')" "$ended" "$started"
          }
          declare -A results
          for kind in miss hit; do
            for path in routed baseline; do
              for i in $(seq 1 "$IT"); do
                echo "::group::$path / $kind / iteration $i"
                t=$(measure_one "$path" "$kind")
                results["$path|$kind|$i"]=$t
                echo "wall-clock: ${t}s"
                tail -30 "/tmp/gradle-${path}-${kind}.log"
                echo "::endgroup::"
              done
            done
          done
          {
            echo "path,kind,iter,seconds"
            for key in "${!results[@]}"; do
              IFS='|' read -r path kind iter <<<"$key"
              echo "$path,$kind,$iter,${results[$key]}"
            done
          } | tee /tmp/results.csv
          if [ -s /tmp/gradle-build-failures ]; then
            echo "::error::Gradle build(s) failed — timings above are not trustworthy"
            cat /tmp/gradle-build-failures
            exit 1
          fi
      - name: Summarize
        run: |
          set -euo pipefail
          python3 - <<'PY'
          import csv, statistics
          rows = []
          with open('/tmp/results.csv') as f:
              for r in csv.DictReader(f):
                  rows.append({"path": r["path"], "kind": r["kind"], "iter": int(r["iter"]), "seconds": float(r["seconds"])})
          def col(kind, path):
              return [r["seconds"] for r in rows if r["kind"] == kind and r["path"] == path]
          def stats(vs):
              if not vs:
                  return None
              sv = sorted(vs)
              return {
                  "n": len(sv),
                  "min": sv[0],
                  "max": sv[-1],
                  "mean": statistics.mean(sv),
                  "median": statistics.median(sv),
              }
          print()
          print(f"{'phase':<10} {'path':<10} {'n':>3} {'min':>8} {'median':>8} {'mean':>8} {'max':>8} (seconds wall-clock)")
          print("-" * 66)
          for kind in ["miss", "hit"]:
              for path in ["baseline", "routed"]:
                  s = stats(col(kind, path))
                  if s is None:
                      print(f"{kind:<10} {path:<10} {'-':>3}")
                      continue
                  print(f"{kind:<10} {path:<10} {s['n']:>3} {s['min']:>8.2f} {s['median']:>8.2f} {s['mean']:>8.2f} {s['max']:>8.2f}")
          print()
          for kind in ["miss", "hit"]:
              base = stats(col(kind, "baseline"))
              routed = stats(col(kind, "routed"))
              if base and routed and routed["median"] > 0:
                  saved = base["median"] - routed["median"]
                  ratio = base["median"] / routed["median"]
                  print(f"{kind}: routed median is {ratio:.2f}x the speed of baseline ({saved:+.2f}s saved per build)")
          PY
      - name: Upload Gradle logs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gradle-logs
          path: /tmp/gradle-*.log
          if-no-files-found: ignore
          retention-days: 7
Real ./gradlew assemble --build-cache runs of the Gradle plugin on a staging kata runner (3 iterations per cell), claim → JIT → TUIST_CACHE_ENDPOINT exported by the runner image from the dispatch response:
| phase | path | median wall-clock |
|---|---|---|
| cache-miss (cold) | no remote cache | 31.7 s |
| cache-miss (cold) | routed (in-cluster Kura) | 7.45 s |
| cache-hit (warm) | no remote cache | 11.5 s |
| cache-hit (warm) | routed | 6.0 s |
Cold builds restore from the runner-local node 4.3× faster than recompiling (24s saved per build). Caveat: the “baseline” degraded to no-remote-cache because staging’s legacy public cache endpoints (cache-*-staging.tuist.dev) are down (separate issue — kamal app hung behind a healthy proxy/TLS). The apples-to-apples same-node comparison was completed afterwards — see the next section.
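For reference, the speedup and time-saved figures follow directly from the medians in the table. The short Python sketch below redoes that arithmetic; the dictionary keys and the `speedup` helper are names chosen for this example.

```python
# Median wall-clock seconds copied from the table above.
MEDIANS = {
    ("miss", "no_remote"): 31.7,
    ("miss", "routed"): 7.45,
    ("hit", "no_remote"): 11.5,
    ("hit", "routed"): 6.0,
}


def speedup(kind):
    """Return (ratio, seconds saved) of routed vs. no-remote-cache."""
    base = MEDIANS[(kind, "no_remote")]
    routed = MEDIANS[(kind, "routed")]
    return base / routed, base - routed


for kind in ("miss", "hit"):
    ratio, saved = speedup(kind)
    print(f"{kind}: {ratio:.2f}x faster, {saved:.2f} s saved per build")
```

The cold row works out to about 4.3× and roughly 24 s saved, as quoted; the warm row comes to about 1.9× and 5.5 s.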
Same-node path A/B: in-cluster vs public ingress (completed)
Measured from a pod on a staging bare-metal runner node under the runner fleet’s exact NetworkPolicies, against the same Kura node over both paths (public leg via a temporary ingress, removed after the run).
Request latency (/up, 40 requests per cell):
| | public (LB + TLS + nginx) | private (in-cluster svc) | delta |
|---|---|---|---|
| cold connection p50 | 21.0 ms | 5.0 ms | ~4× |
| cold connection p90 | 24.6 ms | 6.0 ms | |
| warm (reused) | 1.8 ms | 0.5 ms | ~3.5× |
~13.7 ms of the public cold path is TLS handshake alone. Cache clients open fresh connections per job step, so the cold row is representative.
Bulk throughput (200 MB artifact, module multipart flow, 3 iterations per cell):
| | public | private |
|---|---|---|
| upload (20 × 10 MB parts) | 40–42 MB/s | 88–99 MB/s |
| download (single GET) | 107 MB/s | 107–108 MB/s |
Interpretation:
- Uploads pay ~2.3× through the gateway — with request buffering off, every byte streams synchronously through the nginx workers (TLS decrypt + proxy), which are CPU-bound well before the link is.
- Downloads are path-identical same-DC (~1 Gbit ceiling on both), so the gateway is not a download bottleneck when there is no WAN in the path.
- The WAN is the real divider: this A/B has no WAN leg (runner and LB share a DC). External runners add ~20 ms RTT, where single-stream TCP becomes window-limited (roughly 12–50 MB/s at common window sizes) plus slow start on every fresh connection — that penalty applies to downloads too. Co-located runners keep the full ~1 Gbit/s with sub-percent variance.
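The numbers in the interpretation can be sanity-checked with a little arithmetic. The sketch below uses the standard single-stream bound, throughput ≤ window / RTT. The window sizes (240 kB and 1 MB) are assumptions picked to bracket the "12–50 MB/s" range quoted above; they are not measured values.

```python
def window_limited_mb_per_s(window_bytes, rtt_s):
    """Upper bound on single-stream TCP throughput: one window per RTT."""
    return window_bytes / rtt_s / 1e6


def mid(lo, hi):
    """Midpoint of a reported range."""
    return (lo + hi) / 2


RTT_S = 0.020  # ~20 ms WAN round trip, from the text above

# Assumed window sizes (not measured) bracketing the quoted range.
for window in (240_000, 1_000_000):
    rate = window_limited_mb_per_s(window, RTT_S)
    print(f"window {window} B -> {rate:.1f} MB/s")

# Upload penalty through the gateway, using midpoints of the ranges.
upload_ratio = mid(88, 99) / mid(40, 42)
print(f"private/public upload ratio: {upload_ratio:.2f}x")

# 107 MB/s download expressed in Mbit/s, against a ~1 Gbit/s link.
download_mbit = 107 * 8
print(f"download: {download_mbit} Mbit/s")
```

With these assumed windows the bound comes out at 12 and 50 MB/s, the upload ratio lands near 2.3×, and 107 MB/s corresponds to 856 Mbit/s, consistent with a link ceiling of about 1 Gbit/s.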
Benchmark tooling for future runs (e.g. a true WAN measurement from a GitHub-hosted runner): tuist/cache-benchmark#2.
Findings & follow-ups surfaced by the E2E (separate issues to file)
- Cross-env claim leak: production and staging both receive `tuist/tuist` webhooks and advertise identical `shape-linux-*` dispatch labels, so a shape-labeled job races both environments; all early smoke runs silently executed on production runners. Staging-targeting jobs must use `tuist-staging-*` profile labels.
- Legacy staging cache down: `cache-eu-central-staging.tuist.dev` / `cache-us-east-staging.tuist.dev` accept TCP/TLS but the app behind kamal-proxy hangs; staging users currently have no default remote cache.
- Staging ClickHouse schema drift: `runner_jobs.log_archived_at` was missing despite `schema_migrations` claiming the migration ran (applied manually).
- Cilium `enable-endpoint-routes=true` set on staging (documented kata-compat setting; to be persisted in the cilium chart values).
Phase 3 (follow-up PR)
- Retry backoff for self-healed nodes.
- Cluster networking for macOS Tart VMs so the Scaleway Apple-Silicon fleet can consume its local nodes (dispatch already gates on `Catalog.fleet_on_cluster_network?/1`).
- Cluster-locality check in dispatch: `cache_endpoint_url` must only be handed to runners in the same cluster as the node. Svc DNS doesn’t cross clusters; staging hides this because runners and Kura share one cluster, but prod Linux runners live in Hetzner while `scw-fr-par-runners` nodes would live in Scaleway. This is a hard precondition for activating the Scaleway region.
- RunnerCache test coverage, residency gating, observability.