feat(kura): co-locate runner-cache Kura nodes with the runner fleet

GitHub issue · Closed

Source: tuist/tuist #10982
Updated: Jun 24, 2026
Domains: Kura

Summary

Adds the private-region concept on top of the umbrella-cluster topology (now on main) so a single-replica Kura node can be co-located with a customer’s runners-as-a-service pool inside the cluster. The runner dispatch chain hands the runner a cache_endpoint_url, which the runner exports as TUIST_CACHE_ENDPOINT; the Gradle plugin and CLI honor that as a cache-endpoint override, so build-cache traffic stays on the cluster’s internal Service DNS instead of going through the public Kura ingress.

Validated end-to-end in staging with real Gradle build-cache traffic on a kata runner (see “Staging E2E results” below).

What’s in this PR

Regions (Tuist.Kura.Regions)

Regions.private?/1 + Regions.selectable/0. Private regions stay in available/0 (so create_server/1 accepts them) but are hidden from the customer picker — they are control-plane-provisioned, not customer-picked.
Two private runner-cache regions in a @private_region_specs catalog (mirroring main’s spec-driven @managed_region_specs):
- scw-fr-par-runners — production target, pinned to the kura-scw-fr-par node pool, scw-bssd.
- hetzner-staging-runners — staging-only, pinned to the umbrella cluster’s kura node pool. Enabled in values-managed-staging.yaml.
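
The relationship between available/0, selectable/0 and private?/1 can be sketched as follows. This is a Python illustration of the rule, not the Elixir module: the `Region` type, the function names and the `public-eu` entry are hypothetical, and only the two private region names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    private: bool = False  # control-plane-provisioned, hidden from the picker


# Hypothetical catalog: "public-eu" is a made-up public region; the two
# private names are the ones listed in this PR.
CATALOG = [
    Region("public-eu"),
    Region("scw-fr-par-runners", private=True),
    Region("hetzner-staging-runners", private=True),
]


def available() -> list[str]:
    """Every region a server may be created in, private ones included."""
    return [r.name for r in CATALOG]


def selectable() -> list[str]:
    """Regions shown in the customer picker: available minus private."""
    return [r.name for r in CATALOG if not r.private]


def is_private(name: str) -> bool:
    """True only for catalog entries flagged as private."""
    return any(r.name == name and r.private for r in CATALOG)
```

Keeping private regions in the available set while filtering them out of the selectable set is what lets the control plane create servers there without exposing them to customers.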

KuraInstance CRD

New spec.private boolean — the explicit marker for a no-public-endpoint instance. The controller’s topology already does the right thing when public hosts are absent (ClusterIP Service, no Ingress, no Certificate). The hand-maintained CRD yaml in infra/helm/tuist/crds/ now carries the field; Helm never upgrades crds/, so each environment needs a one-time kubectl apply of the CRD (done on staging).

Activation (Tuist.Kura.activate_server/2)

Private regions skip the public DNS + HTTPS /up probe and the account_cache_endpoints mirror (CLI-facing; developer machines can’t reach the in-cluster endpoint).
Provisioner.public_url/3 interpolates private_url_template — e.g. http://kura-tuist-staging.kura.svc.cluster.local:4000.
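
As a rough illustration of the interpolation, the sketch below builds the in-cluster URL from a template. The placeholder syntax and the `service`, `namespace` and `port` parameter names are assumptions; the actual format of private_url_template is not shown in this PR. Only the resulting URL matches the example above.

```python
# Hypothetical template: the placeholder names are illustrative only.
PRIVATE_URL_TEMPLATE = "http://{service}.{namespace}.svc.cluster.local:{port}"


def private_url(service: str, namespace: str = "kura", port: int = 4000) -> str:
    """Interpolate the template into a cluster-internal Service URL."""
    return PRIVATE_URL_TEMPLATE.format(
        service=service, namespace=namespace, port=port
    )


print(private_url("kura-tuist-staging"))
```

The URL uses plain http and a `svc.cluster.local` host, which is consistent with a ClusterIP Service that has no Ingress or Certificate.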

Lifecycle (Tuist.Kura.RunnerCache, runs in the reconciler tick)

Identity rule: an account with ≥1 Runner Profile and the :runners feature flag enabled has exactly one non-destroyed node in the active private region; accounts without profiles (or flag off) have none. (Rebased onto main’s profile+flag model after accounts.runner_max_concurrent was dropped.)
Self-heals: failed nodes that never had a successful deployment are retried each tick — a transient apiserver error or missing CRD field can’t strand an account.
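
The identity rule and the self-heal behaviour can be read as a per-account decision taken on each tick. The following is a minimal sketch under stated assumptions: the `Node` fields, the status strings and the action names (`create`, `destroy`, `retry`, `noop`) are hypothetical, and the real reconciler may structure this differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    status: str                  # e.g. "active", "failed", "destroyed"
    ever_deployed: bool = False  # True once a deployment has succeeded


def wants_node(profile_count: int, runners_flag: bool) -> bool:
    """Identity rule: at least one Runner Profile and the flag enabled."""
    return profile_count >= 1 and runners_flag


def reconcile(profile_count: int, runners_flag: bool, node: Optional[Node]) -> str:
    """Return the action one tick would take for one account."""
    live = node is not None and node.status != "destroyed"
    if not wants_node(profile_count, runners_flag):
        return "destroy" if live else "noop"
    if not live:
        return "create"
    if node.status == "failed" and not node.ever_deployed:
        return "retry"  # self-heal: never had a successful deployment
    return "noop"
```

Because the decision depends only on current state, running it every tick converges on exactly one live node per eligible account.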

Runner routing

Tuist.Kura.runner_cache_endpoint_url/1: in-cluster URL for the account’s active private node, or nil.
RunnersController.dispatch returns cache_endpoint_url in the claim response.
Linux runner image: split-container shape stages the URL at <JIT_PATH>.cache-endpoint; run-job.sh exports TUIST_CACHE_ENDPOINT before exec. Rollout-bridge + macOS images export in place.
Cache routing is a soft optimization — a staging failure logs and falls back to the CLI’s default resolution.
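
The "soft optimization" behaviour can be sketched as below: the endpoint is exported only when the claim response carries a usable URL, and any problem is logged rather than failing the job. This is an illustration in Python, not the run-job.sh logic; the `job_env` helper and the scheme check are assumptions added for the example.

```python
import logging


def job_env(claim: dict, base_env: dict) -> dict:
    """Build the job environment from a claim response."""
    env = dict(base_env)
    try:
        url = claim.get("cache_endpoint_url")
        if url:
            # Assumed sanity check; the real image may validate differently.
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"unexpected cache endpoint: {url!r}")
            env["TUIST_CACHE_ENDPOINT"] = url
    except Exception as exc:
        # Soft failure: the variable stays unset, so the CLI and Gradle
        # plugin use their default cache resolution.
        logging.warning("cache routing skipped: %s", exc)
    return env
```

Leaving TUIST_CACHE_ENDPOINT unset on any failure means a routing problem costs performance, not a failed build.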

NetworkPolicy

Runner-namespace egress carve-out to the kura namespace on 4000/50051.
Fixes a dispatch-egress livelock. The policy’s idle-only scoping (runner-pool-owner DoesNotExist) revoked dispatch egress mid-claim: Cilium recomputes the identity when the server stamps the owner label, and when revocation won the race the claim response was lost, leaving the poller stuck retrying against POLICY_DENIED forever. This was observed via Hubble: both staging warm pollers were zombies, so every claim fell through to other capacity. Dispatch egress now selects all runner Pods; the audience-scoped SA token (poller-only) is the access control.

Staging E2E results

name: Kura Runner Cache Routing Smoke

# A/B measurement of the runner-local Kura build cache using REAL
# Gradle build-cache traffic, exercising this PR's product flow
# end-to-end (fire: runner image pinned, node active):
#
#   1. The RunnerCache reconciler provisions a private (in-cluster
#      only) Kura node for every account with Runner Profiles.
#   2. Runner dispatch includes `cache_endpoint_url` in the claim
#      response once that node is active.
#   3. The runner image (`run-job.sh`) exports it as
#      `TUIST_CACHE_ENDPOINT`, which the Tuist Gradle plugin honours
#      (TuistConfigReader.kt) for every remote build-cache request.
#
# Phases:
#   * routed   — the build uses the dispatch-provided endpoint as-is
#                (the PR's actual routing; in-cluster Service DNS).
#   * baseline — `TUIST_CACHE_ENDPOINT` is unset, so the plugin falls
#                back to its default cache resolution through
#                staging.tuist.dev (the status quo this PR improves).
#
# Each phase runs `ITERATIONS` cold (cache-miss: all local caches
# purged, artifacts pulled from the remote) and warm (cache-hit)
# builds of the Tuist Gradle plugin and reports wall-clock stats.
#
# Auth: `tuist auth login` in CI uses GitHub OIDC; staging maps the
# repo claim (tuist/tuist) to the staging `tuist/tuist` project via
# its `vcs_connections` row. Requires `id-token: write`.

on:
  workflow_dispatch:
    inputs:
      project_handle:
        description: Staging project handle the build is attributed to
        required: false
        default: tuist/tuist
        type: string
      cache_endpoint_override:
        description: Optional explicit cache endpoint for the routed phase (defaults to the dispatch-provided TUIST_CACHE_ENDPOINT)
        required: false
        type: string
      iterations:
        description: Repetitions per phase (each is a clean build that round-trips Gradle's cache)
        required: false
        default: '3'
        type: string
  push:
    branches:
      - feat/kura-scaleway-runner-cache
    paths:
      - .github/workflows/kura-runner-cache-routing-smoke.yml

permissions:
  contents: read
  # `tuist auth login` detects CI and mints a GitHub OIDC id-token to
  # exchange with staging.tuist.dev for an access token.
  id-token: write

env:
  MISE_VERSION: "2026.4.18"
  MISE_GITHUB_TOKEN: ${{ github.token }}
  MISE_TASK_RUN_AUTO_INSTALL: 0
  MISE_GITHUB_ATTESTATIONS: 0
  # Force the CLI + Gradle plugin to talk to staging, not the local
  # acceptance-test server some runner shapes bake in via
  # TUIST_SERVER_URL.
  TUIST_URL: https://staging.tuist.dev
  PROJECT_HANDLE: ${{ inputs.project_handle || 'tuist/tuist' }}
  CACHE_ENDPOINT_OVERRIDE: ${{ inputs.cache_endpoint_override || '' }}
  ITERATIONS: ${{ inputs.iterations || '3' }}

jobs:
  routing-smoke:
    # MUST be the staging-scoped profile label, not a `shape-*` label.
    # Both production and staging receive the tuist/tuist webhooks and
    # advertise identical `shape-linux-*` dispatch labels (same chart),
    # so a shape-labeled job races both environments' pollers and prod's
    # warm fleet wins — the job then runs in the production cluster
    # where staging's in-cluster Kura Service doesn't exist (NXDOMAIN /
    # dead ClusterIPs). `tuist-staging-<profile>` labels resolve only on
    # staging, pinning the job to staging kata runners.
    runs-on: tuist-staging-linux
    timeout-minutes: 30
    steps:
      - name: Resolve cache endpoint
        shell: bash {0}
        run: |
          # `TUIST_CACHE_ENDPOINT` is exported by the runner image's
          # run-job.sh when the dispatch response carried
          # `cache_endpoint_url` — i.e. when this account's private
          # runner-cache Kura node is active. That inherited value IS
          # the system under test.
          echo "pod(hostname)=$(hostname) at $(date -u +%FT%TZ)"
          endpoint="${CACHE_ENDPOINT_OVERRIDE:-${TUIST_CACHE_ENDPOINT:-}}"
          if [ -z "$endpoint" ]; then
            echo "::error::no cache endpoint: dispatch did not include cache_endpoint_url (private Kura node not active yet?) and no override input was provided"
            exit 1
          fi
          echo "routed endpoint: $endpoint"
          echo "ROUTED_ENDPOINT=$endpoint" >> "$GITHUB_ENV"
          code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 8 "$endpoint/up" 2>/dev/null) || code=000
          if [ "$code" = "000" ]; then
            echo "::error::routed endpoint unreachable: $endpoint/up"
            exit 1
          fi
          echo "reachability: $endpoint/up -> HTTP $code"

      - uses: actions/checkout@v4

      - uses: jdx/mise-action@v4.0.1
        with:
          version: ${{ env.MISE_VERSION }}
          install_args: "java gradle tuist"
          cache: "true"

      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '21'

      - name: Show environment
        run: |
          set -euo pipefail
          echo "Hostname:   $(hostname)"
          echo "Kernel:     $(uname -a)"
          echo "Project:    $PROJECT_HANDLE"
          echo "Server:     $TUIST_URL"
          echo "Routed URL: $ROUTED_ENDPOINT"
          echo "Iterations: $ITERATIONS"

      - name: Authenticate with staging Tuist
        run: |
          # CI mode → GitHub OIDC: the CLI requests an id-token from
          # the workflow and exchanges it with staging.tuist.dev for an
          # access token scoped to the projects connected to this repo
          # (staging project tuist/tuist). The token lands in the local
          # credential store keyed by TUIST_URL, where the Gradle
          # plugin's TokenProvider picks it up. Clear the acceptance
          # shape's localhost server env first so the token is keyed
          # against staging.
          unset TUIST_SERVER_URL TUIST_HOSTED
          if ! tuist auth login; then
            echo "::group::tuist session logs"
            cat "$HOME"/.local/state/tuist/sessions/*/logs.txt 2>/dev/null | tail -80
            echo "::endgroup::"
            exit 1
          fi

      - name: Configure Gradle project handle
        run: |
          set -euo pipefail
          # Attribute the build to the staging project the OIDC token
          # covers.
          cat > gradle/tuist.toml <<TOML
          project = "${PROJECT_HANDLE}"
          TOML

      - name: Run A/B benchmark
        shell: bash {0}
        run: |
          IT="$ITERATIONS"

          # Clears every layer Gradle would hit before talking to the
          # remote, so each "miss" build round-trips both the local
          # build-cache and the remote Tuist build-cache.
          purge_gradle_caches() {
            rm -rf gradle/build gradle/.gradle ~/.gradle/caches/build-cache-1 || true
          }

          # Runs `./gradlew assemble --build-cache` and prints total
          # wall-clock seconds. `assemble` not `build`: keeps the run
          # cache-bound rather than test-bound. For the routed phase
          # the dispatch-provided endpoint is exported; for baseline
          # the variable is removed so the plugin uses its default
          # cache resolution.
          # Build failures must fail the smoke run — a timing for a
          # build that errored (or never reached the cache) is worse
          # than no timing, because the job would go green while the
          # routed path is broken. measure_one runs in a command
          # substitution (subshell), so failures are recorded in a
          # sentinel file rather than a shell variable, and the step
          # exits non-zero after the timing table is printed.
          rm -f /tmp/gradle-build-failures
          measure_one() {
            local path=$1
            local kind=$2   # "miss" (purge first) or "hit" (warm cache)
            if [ "$kind" = "miss" ]; then purge_gradle_caches; fi
            local started ended status=0
            started=$(date +%s.%N)
            if [ "$path" = routed ]; then
              ( cd gradle && TUIST_CACHE_ENDPOINT="$ROUTED_ENDPOINT" \
                  ./gradlew --no-daemon -Dorg.gradle.caching=true \
                            --build-cache assemble \
                            >>/tmp/gradle-${path}-${kind}.log 2>&1 ) || status=$?
            else
              ( cd gradle && env -u TUIST_CACHE_ENDPOINT \
                  ./gradlew --no-daemon -Dorg.gradle.caching=true \
                            --build-cache assemble \
                            >>/tmp/gradle-${path}-${kind}.log 2>&1 ) || status=$?
            fi
            ended=$(date +%s.%N)
            if [ "$status" -ne 0 ]; then
              echo "${path}/${kind} exit=${status}" >> /tmp/gradle-build-failures
              echo "BUILD FAILED (${path}/${kind}, exit ${status}) — tail of log:" >&2
              tail -20 "/tmp/gradle-${path}-${kind}.log" >&2
            fi
            python3 -c "import sys; print(f'{float(sys.argv[1]) - float(sys.argv[2]):.3f}')" "$ended" "$started"
          }

          declare -A results
          for kind in miss hit; do
            for path in routed baseline; do
              for i in $(seq 1 "$IT"); do
                echo "::group::$path / $kind / iteration $i"
                t=$(measure_one "$path" "$kind")
                results["$path|$kind|$i"]=$t
                echo "wall-clock: ${t}s"
                tail -30 "/tmp/gradle-${path}-${kind}.log"
                echo "::endgroup::"
              done
            done
          done

          {
            echo "path,kind,iter,seconds"
            for key in "${!results[@]}"; do
              IFS='|' read -r path kind iter <<<"$key"
              echo "$path,$kind,$iter,${results[$key]}"
            done
          } | tee /tmp/results.csv

          if [ -s /tmp/gradle-build-failures ]; then
            echo "::error::Gradle build(s) failed — timings above are not trustworthy"
            cat /tmp/gradle-build-failures
            exit 1
          fi

      - name: Summarize
        run: |
          set -euo pipefail
          python3 - <<'PY'
          import csv, statistics
          rows = []
          with open('/tmp/results.csv') as f:
              for r in csv.DictReader(f):
                  rows.append({"path": r["path"], "kind": r["kind"], "iter": int(r["iter"]), "seconds": float(r["seconds"])})

          def col(kind, path):
              return [r["seconds"] for r in rows if r["kind"] == kind and r["path"] == path]

          def stats(vs):
              if not vs:
                  return None
              sv = sorted(vs)
              return {
                  "n": len(sv),
                  "min": sv[0],
                  "max": sv[-1],
                  "mean": statistics.mean(sv),
                  "median": statistics.median(sv),
              }

          print()
          print(f"{'phase':<10} {'path':<10} {'n':>3} {'min':>8} {'median':>8} {'mean':>8} {'max':>8}   (seconds wall-clock)")
          print("-" * 66)
          for kind in ["miss", "hit"]:
              for path in ["baseline", "routed"]:
                  s = stats(col(kind, path))
                  if s is None:
                      print(f"{kind:<10} {path:<10} {'-':>3}")
                      continue
                  print(f"{kind:<10} {path:<10} {s['n']:>3} {s['min']:>8.2f} {s['median']:>8.2f} {s['mean']:>8.2f} {s['max']:>8.2f}")
              print()

          for kind in ["miss", "hit"]:
              base = stats(col(kind, "baseline"))
              routed = stats(col(kind, "routed"))
              if base and routed and routed["median"] > 0:
                  saved = base["median"] - routed["median"]
                  ratio = base["median"] / routed["median"]
                  print(f"{kind}: routed median is {ratio:.2f}x the speed of baseline ({saved:+.2f}s saved per build)")
          PY

      - name: Upload Gradle logs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gradle-logs
          path: /tmp/gradle-*.log
          if-no-files-found: ignore
          retention-days: 7

Real ./gradlew assemble --build-cache runs of the Gradle plugin on a staging kata runner (3 iterations per cell), claim → JIT → TUIST_CACHE_ENDPOINT exported by the runner image from the dispatch response:

phase	path	median wall-clock
cache-miss (cold)	no remote cache	31.7 s
cache-miss (cold)	routed (in-cluster Kura)	7.45 s
cache-hit (warm)	no remote cache	11.5 s
cache-hit (warm)	routed	6.0 s

Cold builds restore from the runner-local node about 4.3× faster than recompiling (roughly 24 s saved per build). Caveat: the “baseline” degraded to no-remote-cache because staging’s legacy public cache endpoints (cache-*-staging.tuist.dev) are down (a separate issue: the kamal app hangs behind a healthy proxy/TLS). The apples-to-apples same-node comparison was completed afterwards; see the next section.
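
The headline figures follow directly from the medians in the table. The short check below recomputes them the same way the workflow's Summarize step does (baseline median divided by routed median, and their difference); the dictionary layout and the `compare` helper are illustrative only.

```python
# Median wall-clock seconds, copied from the table above.
medians = {
    ("miss", "no remote cache"): 31.7,
    ("miss", "routed"): 7.45,
    ("hit", "no remote cache"): 11.5,
    ("hit", "routed"): 6.0,
}


def compare(kind: str) -> tuple[float, float]:
    """Return (speed ratio, seconds saved) for one phase."""
    base = medians[(kind, "no remote cache")]
    routed = medians[(kind, "routed")]
    return base / routed, base - routed


for kind in ("miss", "hit"):
    ratio, saved = compare(kind)
    print(f"{kind}: {ratio:.2f}x, {saved:.2f}s saved")
```

The cold (miss) ratio comes out just under 4.3 and the warm (hit) ratio just under 2, in line with the rounded figures quoted above.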

Same-node path A/B: in-cluster vs public ingress (completed)

Measured from a pod on a staging bare-metal runner node under the runner fleet’s exact NetworkPolicies, against the same Kura node over both paths (public leg via a temporary ingress, removed after the run).

Request latency (/up, 40 requests per cell):

	public (LB + TLS + nginx)	private (in-cluster svc)	delta
cold connection p50	21.0 ms	5.0 ms	~4×
cold connection p90	24.6 ms	6.0 ms
warm (reused)	1.8 ms	0.5 ms	~3.5×

~13.7 ms of the public cold path is TLS handshake alone. Cache clients open fresh connections per job step, so the cold row is representative.

Bulk throughput (200 MB artifact, module multipart flow, 3 iterations per cell):

	public	private
upload (20 × 10 MB parts)	40–42 MB/s	88–99 MB/s
download (single GET)	107 MB/s	107–108 MB/s

Interpretation:

Uploads pay ~2.3× through the gateway — with request buffering off, every byte streams synchronously through the nginx workers (TLS decrypt + proxy), which are CPU-bound well before the link is.
Downloads perform identically on both paths within the same DC (both hit the ~1 Gbit/s ceiling), so the gateway is not a download bottleneck when there is no WAN in the path.
The WAN is the real divider: this A/B has no WAN leg (runner and LB share a DC). External runners add ~20 ms RTT, where single-stream TCP becomes window-limited (roughly 12–50 MB/s at common window sizes) plus slow start on every fresh connection — that penalty applies to downloads too. Co-located runners keep the full ~1 Gbit/s with sub-percent variance.
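
The window-limited estimate can be reproduced with the usual bound for a single TCP stream: at most one window of data per round trip. The sketch below is a back-of-the-envelope calculation, not a measurement; the window sizes (250 kB and 1 MB) and the 0.5 ms in-cluster round trip are assumptions chosen to show how the quoted range could arise.

```python
LINK_MB_PER_S = 125.0  # ~1 Gbit/s expressed in MB/s


def single_stream_mb_per_s(
    window_bytes: int, rtt_s: float, link_mb_per_s: float = LINK_MB_PER_S
) -> float:
    """Upper bound for one TCP stream: one window per RTT, capped by the link."""
    window_limited = window_bytes / rtt_s / 1e6
    return min(window_limited, link_mb_per_s)


# ~20 ms WAN round trip with two assumed window sizes.
for window in (250_000, 1_000_000):
    print(window, round(single_stream_mb_per_s(window, 0.020), 1))

# Assumed 0.5 ms in-cluster round trip: the link, not the window, is the limit.
print(round(single_stream_mb_per_s(1_000_000, 0.0005), 1))
```

At a 20 ms round trip the window bound sits well below the link rate, while at sub-millisecond round trips the same windows are no longer the constraint.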

Benchmark tooling for future runs (e.g. a true WAN measurement from a GitHub-hosted runner): tuist/cache-benchmark#2.

Findings & follow-ups surfaced by the E2E (separate issues to file)

Cross-env claim leak: production and staging both receive tuist/tuist webhooks and advertise identical shape-linux-* dispatch labels, so a shape-labeled job races both environments — all early smoke runs silently executed on production runners. Staging-targeting jobs must use tuist-staging-* profile labels.
Legacy staging cache down: cache-eu-central-staging.tuist.dev / cache-us-east-staging.tuist.dev accept TCP/TLS but the app behind kamal-proxy hangs; staging users currently have no default remote cache.
Staging ClickHouse schema drift: runner_jobs.log_archived_at was missing despite schema_migrations claiming the migration ran (applied manually).
Cilium enable-endpoint-routes=true set on staging (documented kata-compat setting; to be persisted in the cilium chart values).

Phase 3 (follow-up PR)

Retry backoff for self-healed nodes.
Cluster networking for macOS Tart VMs so the Scaleway Apple-Silicon fleet can consume its local nodes (dispatch already gates on Catalog.fleet_on_cluster_network?/1).
Cluster-locality check in dispatch: cache_endpoint_url must only be handed to runners in the same cluster as the node. Service DNS doesn’t cross clusters; staging hides this because runners and Kura share one cluster, but prod Linux runners live in Hetzner while scw-fr-par-runners nodes would live in Scaleway. This is a hard precondition for activating the Scaleway region.
RunnerCache test coverage.
Residency gating.
Observability.
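
The cluster-locality precondition could look roughly like the guard below. This is a hypothetical sketch of follow-up work that does not exist yet: the function name, its parameters and the cluster identifiers are all assumptions made for illustration.

```python
from typing import Optional


def cache_endpoint_for(
    runner_cluster: str,
    node_cluster: Optional[str],
    node_url: Optional[str],
) -> Optional[str]:
    """Hand out the in-cluster URL only when runner and node share a cluster."""
    if node_url is None or node_cluster is None:
        return None  # no active private node for this account
    if runner_cluster != node_cluster:
        return None  # Service DNS would not resolve across clusters
    return node_url
```

Returning None for a cross-cluster runner keeps it on the default cache resolution instead of handing it an address it cannot reach.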

Comments

tuist-atlas[bot] Jun 13, 2026

This change is now available in xcresult-processor-image@0.19.0. Update to this version to use it.

tuist-atlas[bot] Jun 13, 2026

The co-located runner-cache Kura nodes feature is now available in runner image runner-image@0.6.0. Update to this version to use the in-cluster cache endpoint routing for your runners.

tuist-atlas[bot] Jun 13, 2026

This change is now available in version linux-runner-image@0.6.0. Update to ghcr.io/tuist/tuist-linux-runner:0.6.0 to use it.