feat(infra): isolate the runner dispatch token and bring fork CI onto the fleet

Metadata

Source: tuist/tuist #11062
Updated: Jun 24, 2026
Domains: Compute

Details

What changed

Two parts, on the same rollout:

1. Token isolation (the split). Splits the Linux runner Pod into two containers so untrusted workflow code (including fork PRs) never shares a container with the dispatch ServiceAccount token.

poller init container — the only container that mounts the audience-scoped projected token. Runs dispatch-poll.sh in poller mode; on a claim it stages the minted, job-scoped JIT onto a shared tuist-runner-jit emptyDir and exits. Declared after the dind sidecar so it inherits the same docker info startup gate the runner container had before the split.
runner main container — holds no token and no dispatch env. kubelet only starts it after the poller exits, so the JIT is already staged; it runs the new run-job.sh, which reads the JIT and execs the workflow under only that single-job credential.

2. Bring the movable CI jobs onto the fleet. With the split in place the fleet matches the GitHub-hosted / Namespace security shape, so the jobs that don’t depend on GitHub-hosted tooling move to the in-house runners:

gradle Test — the genuine fork-code job this unblocks: a single job with no job-level fork exclusion that runs ./gradlew test on the PR’s code (only the tuist auth login step is fork-gated). Pure JVM, runs clean on tuist-linux.
preview-deploy Build+push — its docker gate is satisfied (#11055/#11059 added the dind sidecar), so the buildx-to-GHCR job runs on tuist-linux-large (the server-image Elixir release compile OOMs the default microVM).

Why

The dispatch token is pool-scoped: a Pod that reads it could race the warm pool to claim other tenants’ queued workflow_jobs. The token already can’t touch the K8s API (wrong audience, automount off, 1h TTL, SA GC’d on Pod exit), and for trusted jobs we accepted it living in the runner container; for untrusted fork code we can’t. The JIT, by contrast, is job-scoped (it binds the runner to exactly one workflow job), so a runner already operating under it loses nothing by holding it. Moving the token out of the runner container gives a credential-free-runner shape that mirrors the GitHub-hosted / Namespace push model and closes the gap that kept fork-code CI on GitHub-hosted runners (the last thing keeping CI off the in-house fleet besides ARM and macOS-Xcode release builds).

Left external on purpose: the android jobs (Test/Build and their fork variants) — ./gradlew needs the Android SDK (ANDROID_HOME), which GitHub’s hosted image pre-installs but the general-purpose tuist-linux image lacks, so moving them needs the SDK baked into the runner image first (follow-up); secret-scanning TruffleHog (its action can’t resolve a commit range on a self-hosted runner); and preview-deploy’s resolve/start/teardown orchestration (lightweight internal jobs, no fleet benefit).

Reviewer notes

Warm Pods are now Pending, not Running. With the poller as an init container, a warm-standby Linux Pod sits in Pending (poller polling in Init) until it claims a job. The stale-Pending reap in runnerpool_controller.go therefore gains an isIdle guard so an image roll racing a claim can’t reap a Pod that’s momentarily Pending right after claiming. macOS Pods stay Running (tart-kubelet), so this path only ever matched idle Pods for them anyway.
Billing unaffected. pod-lifecycle keys on the runner container’s terminated.finishedAt; the poller and dind are init containers, absent from containerStatuses, so billing still anchors on exactly the customer job’s runtime.
Drain finalizer unaffected. It’s label-based (runner-pool-owner), phase-independent — a claimed Pod is held whether Pending or Running.
Server unchanged. The 410 stale-image check reads containers[0].image (still the runner) and the owner-label stamp patches the Pod by name.
Poller runs as runAsUser: 0 purely so it can write the root-owned JIT emptyDir; it executes only our poll script, never customer code, and the namespace already hosts the privileged dind sidecar. The runner container that runs the workflow stays non-root (image USER).
macOS keeps the single-container shape — the Tart VM is the isolation boundary and tart-kubelet projects the token into it.
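
The isIdle guard from the first note can be sketched as below. This is not the code in runnerpool_controller.go. The `pod` struct is a hypothetical stand-in for the real Pod type, and two things are assumed: that a claim is recorded by stamping the runner-pool-owner label, and that "stale" means the Pod runs an image other than the pool's desired one.

```go
package main

import "fmt"

// ownerLabel is the label named in the reviewer notes.
const ownerLabel = "runner-pool-owner"

// pod is a hypothetical, trimmed-down stand-in for the Pod fields involved.
type pod struct {
	Phase  string // "Pending" or "Running"
	Image  string
	Labels map[string]string
}

// isIdle assumes a claim is recorded by stamping the owner label, so a Pod
// without that label has not claimed a job yet.
func isIdle(p pod) bool {
	_, claimed := p.Labels[ownerLabel]
	return !claimed
}

// shouldReapStalePending assumes "stale" means the Pod's image differs from
// the desired one. The isIdle check is the guard: a Pod that is still Pending
// right after claiming a job is left alone.
func shouldReapStalePending(p pod, desiredImage string) bool {
	return p.Phase == "Pending" && p.Image != desiredImage && isIdle(p)
}

func main() {
	warm := pod{Phase: "Pending", Image: "runner:old"}
	claimed := pod{
		Phase:  "Pending",
		Image:  "runner:old",
		Labels: map[string]string{ownerLabel: "pool-a"},
	}
	fmt.Println("reap warm standby:", shouldReapStalePending(warm, "runner:new"))
	fmt.Println("reap claimed Pod: ", shouldReapStalePending(claimed, "runner:new"))
}
```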

Rollout ordering (important)

1. Ship the runner image carrying run-job.sh + the poller-mode dispatch-poll.sh and repin runnersFleetLinux.pools[].runnerImage to it.
2. Deploy the split-aware controller.
3. Only then do the fork-job workflow moves take effect safely (they take effect on merge to main).

A new controller on an old image would set TUIST_RUNNER_JIT_OUTPUT_PATH for a dispatch-poll.sh that ignores it, so the script would exec the job inside the poller with the token still mounted. The reverse (old controller, new image) is safe: with the env unset, the new script keeps the legacy exec path. The commits are scoped separately (feat(linux-runner-image) cuts the image release; feat(infra) ships the controller; ci(infra) moves the jobs).
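
The four controller/image combinations can be laid out as a small decision table. This is a sketch of the reasoning in the paragraph above, not code from the repository. `whereJobRuns` and the `execSite` values are hypothetical names, and the two booleans reduce each side to a single question: does the controller set TUIST_RUNNER_JIT_OUTPUT_PATH, and does the image's dispatch-poll.sh honor it.

```go
package main

import "fmt"

// execSite describes where the workflow ends up executing.
type execSite string

const (
	runnerNoToken   execSite = "runner container, JIT only"
	legacyInPlace   execSite = "single container, legacy exec"
	pollerWithToken execSite = "poller container, dispatch token still mounted"
)

// whereJobRuns models the rollout combinations:
//   - env set and honored: the poller stages the JIT, the runner executes.
//   - env set but ignored (old image): the script execs inside the poller.
//   - env unset (old controller): both old and new scripts exec in place.
func whereJobRuns(controllerSetsJITPath, scriptHonorsJITPath bool) execSite {
	switch {
	case controllerSetsJITPath && scriptHonorsJITPath:
		return runnerNoToken
	case controllerSetsJITPath:
		return pollerWithToken
	default:
		return legacyInPlace
	}
}

func main() {
	for _, newController := range []bool{false, true} {
		for _, newImage := range []bool{false, true} {
			fmt.Printf("new controller=%v, new image=%v -> %s\n",
				newController, newImage, whereJobRuns(newController, newImage))
		}
	}
}
```

Only the new-controller, old-image row lands in the poller with the token mounted, which is why the image has to ship first.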

Validation

go build, go vet, full go test ./... on the runners-controller module — all pass.
New tests: TestBuild_LinuxCredentialSplit (runner has no token/dispatch env, JIT ro; poller holds the token ro, JIT rw, runs as root, has JIT_OUTPUT_PATH), TestBuild_MacOSHasNoPollerOrTokenVolume, and TestReconcile_LeavesStalePendingClaimedPodAlone (the isIdle guard). Existing Linux tests updated for the 2-init-container shape.
gofmt -l clean; shellcheck clean on both scripts; the image workflow now also runs bash -n on run-job.sh.
The repointed workflows parse under actionlint (only the expected unknown-custom-label notices for tuist-linux, same as every other in-house job).

Canary validation

The split adds one per-job startup step: kubelet starts the runner container after the poller stages the JIT, instead of the previous in-process exec. The kata VM is already booted and the image cached, so this is a container-add into a live sandbox — estimated sub-second to ~1.5s (PLEG-mode dependent), downstream of the claim, so it doesn’t touch throughput, fleet density, or the warm-pool claim rate. Confirm the real delta on canary via the claim_to_running_time_ms histogram (runners PromEx plugin) before promoting to production. Scheduling footprint is unchanged — the poller init container carries no resource requests — so no capacity re-planning is needed.

Comments

github-actions[bot] Jun 3, 2026

🚨 TruffleHog Secret Scan Failed

Verified secrets were detected in this pull request.

Please take the following actions:

1. Rotate the exposed credential(s) immediately - assume they are compromised
2. Remove the secret from your code - use environment variables or a secrets manager instead
3. If the secret was committed previously, you may need to rewrite git history using git filter-repo or similar tools

For more information, check the workflow run logs.

fortmarek Jun 3, 2026

Re: the TruffleHog “verified secrets” comment above — that is a false positive, not a real secret. No credential is in this diff.

What happened: this PR moved the secret-scanning job to the in-house tuist-linux fleet, and TruffleHog’s GitHub Action can’t resolve a commit range on a self-hosted runner — it falls back to scanning file:///tmp, which isn’t a git repo, so the scan errors out (COMMIT_IDS: [] … '/tmp' does not appear to be a git repository). The job runs with continue-on-error: true and an if: steps.trufflehog.outcome == 'failure' comment step, so it reports the scanner error as “secrets detected” (the job’s own “Fail if secrets detected” step warns to distinguish the two).

Fix pushed (6ad73bb): reverted secret-scanning back to ubuntu-latest. It’s fork != true (internal-only), so it never runs untrusted code and gained nothing from the fleet — it was swept in by mistake. Same revert applied to the internal-only preview-deploy resolve/start/teardown jobs. Only the genuinely fork-code-executing jobs (android Test/Build Fork, gradle Test) remain on tuist-linux. The next scan run will be green.

tuist[bot] Jun 4, 2026

🛠️ Tuist Run Report 🛠️

Tests 🧪

Project	Status	Tests	Commit
tuist-gradle-plugin	✅	177	66b874b27

Builds 🔨

Project	Status	Duration	Commit
tuist-gradle-plugin	✅	1m 55s	66b874b27

tuist-atlas[bot] Jun 5, 2026

This change is now available in runners-controller@0.7.0. Update to this version to use the isolated runner dispatch token and bring fork CI onto the fleet.