What changed
Two parts, on the same rollout:
1. Token isolation (the split). Splits the Linux runner Pod into two containers so untrusted workflow code (including fork PRs) never shares a container with the dispatch ServiceAccount token.
- poller init container: the only container that mounts the audience-scoped projected token. Runs dispatch-poll.sh in poller mode; on a claim it stages the minted, job-scoped JIT onto a shared tuist-runner-jit emptyDir and exits. Declared after the dind sidecar so it inherits the same docker info startup gate the runner container had before the split.
- runner main container: holds no token and no dispatch env. kubelet only starts it after the poller exits, so the JIT is already staged; it runs the new run-job.sh, which reads the JIT and execs the workflow under only that single-job credential.
2. Bring the movable CI jobs onto the fleet. With the split in place the fleet matches the GitHub-hosted / Namespace security shape, so the jobs that don’t depend on GitHub-hosted tooling move to the in-house runners:
- gradle Test: the genuine fork-code job this unblocks. It is a single job with no job-level fork exclusion that runs ./gradlew test on the PR’s code (only the tuist auth login step is fork-gated). Pure JVM, runs clean on tuist-linux.
- preview-deploy Build+push: its docker gate is satisfied (#11055/#11059 added the dind sidecar), so the buildx-to-GHCR job runs on tuist-linux-large (the server-image Elixir release compile OOMs the default microVM).
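The container layout from part 1 can be sketched roughly as follows. This is an illustration only, not the controller's code: `Container` and `PodSpec` are simplified stand-ins for the Kubernetes API types, the volume name `dispatch-token` is a hypothetical placeholder, and only the container names, script names, and the `tuist-runner-jit` emptyDir come from the description above.

```go
package main

import "fmt"

// Container is a simplified stand-in for corev1.Container (illustration only).
type Container struct {
	Name         string
	Command      []string
	VolumeMounts []string // names of the volumes this container mounts
}

// PodSpec is a simplified stand-in for corev1.PodSpec (illustration only).
type PodSpec struct {
	InitContainers []Container
	Containers     []Container
}

const (
	tokenVolume = "dispatch-token"   // hypothetical name for the projected token volume
	jitVolume   = "tuist-runner-jit" // shared emptyDir named in the description
)

// buildLinuxRunnerPod lays out the split: dind first, then the poller,
// then the runner as the only main container.
func buildLinuxRunnerPod() PodSpec {
	return PodSpec{
		InitContainers: []Container{
			{Name: "dind"},
			{
				Name:         "poller",
				Command:      []string{"dispatch-poll.sh"},
				VolumeMounts: []string{tokenVolume, jitVolume},
			},
		},
		Containers: []Container{
			{
				Name:         "runner",
				Command:      []string{"run-job.sh"},
				VolumeMounts: []string{jitVolume}, // JIT only, no token
			},
		},
	}
}

// tokenHolders returns the name of every container that mounts the token volume.
func tokenHolders(spec PodSpec) []string {
	var holders []string
	all := append(append([]Container{}, spec.InitContainers...), spec.Containers...)
	for _, c := range all {
		for _, v := range c.VolumeMounts {
			if v == tokenVolume {
				holders = append(holders, c.Name)
			}
		}
	}
	return holders
}

func main() {
	// The invariant the split is meant to hold: only the poller sees the token.
	fmt.Println(tokenHolders(buildLinuxRunnerPod()))
}
```

A check shaped like `tokenHolders` is the kind of assertion a test over the built Pod could make: the token volume shows up on the poller and nowhere else.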
Why
The dispatch token is pool-scoped: a Pod that reads it could race the warm pool to claim other tenants’ queued workflow_jobs. The token already can’t touch the K8s API (wrong audience, automount off, 1h TTL, SA GC’d on Pod exit), and for trusted jobs we accepted it living in the runner container; for untrusted fork code we can’t. The JIT, by contrast, is job-scoped: it binds the runner to exactly one workflow job, so a runner already operating under it loses nothing by holding it. Moving the token out of the runner and into the poller gives a credential-free runner, which mirrors the GitHub-hosted / Namespace push model and closes the gap that kept fork-code CI on GitHub-hosted runners (the last thing keeping CI off the in-house fleet besides ARM and macOS-Xcode release builds).
Left external on purpose:
- The android jobs (Test/Build and their fork variants). ./gradlew needs the Android SDK (ANDROID_HOME), which GitHub’s hosted image pre-installs but the general-purpose tuist-linux image lacks, so moving them needs the SDK baked into the runner image first (follow-up).
- secret-scanning TruffleHog: its action can’t resolve a commit range on a self-hosted runner.
- preview-deploy’s resolve/start/teardown orchestration: lightweight internal jobs, no fleet benefit.
Reviewer notes
- Warm Pods are now Pending, not Running. With the poller as an init container, a warm-standby Linux Pod sits in Pending (poller polling in Init) until it claims a job. The stale-Pending reap in runnerpool_controller.go therefore gains an isIdle guard so an image roll racing a claim can’t reap a Pod that’s momentarily Pending right after claiming. macOS Pods stay Running (tart-kubelet), so this path only ever matched idle Pods for them anyway.
- Billing unaffected. pod-lifecycle keys on the runner container’s terminated.finishedAt; the poller and dind are init containers, absent from containerStatuses, so billing still anchors on exactly the customer job’s runtime.
- Drain finalizer unaffected. It’s label-based (runner-pool-owner) and phase-independent: a claimed Pod is held whether Pending or Running.
- Server unchanged. The 410 stale-image check reads containers[0].image (still the runner) and the owner-label stamp patches the Pod by name.
- Poller runs as runAsUser: 0 purely so it can write the root-owned JIT emptyDir; it executes only our poll script, never customer code, and the namespace already hosts the privileged dind sidecar. The runner container that runs the workflow stays non-root (image USER).
- macOS keeps the single-container shape — the Tart VM is the isolation boundary and tart-kubelet projects the token into it.
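The isIdle guard from the first note above might look something like this. It is a sketch under stated assumptions, not the code in runnerpool_controller.go: `Pod` is a simplified stand-in type, "stale" is assumed to mean "running an image other than the pool's desired one", and a Pod is assumed to count as claimed once it carries the runner-pool-owner label.

```go
package main

import "fmt"

// Pod is a simplified stand-in for corev1.Pod (illustration only).
type Pod struct {
	Name   string
	Phase  string // "Pending" or "Running"
	Image  string
	Labels map[string]string
}

// ownerLabel is the label the description says is stamped on a claimed Pod.
const ownerLabel = "runner-pool-owner"

// isIdle reports whether the Pod has not claimed a job yet.
// Assumption: a claimed Pod always carries the owner label.
func isIdle(p Pod) bool {
	_, claimed := p.Labels[ownerLabel]
	return !claimed
}

// shouldReapStalePending decides whether the stale-Pending reap may delete p.
// Assumption: "stale" means the Pod's image differs from desiredImage.
func shouldReapStalePending(p Pod, desiredImage string) bool {
	stale := p.Image != desiredImage
	// The guard: a Pod that just claimed a job may still be Pending for a
	// moment, so Pending alone is not enough to reap it.
	return p.Phase == "Pending" && stale && isIdle(p)
}

func main() {
	const desired = "runner:v2" // hypothetical image tags
	pods := []Pod{
		{Name: "warm-idle", Phase: "Pending", Image: "runner:v1"},
		{Name: "just-claimed", Phase: "Pending", Image: "runner:v1",
			Labels: map[string]string{ownerLabel: "job-123"}},
		{Name: "warm-current", Phase: "Pending", Image: "runner:v2"},
	}
	for _, p := range pods {
		fmt.Printf("%s reap=%t\n", p.Name, shouldReapStalePending(p, desired))
	}
}
```

Without the `isIdle` term, the `just-claimed` Pod would match the reap during an image roll, which is the race the note describes.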
Rollout ordering (important)
- Ship the runner image carrying run-job.sh + the poller-mode dispatch-poll.sh and repin runnersFleetLinux.pools[].runnerImage to it.
- Deploy the split-aware controller.
- Only then do the fork-job workflow moves take effect safely (they take effect on merge to main).
A new controller on an old image would set TUIST_RUNNER_JIT_OUTPUT_PATH against a dispatch-poll.sh that ignores it, so the script would exec the job inside the poller with the token still mounted. The reverse (old controller, new image) is safe: with the env unset the new script keeps the legacy exec path. The commits are scoped separately (feat(linux-runner-image) cuts the image release; feat(infra) ships the controller; ci(infra) moves the jobs).
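The four controller/image combinations behind that ordering can be laid out as a small decision table. This is a model of the behavior described above, written in Go for illustration; the real logic lives in the shell scripts and the controller, and the `outcome` type and `rolloutOutcome` function are hypothetical names.

```go
package main

import "fmt"

// outcome describes where the workflow job ends up executing.
type outcome struct {
	JobContainer string // container in which the job is exec'd
	TokenMounted bool   // whether that container mounts the dispatch token
}

// rolloutOutcome models the described behavior for one combination.
// controllerSplitAware: the controller builds the two-container Pod and sets
// TUIST_RUNNER_JIT_OUTPUT_PATH on the poller.
// imageSplitAware: the image ships the poller-mode dispatch-poll.sh and run-job.sh.
func rolloutOutcome(controllerSplitAware, imageSplitAware bool) outcome {
	switch {
	case !controllerSplitAware:
		// Old controller: single-container Pod, env unset. Both the old and
		// the new script take the legacy exec path in the runner container.
		return outcome{JobContainer: "runner", TokenMounted: true}
	case imageSplitAware:
		// New controller, new image: the poller stages the JIT and exits,
		// and the token-free runner executes the job.
		return outcome{JobContainer: "runner", TokenMounted: false}
	default:
		// New controller, old image: the old script ignores the env and
		// execs the job inside the poller, next to the token.
		return outcome{JobContainer: "poller", TokenMounted: true}
	}
}

func main() {
	for _, controller := range []bool{false, true} {
		for _, image := range []bool{false, true} {
			fmt.Printf("controllerSplitAware=%t imageSplitAware=%t -> %+v\n",
				controller, image, rolloutOutcome(controller, image))
		}
	}
}
```

Only the new-controller, old-image row runs the job inside the poller, which is why the image has to ship first.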
Validation
- go build, go vet, and the full go test ./... on the runners-controller module all pass.
- New tests: TestBuild_LinuxCredentialSplit (runner has no token/dispatch env, JIT ro; poller holds the token ro, JIT rw, runs as root, has JIT_OUTPUT_PATH), TestBuild_MacOSHasNoPollerOrTokenVolume, and TestReconcile_LeavesStalePendingClaimedPodAlone (the isIdle guard). Existing Linux tests updated for the 2-init-container shape.
- gofmt -l clean; shellcheck clean on both scripts; the image workflow now runs bash -n on run-job.sh too.
- The repointed workflows parse under actionlint (only the expected unknown-custom-label notices for tuist-linux, same as every other in-house job).
Canary validation
The split adds one per-job startup step: kubelet starts the runner container after the poller stages the JIT, instead of the previous in-process exec. The kata VM is already booted and the image cached, so this is a container-add into a live sandbox — estimated sub-second to ~1.5s (PLEG-mode dependent), downstream of the claim, so it doesn’t touch throughput, fleet density, or the warm-pool claim rate. Confirm the real delta on canary via the claim_to_running_time_ms histogram (runners PromEx plugin) before promoting to production. Scheduling footprint is unchanged — the poller init container carries no resource requests — so no capacity re-planning is needed.
🤖 Generated with Claude Code