🛠️ Tuist Run Report 🛠️
Tests 🧪
| Scheme | Status | Cache hit rate | Tests | Skipped | Ran | Commit |
|---|---|---|---|---|---|---|
| TuistAcceptanceTests | ❌ | 0 % | 0 | 0 | 0 | 7be9643dc |
Builds 🔨
| Scheme | Status | Duration | Commit |
|---|---|---|---|
| TuistAcceptanceTests | ✅ | 4m 30s | a6926569a |
Hive
GitHub issue · Closed
Add docker-in-runner support for the in-house Linux self-hosted GitHub Actions fleet. Linux runner Pods now ship with a privileged docker:dind native sidecar so workflows that need services: containers, docker build, or buildx run natively on our fleet instead of staying on Namespace.
Validated end-to-end on staging: the server build’s docker_build job ran green on tuist-staging-linux after this PR was deployed.
The Linux runner fleet already wraps every Pod in a kata-qemu microVM. That isolation makes it safe to grant a sidecar container privileged: true — the privileged surface is bounded by the per-Pod kernel, not the bare-metal host’s. Without docker-in-runner, jobs that need dockerd have to stay on Namespace runners, fragmenting the CI fleet and paying Namespace for the dockerd they get out of the box.
- podtemplate.Build emits a dind init-container with restartPolicy: Always (k8s ≥ 1.29 native sidecar) running the upstream docker:dind image. Sidecar is privileged: true; runner container stays unprivileged. startupProbe: docker info blocks the runner from starting until dockerd is reachable. Mirrors the actions-runner-controller gha-runner-scale-set pattern.
- Limits == Requests on the runner makes the VM the size we asked for.
- dind-sock emptyDir at /var/run (both containers) exposes the docker socket.
- work emptyDir at /home/runner/actions-runner/_work (both containers) makes docker run -v $PWD:/x mounts resolve identically on either side.
- dind-storage emptyDir with medium: Memory at /var/lib/docker on the sidecar only: tmpfs is the only filesystem inside the microVM that satisfies both overlay2’s xattrs and BuildKit’s checksum-time xattr reads. Disk-backed emptyDir is served through virtio-fs, which forces dockerd onto the vfs driver and then trips buildx anyway.
- nr_inodes=0 tmpfs remount. kubelet’s medium: Memory emptyDir inherits the kernel default of ~1 inode per 16 KiB; an npm node_modules tree exhausts that long before the byte cap. The sidecar wraps dockerd in sh -c "mount -o remount,nr_inodes=0 /var/lib/docker && exec dockerd ...".
- --dind-image manager flag + runnersController.dindImage chart value pin the sidecar image (default docker:28-dind, Renovate-bumped).
- docker-ce-cli, docker-buildx-plugin, docker-compose-plugin from the official Docker apt repo.
- docker group GID to 123 so the runner user can reach the sidecar’s socket without sudo (sidecar runs dockerd --group=123).
- dispatch-poll.sh is unchanged from the pre-PR shape.
- runnersController.dindImage value, threaded to the manager as --dind-image.
- runnerpool.yaml template renders the Linux pool with runtimeClass: kata-qemu required.
- sha-009aa3f528ee.
- Build + push timeout 40m → 60m (the buildx driver swap from PR #10886 removed Namespace’s persistent build cache).
- ghcr.io/tuist/tuist-buildcache:server (mode=max) so first-built layers get reused on follow-up deploys.
- helm@3.16.3 kubectl@1.31.3 in mise install args and added mise use --global so the shims actually resolve. Set MISE_GITHUB_ATTESTATIONS: 0 to stop burning the runner user’s GitHub API quota on rerun-heavy days.
- mise.toml: helm 4.2.0 → 3.16.3 because aqua’s helm provider builds the download URL without the v prefix and 4.x 404s upstream. Revert when aqua’s registry catches up.
- Server canary 26409481185: all 9 jobs green including Docker build on tuist-tuist-runner-pool-linux-ubuntu-22-04-runner-0bfc3d76, build wall time 14 min on the staging fleet.
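To make the inode ceiling behind the nr_inodes=0 remount concrete, here is a back-of-the-envelope sketch. It takes the ~1 inode per 16 KiB ratio quoted above at face value; the 8 GiB cap and the 4 KiB average file size are illustrative assumptions, not measurements from this PR, and directories (which also consume inodes) are ignored.

```python
KIB = 1024
GIB = 1024 ** 3


def tmpfs_fill_at_inode_exhaustion(cap_bytes, avg_file_bytes, bytes_per_inode=16 * KIB):
    """Return (inode_limit, fraction of the byte cap used when inodes run out).

    bytes_per_inode defaults to the ratio quoted in the PR text.
    A fraction of 1.0 means the byte cap is hit first.
    """
    inode_limit = cap_bytes // bytes_per_inode
    bytes_written = inode_limit * avg_file_bytes
    return inode_limit, min(1.0, bytes_written / cap_bytes)


# Assumed numbers: an 8 GiB tmpfs filled with files averaging 4 KiB each.
inode_limit, fill = tmpfs_fill_at_inode_exhaustion(8 * GIB, 4 * KIB)
print(inode_limit, fill)  # prints: 524288 0.25
```

Under these assumptions the mount reports no space left on device with three quarters of its bytes still free. Remounting with nr_inodes=0 removes the inode limit, so only the byte cap applies.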
- operation not supported on /noora/priv/static: fixed by tmpfs /var/lib/docker.
- EOF / Cannot connect to the Docker daemon: kata defaulted the VM to 2 GiB; fixed by setting memory limits so kata sizes from them.
- mix compile ... cannot allocate memory: bumped pod 8 → 16 → 32 → 48 GiB.
- no space left on device during npm install: tmpfs default inode count exhausted; fixed by nr_inodes=0 remount in sidecar entrypoint.

This PR gives workflows dockerd, not Namespace-parity build speed. Each Pod is single-shot: the tmpfs at /var/lib/docker and any RUN --mount=type=cache mounts vaporize when the job exits. Base image pulls (swift:6.2-bookworm ~1 GB, node:22-slim ~150 MB) and mix deps/aube install/swift toolchain steps run cold on every job.
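The deploy workflow’s Build + push already gets a registry-backed BuildKit cache (ghcr.io/tuist/tuist-buildcache:server, mode=max) added in this PR. With mode=max, layers from every build stage are exported and re-imported, so the mix deps and npm install steps are cache hits whenever their inputs are unchanged and nothing is re-downloaded. That is close to the effect of Namespace’s persistent disk cache, just over the network, with one caveat: BuildKit does not include the contents of RUN --mount=type=cache mounts (mix/hex, aube, npm, swift toolchain) in an exported cache, so a step that does re-run starts from an empty cache mount.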
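As a rough sketch of what the registry-backed cache looks like on the command line, the snippet below assembles a docker buildx build invocation with --cache-from and --cache-to pointing at the cache ref named above. The image tag and build context are hypothetical placeholders, and the deploy workflow may pass these options through a GitHub Action rather than a raw command.

```python
import shlex

# Cache ref taken from the PR text.
CACHE_REF = "ghcr.io/tuist/tuist-buildcache:server"


def buildx_command(image_tag, context=".", cache_ref=CACHE_REF):
    """Build the argv for a buildx build that imports and exports a registry cache."""
    return [
        "docker", "buildx", "build",
        # Import previously exported layers from the registry.
        "--cache-from", f"type=registry,ref={cache_ref}",
        # Export layers from all stages (mode=max), not only the final image.
        "--cache-to", f"type=registry,ref={cache_ref},mode=max",
        "--tag", image_tag,
        "--push",
        context,
    ]


# "registry.example/server:dev" is a made-up tag for illustration.
command = buildx_command("registry.example/server:dev")
print(shlex.join(command))
```

Returning an argv list rather than a single string keeps the comma-separated cache options intact when the command is handed to subprocess.run.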
What we still pay every build that Namespace doesn’t:
- FROM image pulls against external registries (swift:6.2-bookworm ~1 GB compressed, node:22-slim ~150 MB). On Namespace these are a local NVMe read; for us they’re a fresh network pull. ~2-3 min.
- ghcr.io/tuist/tuist-buildcache:server. ~1-2 min depending on the working set.

CPU is not an advantage for us at the current settings: pod.cpuMilli: 2000 gives each runner microVM 2 vCPU vs. Namespace’s namespace-profile-default-with-volume 4 vCPU. The bare-metal AX42-U has 8 cores but we only allocate 2 of them per Pod; the rest sit idle (the original choice was driven by RAM density, not CPU). Bumping pod.cpuMilli is a cheap follow-up if the build turns out CPU-bound rather than I/O-bound.
Realistic shape:
The follow-up items below (node-local pull-through registry mirror, kata direct-volume for /var/lib/docker) close the cache-pull gap; bumping cpuMilli closes the CPU gap if it turns out to matter.
Follow-up ideas, in order of bang-for-buck:
1. distribution registry or dragonfly/spegel on each bare-metal node; configure dockerd’s registry-mirrors to hit it first. Wipes the base-image pull cost across Pods on the same host. Low complexity, biggest single-step win.
2. /var/lib/docker via kata direct-volume. Mount a node-disk local PV into the dind sidecar at /var/lib/docker using kata’s direct-volume feature, which bypasses virtio-fs (the reason we’re on tmpfs today) and gives real ext4 inside the microVM with full overlay2 + xattr support. This is also the correctness/portability fix: with real-disk storage, workflows can use the default docker-container buildx driver and skip every tmpfs-related size/inode/driver: docker workaround. Same shape as GitHub-hosted / Namespace; an action that works there works here, no modifications. BuildKit state surviving across jobs is the perf bonus. Higher complexity; needs per-node block device provisioning.
3. buildkitd as a per-host DaemonSet outside the runner Pods; workflows connect via docker buildx create --driver remote tcp://<node>:1234. Single shared cache state; needs auth + tenant isolation if we ever onboard customer workloads.

For now: the dind sidecar unblocks the migration off Namespace for correctness (services: containers + docker build work), and registry buildcache gets us partial parity for buildx layer reuse. Closing the rest of the gap is the cache-warming work above.
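For the first follow-up, pointing dockerd at a mirror comes down to a registry-mirrors entry in the daemon configuration. The sketch below renders such a daemon.json; the mirror address is a hypothetical placeholder, and how the file would reach the dind sidecar (baked into the image, mounted, or passed as a dockerd flag) is left open.

```python
import json


def daemon_config(mirror_url):
    """Return a dockerd daemon.json payload that tries mirror_url before the upstream registry."""
    return {"registry-mirrors": [mirror_url]}


# Hypothetical node-local mirror address; not taken from the PR.
config = daemon_config("http://mirror.internal:5000")
print(json.dumps(config, indent=2))
```

Docker’s registry-mirrors setting applies to Docker Hub pulls only. That covers the swift and node base images mentioned above, but pulls from other registries such as ghcr.io would still go direct.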
- Docker build CI job onto our own runners. This PR leaves docker_build on namespace-profile-default-with-volume (unchanged from main) so the merge isn’t gated on a self-hosted build running against a shared staging cluster that other branches concurrently deploy to. The dind sidecar + the tuist-staging-linux-large pool are validated and in place; the follow-up flips runs-on: to tuist-staging-linux-large (and bumps the job timeout for the bare-metal fleet’s cold-cache tax) once the controller + pool config is deployed to prod via the release pipeline and the env is no longer contended. That’s the real migration off Namespace for the server image build.
- runnersController.features.dindImage gate. The flag exists only as a transition safety: canary / production still pin controller 0.3.0, which doesn’t register --dind-image, so an unguarded chart upgrade would crash flag.Parse. Once the release pipeline ships a controller version that includes this PR’s binary changes and bumps runnersController.image.tag in values-managed-common.yaml, every env has a compatible binary and the gate is dead weight. Follow-up PR removes the gate entirely (template line ungated, the value drops from values.yaml + values-managed-staging.yaml).

The pod shape is taken from ARC’s gha-runner-scale-set chart almost verbatim. Concrete choices we lifted (each subtle enough that getting it wrong silently breaks something):
- initContainer + restartPolicy: Always) with startupProbe: exec docker info, which replaces what would otherwise be a polling loop in the runner’s entrypoint.
- DOCKER_GROUP_GID=123 env on the sidecar + --group=123 on dockerd + a docker group pinned to GID 123 in the runner image. The runner reaches the socket without sudo only when all three agree.
- work emptyDir mounted at the same path in both containers (/home/runner/actions-runner/_work for us) so docker run -v $PWD:/x resolves identically on either side.
- dind-sock emptyDir at /var/run for the socket itself.

Where we diverge from ARC, and the reasoning:
- /var/lib/docker goes on the node’s disk via overlay2 and can be a PVC for cross-job cache. We’re on bare-metal nodes with kata-qemu microVMs underneath, and virtio-fs can’t carry overlay2’s xattrs; that’s the entire reason our /var/lib/docker lives on tmpfs with the nr_inodes=0 remount. The follow-up “kata direct-volume” item in the Performance section is what would close that gap and make our shape match ARC’s.
- containerMode: dind vs containerMode: kubernetes. ARC offers a second mode that translates the workflow’s container:/services: blocks into sibling Kubernetes Pods on the cluster, with no privileged surface anywhere and a real PVC for the work volume. The cost: no docker build. We picked dind because the server image needs docker_build. Worth knowing exists for any future runner pool that only needs services:.
- docker:dind-rootless variant. Same image, no privileged container. Compelling for ARC users on shared multi-tenant clusters. Less relevant for us because kata-qemu already bounds the privileged blast radius to a microVM.
- runner-scale-set-listener Pod that pulls GitHub queue depth and scales the pool. We do the equivalent in-process via the Tuist server’s dispatch_for_sa. Same outcome; different code path.

What’s reassuring: ARC documents the inter-job cache loss in dind mode as a known limitation and points users at exactly the two answers we use (type=registry/type=gha BuildKit cache exporters) plus containerMode: kubernetes for workloads that don’t need docker build. The places this PR improvises (tmpfs + remount + 48 GiB pod) are kata-substrate workarounds, not improvements over ARC, and the Performance follow-ups move us toward ARC’s standard shape, not away from it.
One ARC user-facing pitfall worth knowing because it affects us too: containers started inside a workflow are siblings of the dind sidecar, not children of the runner container. Workflows that docker network create and try to attach the runner container itself to the new network don’t work the way users expect — the runner is in the Pod’s network namespace, not the dind sidecar’s docker bridge.
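The pod shape discussed in this section can be gathered into one sketch. The Python dict below mirrors Kubernetes manifest field names and uses only values stated in this description; it is not the Go code in podtemplate.Build. The runner image name is a placeholder, resource requests and limits are omitted, and the dockerd flags beyond --group=123 are elided in the source, so they are left out here too.

```python
WORK_DIR = "/home/runner/actions-runner/_work"


def build_pod_spec(dind_image="docker:28-dind", runner_image="runner:placeholder"):
    """Assemble a PodSpec-shaped dict for a runner with a privileged dind native sidecar."""
    sidecar = {
        "name": "dind",
        "image": dind_image,
        "restartPolicy": "Always",  # init container + Always = native sidecar
        "securityContext": {"privileged": True},
        # Lift the tmpfs inode limit, then hand off to dockerd.
        # Further dockerd flags are elided in the PR text and omitted here.
        "command": ["sh", "-c",
                    "mount -o remount,nr_inodes=0 /var/lib/docker && exec dockerd --group=123"],
        "env": [{"name": "DOCKER_GROUP_GID", "value": "123"}],
        # The runner container does not start until this probe succeeds.
        "startupProbe": {"exec": {"command": ["docker", "info"]}},
        "volumeMounts": [
            {"name": "dind-sock", "mountPath": "/var/run"},
            {"name": "work", "mountPath": WORK_DIR},
            {"name": "dind-storage", "mountPath": "/var/lib/docker"},
        ],
    }
    runner = {
        "name": "runner",
        "image": runner_image,  # placeholder name, unprivileged by default
        "volumeMounts": [
            {"name": "dind-sock", "mountPath": "/var/run"},
            {"name": "work", "mountPath": WORK_DIR},
        ],
    }
    return {
        "runtimeClassName": "kata-qemu",
        "initContainers": [sidecar],
        "containers": [runner],
        "volumes": [
            {"name": "dind-sock", "emptyDir": {}},
            {"name": "work", "emptyDir": {}},
            {"name": "dind-storage", "emptyDir": {"medium": "Memory"}},  # tmpfs
        ],
    }


def shared_mounts(spec):
    """Return {volume name: path} for volumes mounted at the same path in both containers."""
    side = {m["name"]: m["mountPath"] for m in spec["initContainers"][0]["volumeMounts"]}
    run = {m["name"]: m["mountPath"] for m in spec["containers"][0]["volumeMounts"]}
    return {name: path for name, path in side.items() if run.get(name) == path}


spec = build_pod_spec()
print(shared_mounts(spec))
```

The shared_mounts helper is a small check for the invariant called out above: the socket and work volumes must sit at identical paths on both sides, while dind-storage belongs to the sidecar alone.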
- go test ./... in infra/runners-controller/.
- helm lint clean on staging/canary/production values overlays.
- docker_build job ran green on tuist-staging-linux during canary validation (the runs-on flip itself is a follow-up PR after this lands).
- runnersController.image.tag will catch up via the release pipeline post-merge.

🤖 Generated with Claude Code