feat(infra): mirror runner Docker Hub pulls through mirror.gcr.io
GitHub issue · Closed
What changed
Two things, in two commits:
- The fix (feat(infra)): the Linux runner dind sidecar’s dockerd now launches with --registry-mirror=https://mirror.gcr.io, so Docker Hub (docker.io) image pulls route through Google’s public pull-through cache instead of hitting Docker Hub directly.
- Bootstrap unblocks (ci(infra), temporary): the runners-controller image build and the server + Kura controller image builds (production cascade + standalone staging deploy) move to ubuntu-latest.
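As a rough illustration of the first commit, the sketch below shows one way a pod-template builder could hardcode the mirror into the sidecar's arguments. The names dindSidecarArgs and hasArg are hypothetical and not taken from the repository; the sketch also assumes the docker:dind entrypoint forwards flag-style container arguments to dockerd.

```go
package main

import "fmt"

// registryMirrorURL is the mirror named in this PR. It is a constant here
// because the PR hardcodes it instead of exposing a Helm value.
const registryMirrorURL = "https://mirror.gcr.io"

// dindSidecarArgs is a hypothetical helper (not the repository's code) that
// returns the container args for the dind sidecar. Assumption: the image's
// entrypoint passes flag-style args through to dockerd.
func dindSidecarArgs() []string {
	return []string{"--registry-mirror=" + registryMirrorURL}
}

// hasArg reports whether want appears verbatim in args.
func hasArg(args []string, want string) bool {
	for _, a := range args {
		if a == want {
			return true
		}
	}
	return false
}

func main() {
	args := dindSidecarArgs()
	fmt.Println(args, hasArg(args, "--registry-mirror=https://mirror.gcr.io"))
}
```

A unit test in the spirit of TestBuild_LinuxDindUsesRegistryMirror would make the same hasArg-style assertion against the generated Pod spec.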
Why
We recently enabled Docker support in the Tuist runners. Linux runner Pods run a privileged docker:dind sidecar, and every microVM on a bare-metal host NATs through that host’s single egress IP. Docker Hub rate-limits by IP (100 pulls / 6h unauthenticated), so the whole host shares one budget and CI jobs started failing with toomanyrequests:
buildx failed with: toomanyrequests: You have reached your unauthenticated pull rate limit.
The sidecar’s dockerd had no registry mirror and no authentication, so every docker pull / FROM went straight to registry-1.docker.io from the shared IP.
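To make the shared-budget arithmetic concrete, here is a small sketch that splits the 100 pulls / 6h limit quoted above evenly across the microVMs behind one egress IP. The microVM counts are hypothetical examples, not measurements from the fleet, and real jobs do not split the budget evenly.

```go
package main

import "fmt"

// hubUnauthLimit is the unauthenticated limit quoted in this PR:
// 100 pulls per 6-hour window per source IP.
const hubUnauthLimit = 100

// pullsPerVM returns each microVM's share of the window if the host's
// budget were split evenly. A non-positive count returns the full budget.
func pullsPerVM(microVMs int) int {
	if microVMs <= 0 {
		return hubUnauthLimit
	}
	return hubUnauthLimit / microVMs
}

func main() {
	// Hypothetical microVM counts per bare-metal host.
	for _, n := range []int{1, 10, 25} {
		fmt.Printf("%d microVMs -> %d pulls each per 6h\n", n, pullsPerVM(n))
	}
}
```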
Why the fix takes this shape
mirror.gcr.io is the lowest-effort durable mitigation that needs no infrastructure and no secret: it’s a transparent Docker Hub pull-through cache. GCR absorbs cache misses on its own backend, so the runner’s IP never contacts Docker Hub for docker.io images; dockerd only falls back to Hub directly if the mirror itself is unreachable (fail-safe). The mirror URL is hardcoded rather than exposed as a Helm knob for two reasons. First, there’s no per-environment branch point today. Second, a new controller flag would risk version skew: if the chart passed a flag that the deployed controller binary did not yet define, flag.Parse would fail with os.Exit(2) and the Pod would CrashLoop. The flag here goes to dockerd inside the docker:28-dind sidecar, which supports --registry-mirror, so there’s no such skew.
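The skew risk can be reproduced in isolation. The sketch below stands in for an older controller binary that defines only a hypothetical --namespace flag and is handed a flag it does not know. It uses flag.ContinueOnError so the failure is returned as an error; Go's default flag.CommandLine uses flag.ExitOnError, which exits with status 2 instead.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseControllerFlags mimics an older controller binary that only defines
// --namespace (a hypothetical flag chosen for illustration).
func parseControllerFlags(args []string) error {
	fs := flag.NewFlagSet("controller", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // suppress the usage text for this demo
	fs.String("namespace", "default", "namespace to watch (hypothetical)")
	return fs.Parse(args)
}

func main() {
	// A newer chart passes a flag the older binary has never heard of.
	err := parseControllerFlags([]string{"--registry-mirror=https://mirror.gcr.io"})
	fmt.Println("unknown flag:", err)

	// A flag the binary does define parses cleanly.
	fmt.Println("known flag:", parseControllerFlags([]string{"--namespace=ci"}))
}
```

With ExitOnError the first call would terminate the process, and Kubernetes would restart the container into the same failure.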
Why the temporary ubuntu-latest moves
Chicken-and-egg: the fix only takes effect once a new controller image is built and the chart redeploys, but those builds run on the rate-limited tuist-linux fleet and pull base images from Docker Hub through the same un-mirrored dind, so they fail with the very error the change fixes. The production deploy’s Build server image job already hit this. ubuntu-latest pulls via GitHub’s rotating IP pool instead of the fleet’s shared egress IP. It’s 4 vCPU / 16 GB (same memory as tuist-linux-large), so the Elixir release compile won’t OOM.
Rollout
- Merge -> controller image builds on ubuntu-latest, lands in ghcr.
- Deploy (production cascade): Build server image now runs on ubuntu-latest, helm rolls the mirrored controller.
- Verify on a fresh runner pod: docker info in the dind sidecar lists mirror.gcr.io under Registry Mirrors and a docker pull succeeds. (Existing pods are unaffected; the mirror only applies to newly created pods.)
- Revert the three runs-on changes back to tuist-linux / tuist-linux-large once the mirror is fleet-wide. Each carries a TEMPORARY: comment: git grep -n "TEMPORARY: " .github/workflows.
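The verification step could be scripted along these lines. This is a sketch under assumptions: it expects the JSON array produced by something like docker info --format '{{json .RegistryConfig.Mirrors}}' run inside the sidecar, and mirrorConfigured is a hypothetical helper, not part of this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// mirrorConfigured reports whether host appears among the mirror URLs in
// mirrorsJSON, which is assumed to be a JSON array of URL strings (or null
// when no mirror is configured).
func mirrorConfigured(mirrorsJSON []byte, host string) (bool, error) {
	var mirrors []string
	if err := json.Unmarshal(mirrorsJSON, &mirrors); err != nil {
		return false, err
	}
	for _, m := range mirrors {
		u, err := url.Parse(m)
		if err != nil {
			continue // skip entries that are not valid URLs
		}
		if u.Host == host {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Sample input shaped like the assumed docker info output.
	sample := []byte(`["https://mirror.gcr.io/"]`)
	ok, err := mirrorConfigured(sample, "mirror.gcr.io")
	fmt.Println(ok, err)
}
```

Comparing the parsed host rather than the raw string keeps the check indifferent to a trailing slash on the mirror URL.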
Scope / follow-ups (not in this PR)
- The mirror covers the dind dockerd layer (the job’s own pulls) only. The host containerd pull of docker:28-dind and the runner image from Docker Hub is a separate dependency on the same shared IP, bounded by per-node image caching. A host-level containerd registry mirror closes it.
- Durable fix is a self-hosted pull-through cache backed by our own object storage, removing the dependency on the Google-operated, unauthenticated mirror.gcr.io.
- Watch ubuntu-latest root-disk headroom (~14 GB free) for the heavy server image build; if it ENOSPCs, add a disk-reclaim step or switch that one job to a buildkit-level mirror.
Validation
- go build ./... and go test ./internal/podtemplate/ pass (new TestBuild_LinuxDindUsesRegistryMirror asserts the flag is wired into the sidecar).
- Both deploy workflow YAMLs and the controller-image workflow YAML parse.