
feat(infra): self-hosted Docker Hub pull-through cache for the Linux runner fleet

GitHub issue · Closed

Source: tuist/tuist #11106
Updated: Jun 24, 2026
Domains: Compute

What

A self-hosted Docker Hub pull-through cache for the Linux runner fleet, plus the wiring to route runner docker pulls through it. Replaces the model where every runner pulls docker.io directly.

Why

After Docker support was enabled in the runners, CI began failing with toomanyrequests. Every microVM on a bare-metal host NATs through that host’s single egress IP, so the whole host shares one Docker Hub per-IP budget (100 pulls per 6 hours for anonymous clients). A stopgap dockerd --registry-mirror=https://mirror.gcr.io flag was tried and took CI Docker down fleet-wide: the controller pod-template change churned the fleet, and the resulting containerd-layer re-pull of docker:28-dind then exhausted the budget. That stopgap was reverted (#11113, rollback #11102); this PR is the durable fix.
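To make the shared-budget problem concrete, here is a small back-of-the-envelope sketch. The 100-pull anonymous limit comes from the paragraph above; the microVM count and pulls-per-job figures in the usage lines are assumed purely for illustration and are not measurements from the fleet.

```python
# Illustrative arithmetic only: VM counts and pulls per job are assumed values.
ANON_PULLS_PER_WINDOW = 100  # Docker Hub anonymous per-IP budget per 6-hour window


def jobs_until_throttled(pulls_per_job: int, budget: int = ANON_PULLS_PER_WINDOW) -> int:
    """How many jobs one egress IP can serve before hitting toomanyrequests."""
    if pulls_per_job <= 0:
        raise ValueError("pulls_per_job must be positive")
    return budget // pulls_per_job


def per_vm_share(vms_per_host: int, budget: int = ANON_PULLS_PER_WINDOW) -> float:
    """Average pulls available to each microVM when all share one host IP."""
    if vms_per_host <= 0:
        raise ValueError("vms_per_host must be positive")
    return budget / vms_per_host


# Hypothetical host: 20 microVMs, each job pulling 4 images.
print(jobs_until_throttled(4))  # jobs per window for the whole host
print(per_vm_share(20))         # pulls per microVM per window
```

Under these assumed numbers a host exhausts its budget after a few dozen jobs per window, which is why fleet churn alone was enough to trip the limit.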

A pull-through cache turns N anonymous runners into one authenticated cache client with a near-100% hit rate (the same approach Blacksmith and Namespace ship). It also decouples runners from Docker Hub availability and keeps the org credential off the untrusted runner pods.

What’s in this PR

  • registry-cache component: CNCF distribution registry:2 in proxy mode, S3-backed on the existing object storage under a docker-mirror prefix, scheduled on the elastic pool (off the runner fleet), with its image sourced from ECR Public so it cannot deadlock bootstrapping itself.
  • runners-cache-egress NetworkPolicy: runner Pods are isolated to public-HTTPS-only egress (and the LB has no hairpin), so the in-cluster ClusterIP cache needs an explicit egress allow. Mirrors the existing dispatch-egress policy.
  • dind dockerd to cache wiring: when runnersController.registryMirror is set (auto-derived from the cache Service), the controller stamps the dind sidecar’s dockerd with --registry-mirror plus a matching --insecure-registry (the cache is served over plain HTTP in-cluster). This is plumbed through a --registry-mirror-url manager flag and gated behind features.registryMirror, so the flag can never reach a controller binary that does not recognize it.
  • docker:28-dind to ECR Public: this sidecar image is pulled by host containerd, which the dockerd mirror cannot cover, so it is repinned to public.ecr.aws/docker/library/docker:28-dind (byte-identical digest, verified with crane). This closes the containerd-layer exposure that caused the incident.
  • Upstream auth: each env’s cache authenticates its Docker Hub pulls with a Docker Hub org access token (DOCKER_HUB_CREDENTIALS, ESO-synced per env), which lifts the cold-fill rate-limit ceiling. The token lives only on the cache, never on the runners.
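The dind wiring described in the list above can be sketched roughly as follows. This is not the controller's actual code: the function name, its parameters, and the choice to add --insecure-registry only for http URLs are assumptions made for illustration. The two dockerd flags themselves are the ones named in this PR.

```python
from urllib.parse import urlsplit


def dind_dockerd_args(base_args, registry_mirror, feature_enabled):
    """Return dockerd args for the dind sidecar (hypothetical helper).

    The mirror flags are added only when the feature gate is on and a
    mirror URL is set; otherwise the base args are returned unchanged.
    """
    args = list(base_args)
    if not feature_enabled or not registry_mirror:
        return args

    parts = urlsplit(registry_mirror)
    if not parts.netloc:
        raise ValueError(f"mirror URL needs a scheme and host: {registry_mirror!r}")

    args.append(f"--registry-mirror={registry_mirror}")
    # Assumption: the insecure-registry entry is only needed for plain HTTP,
    # and it takes host:port rather than a full URL.
    if parts.scheme == "http":
        args.append(f"--insecure-registry={parts.netloc}")
    return args


# Hypothetical in-cluster Service address.
print(dind_dockerd_args(["dockerd"], "http://cache.runners.svc:5000", True))
print(dind_dockerd_args(["dockerd"], "http://cache.runners.svc:5000", False))
```

With the gate off, the sidecar's arguments are untouched, which matches the "deployed but inert" state described for all three environments.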

Enabled vs gated, per env

component                                  staging   canary   production
cache (registryCache.enabled)              on        on       on
upstream auth (proxy.auth.enabled)         on        on       on
dind to cache (features.registryMirror)    off       off      off

The cache and auth deploy on merge but are inert: nothing routes through them until the dind-to-cache mirror is flipped on, which is the one switch left off.

Security model

Runner pods run untrusted fork workflow code, and the existing credential split keeps org credentials out of that execution context. This PR preserves that: runners reach the cache anonymously in-cluster, and the Docker Hub org token lives only on the trusted cache service. Authenticating the runners directly (for example Docker Pro on the dind) would have undone that property.

Why these choices over the alternatives

  • Pull-through cache vs. just authenticating runner pulls: auth alone fixes the rate limit, but it distributes the org token to untrusted runner pods, does not decouple from Docker Hub outages, re-fetches every layer over the internet, and scales Docker Hub traffic with the fleet. The cache wins on all four; a Docker Pro token is a cheap add-on to the cache’s upstream leg, not a substitute.
  • ECR Public vs. a ghcr mirror for the dind image: ECR Public is a faithful mirror (identical digest) and already the team’s pattern (postgres, registry), so there is no mirror workflow to maintain.
  • Derived mirror URL: --registry-mirror-url defaults to this release’s cache Service, so per-env go-live is just flipping the gate (no hardcoded URL, no fullname/namespace footgun).
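As a rough illustration of the derived-URL idea in the last bullet, the sketch below builds a default mirror URL from a release name and namespace and lets an explicit value override it. The Service naming pattern, the helper name, and the port are assumptions: 5000 is the registry's default listen port, but the chart's real Service name and port are not stated here.

```python
def derived_mirror_url(release_fullname, namespace, port=5000, override=None):
    """Default the mirror URL to this release's cache Service (hypothetical naming).

    An explicit override wins; otherwise the URL is derived so that no
    environment has to hardcode a fullname or namespace.
    """
    if override:
        return override
    # Assumed Service name pattern: "<fullname>-registry-cache".
    host = f"{release_fullname}-registry-cache.{namespace}.svc.cluster.local"
    return f"http://{host}:{port}"


# Hypothetical release and namespace names.
print(derived_mirror_url("runners", "tuist-staging"))
print(derived_mirror_url("runners", "tuist-staging", override="http://10.0.0.5:5000"))
```

Because the default is computed per release, enabling the mirror in a new environment needs only the feature gate, not a per-env URL value.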

Validation (staging)

  • Cache deploys and registry:2 is healthy; a pull-through served an alpine manifest fetched from Docker Hub (HTTP 200).
  • runners-cache-egress: a tuist.dev/runner=true pod reached the cache, while the apiserver stayed blocked (isolation intact).
  • dind flag pre-flight: a throwaway dind with the exact flags started cleanly, and docker pull busybox routed through the cache (confirmed in the cache access logs).
  • Controller stamping: a custom controller build deployed to staging stamped --registry-mirror onto a real runner pod’s dind sidecar.
  • Upstream auth: the staging DOCKER_HUB_CREDENTIALS item syncs via ESO (SecretSynced, both fields).
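The first validation step (a manifest served through the cache with HTTP 200) could be repeated with a small check like the one below. It is a sketch, not the script that was used: the helper names and the cache address in the usage line are hypothetical. The /v2/<name>/manifests/<reference> path is the standard registry HTTP API route, and official images live under the library/ namespace.

```python
import urllib.request

# Manifest media types a registry may return for an image tag.
ACCEPT = ", ".join([
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
])


def manifest_url(cache_base, repository, reference="latest"):
    """Build the registry v2 manifest URL; bare names map to library/<name>."""
    if "/" not in repository:
        repository = f"library/{repository}"
    return f"{cache_base.rstrip('/')}/v2/{repository}/manifests/{reference}"


def cache_serves_manifest(cache_base, repository, reference="latest", timeout=10.0):
    """Return True if the cache answers the manifest request with HTTP 200."""
    req = urllib.request.Request(
        manifest_url(cache_base, repository, reference),
        headers={"Accept": ACCEPT},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError/HTTPError, timeouts, refused connections
        return False


# Hypothetical cache address; run from a pod allowed by runners-cache-egress.
print(manifest_url("http://cache.runners.svc:5000", "alpine"))
```

Calling cache_serves_manifest from a runner-labelled pod would exercise both the NetworkPolicy allow and the pull-through path in one request.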

Rollout after merge

Merging cuts a runners-controller release that carries --registry-mirror-url (and bumps the common image pin), and deploys the authenticated-but-inert caches plus the ECR dind image fleet-wide. Go-live is then one switch per env:

  1. Set features.registryMirror: true on staging, then confirm a runner job’s docker pull shows an authenticated pull-through in the cache logs.
  2. Repeat on canary, which is architecturally identical to staging.
  3. Repeat on production, where the dind image is already off Docker Hub so the fleet churn no longer spends the Docker Hub budget.

Only the dind-to-cache flip has a hard dependency: a released controller build that recognizes the flag, which the merge produces. The cache, upstream auth, and ECR dind image have no such dependency and go live on merge.

Comments
tuist-atlas[bot] Jun 9, 2026

The self-hosted Docker Hub pull-through cache feature is now available in runners-controller@0.12.0. Update to this version to deploy the registry cache, upstream auth, and ECR Public dind image pinning. The dind-to-cache mirror wiring is included but gated behind features.registryMirror for controlled rollout per environment.