The self-hosted Docker Hub pull-through cache feature is now available in runners-controller@0.12.0. Update to this version to deploy the registry cache, upstream auth, and ECR Public dind image pinning. The dind-to-cache mirror wiring is included but gated behind features.registryMirror for controlled rollout per environment.
feat(infra): self-hosted Docker Hub pull-through cache for the Linux runner fleet
GitHub issue · Closed
What
A self-hosted Docker Hub pull-through cache for the Linux runner fleet, plus the wiring to route runner docker pulls through it. Replaces the model where every runner pulls docker.io directly.
Why
After Docker support was enabled in the runners, CI began failing with toomanyrequests. Every microVM on a bare-metal host NATs through that host’s single egress IP, so the whole host shares one Docker Hub per-IP budget (100 pulls / 6h anonymous). A stopgap dockerd --registry-mirror=https://mirror.gcr.io flag was tried and took CI docker down fleet-wide: the controller pod-template change churned the fleet, and the containerd-layer docker:28-dind re-pull then exhausted the budget. That stopgap was reverted (#11113, rollback #11102); this PR is the durable fix.
A pull-through cache turns N anonymous runners into one authenticated cache client with a near-100% hit rate (the same approach Blacksmith and Namespace ship). It also decouples runners from Docker Hub availability and keeps the org credential off the untrusted runner pods.
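The budget arithmetic behind this can be sketched as follows. Only the 100 pulls / 6h anonymous figure comes from the text above; the runner, job, and image counts are hypothetical numbers chosen for illustration, and the model ignores details such as manifest revalidation against the upstream.

```python
# Docker Hub's anonymous limit is counted per source IP, so every microVM
# behind one host's NAT draws from the same budget.
ANON_PULLS_PER_WINDOW = 100  # anonymous pulls per 6h per IP (figure from the text)


def pulls_per_runner(runners_on_host: int) -> float:
    """Anonymous pulls each runner gets per window when sharing one egress IP."""
    return ANON_PULLS_PER_WINDOW / runners_on_host


def upstream_pulls(jobs: int, images_per_job: int, distinct_images: int, cached: bool) -> int:
    """Upstream Docker Hub pulls per window, with or without a pull-through cache.

    Simplification: with a warm cache, each distinct image is fetched upstream once.
    """
    if cached:
        return distinct_images
    return jobs * images_per_job


# Hypothetical load: 40 runners on one host, 200 jobs per window,
# 3 images per job, 25 distinct images overall.
print(pulls_per_runner(40))                      # 2.5
print(upstream_pulls(200, 3, 25, cached=False))  # 600
print(upstream_pulls(200, 3, 25, cached=True))   # 25
```

Under these assumed numbers the uncached fleet would need six times the anonymous budget, while the cached fleet's upstream traffic depends on the number of distinct images rather than on fleet size.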
What’s in this PR
- `registry-cache` component: CNCF distribution `registry:2` in proxy mode, S3-backed on the existing object storage under a `docker-mirror` prefix, scheduled on the elastic pool (off the runner fleet), with its image sourced from ECR Public so it cannot deadlock bootstrapping itself.
- `runners-cache-egress` NetworkPolicy: runner Pods are isolated to public-HTTPS-only egress (and the LB has no hairpin), so the in-cluster ClusterIP cache needs an explicit egress allow. Mirrors the existing dispatch-egress policy.
- dind dockerd to cache wiring: when `runnersController.registryMirror` is set (auto-derived from the cache Service), the controller stamps the dind sidecar’s dockerd with `--registry-mirror` plus a matching `--insecure-registry` (the cache is plain http in-cluster). Plumbed through a `--registry-mirror-url` manager flag, gated behind `features.registryMirror` so the flag can never reach a controller binary that does not recognize it.
- `docker:28-dind` to ECR Public: this sidecar image is pulled by host containerd, which the dockerd mirror cannot cover, so it is repinned to `public.ecr.aws/docker/library/docker:28-dind` (byte-identical digest, verified with `crane`). This closes the containerd-layer exposure that caused the incident.
- Upstream auth: each env’s cache authenticates its Docker Hub pulls with a Docker Hub org access token (`DOCKER_HUB_CREDENTIALS`, ESO-synced per env), which lifts the cold-fill rate-limit ceiling. The token lives only on the cache, never on the runners.
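As a rough illustration of the dind wiring described above, the stamping logic might look like the sketch below. It is not the controller's actual code: the helper name `dind_dockerd_args` and the example Service URL are hypothetical. Only the two dockerd flags and the gating behaviour are taken from the description.

```python
from typing import List, Optional
from urllib.parse import urlsplit


def dind_dockerd_args(mirror_url: Optional[str], registry_mirror_enabled: bool) -> List[str]:
    """Extra dockerd args for the dind sidecar (hypothetical helper).

    Returns nothing unless the feature gate is on and a mirror URL is set,
    so an ungated deployment never sees the mirror flags.
    """
    if not registry_mirror_enabled or not mirror_url:
        return []
    parts = urlsplit(mirror_url)
    args = [f"--registry-mirror={mirror_url}"]
    if parts.scheme == "http":
        # dockerd refuses plain-http registries unless they are listed as
        # insecure; the value is host[:port], not a full URL.
        args.append(f"--insecure-registry={parts.netloc}")
    return args


# Hypothetical in-cluster cache Service URL.
url = "http://registry-cache.runners.svc.cluster.local:5000"
print(dind_dockerd_args(url, registry_mirror_enabled=True))
print(dind_dockerd_args(url, registry_mirror_enabled=False))  # []
```

Deriving the `--insecure-registry` value from the same URL keeps the two flags from drifting apart, which matters because a mirror that dockerd treats as untrusted is not used for pulls.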
Enabled vs gated, per env
| component | staging | canary | production |
|---|---|---|---|
| cache (`registryCache.enabled`) | on | on | on |
| upstream auth (`proxy.auth.enabled`) | on | on | on |
| dind to cache (`features.registryMirror`) | off | off | off |
The cache and auth deploy on merge but are inert: nothing routes through them until the dind-to-cache mirror is flipped on, which is the one switch left off.
Security model
Runner pods run untrusted fork workflow code, and the existing credential split keeps org credentials out of that execution context. This PR preserves that: runners reach the cache anonymously in-cluster, and the Docker Hub org token lives only on the trusted cache service. Authenticating the runners directly (for example Docker Pro on the dind) would have undone that property.
Why these choices over the alternatives
- Pull-through cache vs. just authenticating runner pulls: auth alone fixes the rate limit, but it distributes the org token to untrusted runner pods, does not decouple from Docker Hub outages, re-fetches every layer over the internet, and scales Docker Hub traffic with the fleet. The cache wins on all four; a Docker Pro token is a cheap add-on to the cache’s upstream leg, not a substitute.
- ECR Public vs. a ghcr mirror for the dind image: ECR Public is a faithful mirror (identical digest) and already the team’s pattern (postgres, registry), so there is no mirror workflow to maintain.
- Derived mirror URL: `--registry-mirror-url` defaults to this release’s cache Service, so per-env go-live is just flipping the gate (no hardcoded URL, no fullname/namespace footgun).
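A minimal sketch of the derived-URL idea, under several assumptions: the Service name suffix `-registry-cache`, port 5000 (the registry's conventional default), and the default `cluster.local` cluster domain are illustrative, not taken from the chart. The point is only that the URL is computed from the release and namespace, with an explicit value taking precedence.

```python
from typing import Optional


def cache_service_url(release: str, namespace: str, port: int = 5000) -> str:
    """In-cluster URL of the cache Service (name suffix and port are assumptions)."""
    return f"http://{release}-registry-cache.{namespace}.svc.cluster.local:{port}"


def resolve_mirror_url(explicit: Optional[str], release: str, namespace: str) -> str:
    """An explicitly configured URL wins; otherwise derive it from the release."""
    return explicit or cache_service_url(release, namespace)


print(resolve_mirror_url(None, "runners", "ci"))
# http://runners-registry-cache.ci.svc.cluster.local:5000
print(resolve_mirror_url("http://other-cache:5000", "runners", "ci"))
# http://other-cache:5000
```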
Validation (staging)
- Cache deploys and `registry:2` is healthy; a pull-through served an alpine manifest fetched from Docker Hub (HTTP 200).
- `runners-cache-egress`: a `tuist.dev/runner=true` pod reached the cache, while the apiserver stayed blocked (isolation intact).
- dind flag pre-flight: a throwaway dind with the exact flags started cleanly, and `docker pull busybox` routed through the cache (confirmed in the cache access logs).
- Controller stamping: a custom controller build deployed to staging stamped `--registry-mirror` onto a real runner pod’s dind sidecar.
- Upstream auth: the staging `DOCKER_HUB_CREDENTIALS` item syncs via ESO (SecretSynced, both fields).
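The manifest check in the first bullet could be reproduced with something like the sketch below, which requests a manifest from the cache over the registry HTTP API. The cache URL is a placeholder, the helper names are hypothetical, and this is not necessarily how the validation was actually run.

```python
import urllib.error
import urllib.request

# Manifest media types a registry client commonly accepts.
MANIFEST_TYPES = ", ".join([
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
])


def manifest_request(cache_url: str, repository: str, reference: str) -> urllib.request.Request:
    """Build a GET for /v2/<repository>/manifests/<reference> against the cache."""
    if "/" not in repository:
        # Docker Hub official images live under the "library/" namespace.
        repository = f"library/{repository}"
    url = f"{cache_url.rstrip('/')}/v2/{repository}/manifests/{reference}"
    return urllib.request.Request(url, headers={"Accept": MANIFEST_TYPES}, method="GET")


def pull_through_ok(req: urllib.request.Request, timeout: float = 10.0) -> bool:
    """True if the cache answers the manifest request with HTTP 200."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False


# Placeholder cache address; run pull_through_ok(req) from a pod that can reach it.
req = manifest_request("http://registry-cache.example:5000", "alpine", "latest")
print(req.full_url)
```

A 200 here shows only that the cache can serve the manifest; whether the upstream fetch was authenticated still has to be read from the cache logs, as the rollout steps describe.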
Rollout after merge
Merging cuts a runners-controller release that carries --registry-mirror-url (and bumps the common image pin), and deploys the authenticated-but-inert caches plus the ECR dind image fleet-wide. Go-live is then one switch per env:
- `features.registryMirror: true` on staging, then confirm a runner job’s `docker pull` shows an authenticated pull-through in the cache logs.
- canary (architecturally identical to staging).
- production, where the dind image is already off Docker Hub so the fleet churn no longer spends the Docker Hub budget.
Only the dind-to-cache flip has a hard dependency: a released controller that recognizes the flag, which the merge produces. The cache, upstream auth, and ECR dind image have no such dependency and go live on merge.