fix(infra): roll runners-controller back to 0.7.0 to restore CI docker
GitHub issue · Closed
Incident: CI docker down fleet-wide
Controller 0.8.0 was deployed to production at ~17:37 UTC (via a server-deployment.yml workflow_dispatch, ✓ Deploy to tuist). Since then, server CI Test jobs fail immediately at Initialize containers:
docker version --format '{{.Server.APIVersion}}'
failed to connect to the docker API at unix:///var/run/docker.sock ... no such file or directory
The dind sidecar isn’t coming up in new runner pods, so there’s no docker for jobs that use service containers.
Root cause
0.8.0’s only change to the runner pod template is the dind --registry-mirror=https://mirror.gcr.io flag from #11096 (the other 0.8.0 changes are autoscaler logic, which doesn’t touch the template). Adding the flag changed the dockerd command, hence the pod-template hash, churning the fleet onto new pods. The new pods’ dind sidecar never produced /var/run/docker.sock.
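To illustrate why a one-flag change replaces every pod, here is a minimal Go sketch. It assumes the template hash is some digest over the rendered container command; `templateHash` and the pre-0.8.0 argument list are hypothetical stand-ins, not the controller's actual code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// templateHash is an illustrative stand-in for a pod-template hash.
// The controller's real hash function is not shown in this PR; any digest
// over the rendered command behaves the same way for this argument.
func templateHash(args []string) string {
	sum := sha256.Sum256([]byte(strings.Join(args, "\x00")))
	return hex.EncodeToString(sum[:])[:10]
}

func main() {
	// Hypothetical pre-0.8.0 dind command (assumption: real args may differ).
	before := []string{"dockerd", "--host=unix:///var/run/docker.sock"}
	// Same command plus the mirror flag added in #11096.
	after := append(append([]string{}, before...), "--registry-mirror=https://mirror.gcr.io")

	fmt.Println("hash without flag:", templateHash(before))
	fmt.Println("hash with flag:   ", templateHash(after))
	fmt.Println("pods must be replaced:", templateHash(before) != templateHash(after))
}
```

Because the digest covers the whole command, any flag change yields a new hash, so every existing pod no longer matches the desired template and has to be recreated.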
Empirical timeline (server.yml Test job):
| Time (UTC) | docker | note |
|---|---|---|
| 17:28 | ✅ works | still on pre-0.8.0 pods |
| ~17:37 | — | 0.8.0 finishes rolling to prod |
| 17:50, 20:41, 20:55, 21:05 | ❌ socket missing | 100% of jobs reaching the docker step |
The consistency (not intermittent) is the tell that it’s the controller change, not transient rate-limit starvation.
Most likely mechanism: the fleet-wide pod churn forced a re-pull storm of docker:28-dind, which hits the containerd-layer Docker Hub pull that the dockerd mirror does not cover, so sidecars land in ImagePullBackOff. (Confirm on a stuck pod: kubectl -n tuist-runners describe pod <pod> — ImagePullBackOff/toomanyrequests on docker:28-dind vs CrashLoopBackOff if dockerd rejected the flag.)
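The confirmation step above amounts to a small decision table. A minimal Go sketch follows, assuming the only input is the waiting reason reported for the dind sidecar; `classifyWaitingReason` and the `Diagnosis` values are hypothetical names used for illustration.

```go
package main

import "fmt"

// Diagnosis names which hypothesis from the incident notes a sidecar state supports.
type Diagnosis string

const (
	PullFailing  Diagnosis = "image pull failing (consistent with the re-pull storm hypothesis)"
	FlagRejected Diagnosis = "dockerd crashing (consistent with the flag being rejected)"
	Inconclusive Diagnosis = "inconclusive, inspect pod events and sidecar logs"
)

// classifyWaitingReason maps a container waiting reason, as shown by
// `kubectl describe pod`, to the hypothesis it supports.
func classifyWaitingReason(reason string) Diagnosis {
	switch reason {
	case "ImagePullBackOff", "ErrImagePull":
		return PullFailing
	case "CrashLoopBackOff":
		return FlagRejected
	default:
		return Inconclusive
	}
}

func main() {
	for _, r := range []string{"ImagePullBackOff", "CrashLoopBackOff", "ContainerCreating"} {
		fmt.Printf("%-18s -> %s\n", r, classifyWaitingReason(r))
	}
}
```

A pull failure alone does not prove rate limiting; the pod events would still need to show toomanyrequests for docker:28-dind.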
What this PR does
Pin runnersController.image.tag back to the known-good 0.7.0. The deployed controller reverts to the working pod template.
Pin-only on purpose: values-managed-common.yaml is not under infra/runners-controller/**, so this triggers no controller release — it just redeploys 0.7.0.
Side effect: 0.7.0 predates the 0.8.0 autoscaler/fleet-allocation refactor; that reships with the next clean controller release.
Deploy
Use the path that worked at 17:37: a server-deployment.yml workflow_dispatch to production (the full production cascade has been failing on the separate canary / Platform chart “lost communication” issue). The Build server image step is already on ubuntu-latest, so it won’t hit the rate limit.
⚠️ Follow-up required (or this rollback gets undone)
The --registry-mirror flag is still in main’s podtemplate.go. The next controller release will rebuild it and rebump this pin, re-breaking the fleet. Before any controller release:
- Revert (or guard) the flag in podtemplate.go, and
- Reship the mirror only once the containerd-layer docker:28-dind pull is also mirrored/authed (host containerd mirror, Docker Hub auth on the fleet, or a pre-pulled/baked sidecar image) so deploy-time pod churn can’t trip the rate limit.
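One possible shape for the "guard" option is sketched below in Go. This is an assumption about how podtemplate.go could be structured, not its actual contents: `dindArgs` and the `DIND_REGISTRY_MIRROR` variable are hypothetical names, and only the `--registry-mirror` flag itself comes from the change under discussion.

```go
package main

import (
	"fmt"
	"os"
)

// dindArgs builds the dockerd argument list for the dind sidecar.
// The mirror flag is appended only when mirrorURL is non-empty, so the
// default output matches the flag-free command.
func dindArgs(base []string, mirrorURL string) []string {
	args := append([]string{}, base...) // copy so the caller's slice is untouched
	if mirrorURL != "" {
		args = append(args, "--registry-mirror="+mirrorURL)
	}
	return args
}

func main() {
	// DIND_REGISTRY_MIRROR is a hypothetical opt-in setting; unset means no mirror.
	base := []string{"dockerd"}
	fmt.Println(dindArgs(base, os.Getenv("DIND_REGISTRY_MIRROR")))
}
```

With the guard defaulting to off, a later controller release would leave the dockerd command as it was without the flag, and the mirror could be switched on separately once the sidecar image pull is covered.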
Validation
YAML parses. Single-line pin change (0.8.0 → 0.7.0) plus a guard comment.