
fix(infra): roll runners-controller back to 0.7.0 to restore CI docker

GitHub issue · Closed

Metadata
Source
tuist/tuist #11102
Updated
Jun 24, 2026
Domains
Compute
Details

Incident: CI docker down fleet-wide

Controller 0.8.0 was deployed to production at ~17:37 UTC (via a server-deployment.yml workflow_dispatch, ✓ Deploy to tuist). Since then, server CI Test jobs fail immediately at Initialize containers:

docker version --format '{{.Server.APIVersion}}'
failed to connect to the docker API at unix:///var/run/docker.sock ... no such file or directory

The dind sidecar isn’t coming up in new runner pods, so there’s no docker for jobs that use service containers.

Root cause

0.8.0’s only change to the runner pod template is the dind --registry-mirror=https://mirror.gcr.io flag from #11096 (the other 0.8.0 changes are autoscaler logic, which doesn’t touch the template). Adding the flag changed the dockerd command, hence the pod-template hash, churning the fleet onto new pods. The new pods’ dind sidecar never produced /var/run/docker.sock.
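
Why a one-flag change replaces every pod can be illustrated with a toy hash over the sidecar's argument list. This is only a sketch: the helper name template_hash and the base dockerd arguments are hypothetical, and cksum stands in for whatever hash is actually computed over the pod template.

```shell
# Toy illustration: any change to the sidecar command changes the template hash.
# template_hash and the base dockerd arguments are hypothetical; cksum is a
# stand-in for the real pod-template hash function.
template_hash() {
  # Emit one argument per line, then keep only the CRC field of cksum's output.
  printf '%s\n' "$@" | cksum | cut -d' ' -f1
}

old_hash=$(template_hash dockerd --host=unix:///var/run/docker.sock)
new_hash=$(template_hash dockerd --host=unix:///var/run/docker.sock \
  --registry-mirror=https://mirror.gcr.io)

if [ "$old_hash" != "$new_hash" ]; then
  echo "template changed: pods built from the old template get replaced"
fi
```

The point is only that the hash is a function of the full command, so even a flag that dockerd accepts still forces new pods and therefore fresh image pulls.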

Empirical timeline (server.yml Test job):

| Time (UTC) | docker | note |
| --- | --- | --- |
| 17:28 | ✅ works | still on pre-0.8.0 pods |
| ~17:37 | | 0.8.0 finishes rolling to prod |
| 17:50, 20:41, 20:55, 21:05 | ❌ socket missing | 100% of jobs reaching the docker step |

The failures are consistent, not intermittent, which points to the controller change rather than transient rate-limit starvation.

Most likely mechanism: the fleet-wide pod churn forced a re-pull storm of docker:28-dind, which hits the containerd-layer Docker Hub pull that the dockerd mirror does not cover, so sidecars land in ImagePullBackOff. (Confirm on a stuck pod with kubectl -n tuist-runners describe pod <pod>: expect ImagePullBackOff/toomanyrequests on docker:28-dind if this is the mechanism, or CrashLoopBackOff if dockerd rejected the flag.)
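
The confirmation step in parentheses can be scripted. The sketch below is illustrative, not part of the PR: the helper name classify_reason is hypothetical, and the commented kubectl call assumes the sidecar container is named dind and is reported under .status.containerStatuses (a native sidecar would appear under initContainerStatuses instead).

```shell
# Map a container's waiting reason to the hypothesis it supports.
# classify_reason is a hypothetical helper for this incident only.
classify_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "image pull failed: consistent with the docker:28-dind re-pull storm" ;;
    CrashLoopBackOff)
      echo "container keeps exiting: consistent with dockerd rejecting the flag" ;;
    *)
      echo "inconclusive: inspect events with kubectl describe pod" ;;
  esac
}

# Usage against a stuck pod (the container name "dind" is an assumption):
# reason=$(kubectl -n tuist-runners get pod "$pod" \
#   -o jsonpath='{.status.containerStatuses[?(@.name=="dind")].state.waiting.reason}')
# classify_reason "$reason"
```

For the pull-failure case, the events from kubectl describe pod are still needed to see whether the registry returned toomanyrequests.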

What this PR does

Pin runnersController.image.tag back to the known-good 0.7.0. The deployed controller reverts to the working pod template.

Pin-only on purpose: values-managed-common.yaml is not under infra/runners-controller/**, so this triggers no controller release — it just redeploys 0.7.0.

Side effect: 0.7.0 predates the 0.8.0 autoscaler/fleet-allocation refactor; that reships with the next clean controller release.

Deploy

Use the path that worked at 17:37: a server-deployment.yml workflow_dispatch to production (the full production cascade has been failing on the separate canary / Platform chart “lost communication” issue). The Build server image step is already on ubuntu-latest, so it won’t hit the rate limit.

⚠️ Follow-up required (or this rollback gets undone)

The --registry-mirror flag is still in main’s podtemplate.go. The next controller release will rebuild the controller with the flag and bump this pin again, re-breaking the fleet. Before any controller release:

  1. Revert (or guard) the flag in podtemplate.go, and
  2. Reship the mirror only once the containerd-layer docker:28-dind pull is also mirrored/authed (host containerd mirror, Docker Hub auth on the fleet, or a pre-pulled/baked sidecar image) so deploy-time pod churn can’t trip the rate limit.
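
One way to enforce step 1 is a pre-release check that refuses to proceed while the flag is still present in the pod template source. This is a sketch, not something the PR adds: the helper name check_no_registry_mirror is hypothetical, and the path to podtemplate.go is passed as an argument because the source does not give it in full.

```shell
# Return non-zero if the given file still contains the dind mirror flag.
# check_no_registry_mirror is a hypothetical helper; pass the real path
# to podtemplate.go as the first argument.
check_no_registry_mirror() {
  file=$1
  # -e lets the pattern start with dashes without being parsed as an option.
  if grep -q -e '--registry-mirror' "$file"; then
    echo "refusing release: --registry-mirror still present in $file" >&2
    return 1
  fi
  echo "ok: no --registry-mirror in $file"
}
```

This covers only the "revert" option. If the flag is guarded rather than removed, the string stays in the file and the check would need to be loosened, for example to look at the guard's default value.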

Validation

YAML parses. Single-line pin change (0.8.0 → 0.7.0) plus a guard comment.

Comments
tuist-atlas[bot] Jun 6, 2026

The fix rolling runners-controller back to 0.7.0 to restore CI docker is now available in runner-image@0.2.1. Update to this version to get the change.