fix(infra): disable tart-kubelet orphan-VM GC on builder nodes
GitHub issue · Closed
What changed
Adds a --disable-vm-gc flag to tart-kubelet that turns off the periodic orphan-VM garbage collector, and wires it on automatically for builder-fleet Nodes via macos-host-bootstrap. Pure Nodes are unaffected.
- infra/tart-kubelet/cmd/tart-kubelet/main.go: new --disable-vm-gc flag (default false). When set, the podagent.Collector is never constructed or added to the manager, and the reconciler gets a nil GC (its existing if r.GC != nil guard then skips the reactive no-space path too).
- infra/macos-host-bootstrap/bootstrap.go: renderLaunchdPlist emits --disable-vm-gc when cfg.GHActionsRunner != nil (builder hosts), mirroring the conditional rendering of --node-labels / --provider-id. The CAPI provider already sets GHActionsRunner only for builder machines, so no provider change is needed.
- infra/vm-image-builder.md: corrects the “sits beside tart-kubelet without conflict” claim this bug disproved.
- Tests: two new renderLaunchdPlist cases (flag present for builders, absent for pure Nodes).
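As a rough illustration of how the two pieces fit together, here is a minimal sketch, not the actual tart-kubelet or macos-host-bootstrap code. The types collector, reconciler, runnerConfig, and hostConfig and the helpers newReconciler and kubeletArgs are hypothetical stand-ins; only the flag name, its default, the nil check on GC, and the GHActionsRunner condition come from the description above.

```go
package main

import (
	"flag"
	"fmt"
)

// collector is a hypothetical stand-in for podagent.Collector.
type collector struct{}

// reconciler is a hypothetical stand-in; only the nil guard on GC
// follows the description in this change.
type reconciler struct{ GC *collector }

// gcEnabled reports whether the reactive no-space path would run.
func (r *reconciler) gcEnabled() bool { return r.GC != nil }

// newReconciler parses kubelet-style args and builds the collector only
// when --disable-vm-gc is not set.
func newReconciler(args []string) (*reconciler, error) {
	fs := flag.NewFlagSet("tart-kubelet", flag.ContinueOnError)
	disableGC := fs.Bool("disable-vm-gc", false, "disable the periodic orphan-VM GC")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	r := &reconciler{}
	if !*disableGC {
		r.GC = &collector{}
	}
	return r, nil
}

// runnerConfig and hostConfig are assumed shapes for the bootstrap config.
type runnerConfig struct{}
type hostConfig struct{ GHActionsRunner *runnerConfig }

// kubeletArgs returns the extra arguments a rendered plist would carry.
func kubeletArgs(cfg hostConfig) []string {
	var args []string
	if cfg.GHActionsRunner != nil { // builder host
		args = append(args, "--disable-vm-gc")
	}
	return args
}

func main() {
	for _, cfg := range []hostConfig{{}, {GHActionsRunner: &runnerConfig{}}} {
		r, err := newReconciler(kubeletArgs(cfg))
		if err != nil {
			panic(err)
		}
		fmt.Printf("builder=%v gcEnabled=%v\n", cfg.GHActionsRunner != nil, r.gcEnabled())
	}
}
```

Leaving GC nil instead of adding a separate "disabled" boolean means the existing guard covers both the periodic and the reactive paths with no extra branching.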
Why
Release image bakes were failing at tart push with The file "nvram.bin" doesn't exist, independently on each builder host (both release-runner-image and release-xcresult-processor-image in the same release run, on two separate runners). Example: https://github.com/tuist/tuist/actions/runs/26895463934/job/79335847508
Root cause
Builder Mac minis are full Kubernetes Nodes running tart-kubelet. Its periodic orphan-VM GC (internal/podagent/garbage.go, 5-minute interval) deletes every local Tart VM not backed by a Pod scheduled to that Node. Builder Nodes never have Pods scheduled (they bake images with a host-level Packer/tart process), so the freshly built VM (tuist-runner / tuist-xcresult-processor) is always classified as an orphan and removed with tart delete.
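The classification rule can be sketched as a simple set difference. This is an assumption about the shape of the logic, not the contents of internal/podagent/garbage.go; the function orphanVMs and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// orphanVMs returns the local VM names that no Pod on this Node backs.
// podBacked is the set of VM names owned by Pods scheduled to the Node.
func orphanVMs(localVMs []string, podBacked map[string]bool) []string {
	var orphans []string
	for _, vm := range localVMs {
		if !podBacked[vm] {
			orphans = append(orphans, vm)
		}
	}
	sort.Strings(orphans) // stable order for logging and comparison
	return orphans
}

func main() {
	// On a builder Node no Pods are scheduled, so the backed set is empty
	// and every freshly baked VM is treated as an orphan.
	local := []string{"tuist-xcresult-processor", "tuist-runner"}
	fmt.Println(orphanVMs(local, map[string]bool{}))
}
```

With an empty Pod set the result is always the full list of local VMs, which is why the bake output is never spared on a builder.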
tart push uploads config → disk → NVRAM. The disk stage takes ~5+ minutes, which meets/exceeds the GC interval, so a GC pass lands mid-push and unlinks the VM bundle. The already-open disk.img survives on its file descriptor (so the disk upload completes), but nvram.bin is opened only after the disk finishes — by then the bundle is gone, hence the failure at the NVRAM layer.
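The file-descriptor behavior behind this failure mode can be reproduced in isolation. The sketch below assumes a Unix-like filesystem (macOS or Linux), where an unlinked file stays readable through an already-open descriptor. It does not call tart; simulatePush and the temporary bundle directory are hypothetical stand-ins for the push sequence and the VM bundle.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// simulatePush opens disk.img, deletes the bundle (the GC pass), finishes
// reading the disk, and only then tries to open nvram.bin.
// It returns the disk bytes read and whether nvram.bin was missing.
func simulatePush() (int, bool, error) {
	bundle, err := os.MkdirTemp("", "vm-bundle")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(bundle)

	for name, body := range map[string]string{"disk.img": "disk-contents", "nvram.bin": "nvram"} {
		if err := os.WriteFile(filepath.Join(bundle, name), []byte(body), 0o600); err != nil {
			return 0, false, err
		}
	}

	// The disk upload starts: the file is opened before the GC runs.
	disk, err := os.Open(filepath.Join(bundle, "disk.img"))
	if err != nil {
		return 0, false, err
	}
	defer disk.Close()

	// A GC pass lands mid-push and unlinks the whole bundle.
	if err := os.RemoveAll(bundle); err != nil {
		return 0, false, err
	}

	// The open descriptor still reads the full disk contents.
	data, err := io.ReadAll(disk)
	if err != nil {
		return 0, false, err
	}

	// nvram.bin is opened only now, after the bundle is gone.
	_, nvramErr := os.Open(filepath.Join(bundle, "nvram.bin"))
	return len(data), errors.Is(nvramErr, fs.ErrNotExist), nil
}

func main() {
	n, missing, err := simulatePush()
	if err != nil {
		panic(err)
	}
	fmt.Printf("disk bytes read: %d, nvram.bin missing: %v\n", n, missing)
}
```

The disk read succeeds while the later open fails, matching the observed pattern of a completed disk upload followed by an error at the NVRAM layer.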
This explains why both bakes failed identically on different hosts, why it is effectively deterministic for any push longer than the GC interval, and why earlier image releases succeeded (they ran on the old hand-bootstrapped builder hosts that did not run tart-kubelet).
Why this approach
The GC exists to reclaim disk from terminated pod-VMs and stale OCI cache. Neither exists on a builder (no Pods are scheduled), so the GC has nothing legitimate to collect there — it is pure downside. Builders already reclaim their own disk via the image-bake workflow’s “Reclaim Tart disk” step. Disabling the GC on builder Nodes is therefore the most surgical fix and matches the documented “kubelet stays idle on builders” intent. A name-based ownership guard in the GC was considered as durable hardening but is out of scope here.
Rollout note
tart-kubelet and macos-host-bootstrap are baked into the capi-provider-scaleway-applesilicon operator image. This fix takes effect on live builders only after the operator image is rebuilt and the builder hosts are re-bootstrapped/rolled so their launchd plist is re-rendered with --disable-vm-gc. Until then, re-running the failed release jobs will fail again; the manual stop-gap is to stop dev.tuist.tart-kubelet on each builder for the duration of a dispatched bake.
How to test locally
cd infra/tart-kubelet && go build ./... && go vet ./... && go test ./...
cd infra/macos-host-bootstrap && go build ./... && go vet ./... && go test ./...
The new renderLaunchdPlist tests assert --disable-vm-gc is present when GHActionsRunner is set and absent otherwise.