fix: stop xcresult-processor VM disk leak, restore its metrics, and surface guest DiskPressure
GitHub issue · Closed
Summary
Fixes from an incident where ProcessXcresultWorker jobs were failing with :enospc on the production macOS xcresult-processor VMs. They’re drafted together because they share one thread: a slow resource leak that ran unmonitored because the VMs’ observability had been silently broken since the Tailscale migration on 2026-05-18.
- fix(server): stop the xcresult NIF from leaking attachment temp dirs (the actual disk filler).
- fix(infra): restore the tart-kubelet metrics forwarder so the VMs’ Oban job metrics reach Prometheus again (why the leak was invisible on the dashboard).
- feat(infra): make a full guest disk observable by publishing a DiskPressure node condition and exporting the guest disk usage % as a Prometheus gauge to alert on, instead of it sitting at Unknown.
What happened
ProcessXcresultWorker runs only on the macOS xcresult-processor VMs (xcresulttool is Xcode-only). On the affected VM the guest APFS volume was at 100% — 130 GiB disk, ~58 GiB Xcode baseline, and ~47 GiB of leaked xcresult-attachments-* directories in the worker’s TMPDIR (14k+ dirs, one per job over ~8 days). Once full, downloads/extractions failed with :enospc; the one attempt that squeezed through then failed to parse a partially-written bundle, and Oban discarded after 5 attempts. The run was persisted as failed_processing.
It stayed invisible because (a) every job-event Grafana panel showed No data — the VM metrics weren’t being scraped — and (b) the node’s DiskPressure condition was stuck Unknown and nothing exported the guest disk %, so neither the scheduler nor any alert could react. The host disk, meanwhile, had 270+ GiB free, so a host-level signal would never have fired.
Fix 1 — attachment temp-dir leak (server)
XCResultParser.attachmentsByTestIdentifiers exported attachments into a process-wide makeTemporaryDirectory(prefix: "xcresult-attachments") and never cleaned it up. It can’t use the auto-cleaning runInTemporaryDirectory block its sibling calls use, because the exported files are read by the Elixir worker for S3 upload after parse() returns.
The worker already creates a per-run scratch dir (root_dir) and removes it with File.rm_rf once processing finishes. The fix exports attachments into a subdirectory of that caller-provided rootDirectory, so the worker’s existing cleanup reclaims them. It falls back to a temp dir only when no root is provided.
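The ownership rule can be sketched as follows. The real change lives in the Swift NIF; this Go version only illustrates the pattern, and the function name attachmentsDir and the "attachments" subdirectory name are hypothetical.

```go
package main

import (
	"os"
	"path/filepath"
)

// attachmentsDir returns the directory attachments should be exported into,
// plus whether the caller's own cleanup will reclaim it.
//
// With a non-empty rootDir the attachments go into a subdirectory of it, so
// whoever removes rootDir also removes the attachments. Only when no root is
// given does it fall back to a fresh temp dir, which nothing removes
// automatically.
func attachmentsDir(rootDir string) (dir string, callerOwned bool, err error) {
	if rootDir != "" {
		// Hypothetical subdirectory name, chosen for illustration.
		dir = filepath.Join(rootDir, "attachments")
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return "", false, err
		}
		return dir, true, nil
	}
	// Fallback path: the caller is responsible for removing this directory.
	dir, err = os.MkdirTemp("", "xcresult-attachments-*")
	if err != nil {
		return "", false, err
	}
	return dir, false, nil
}
```

The point of the design is that the export path never outlives the directory the worker already deletes, so no new cleanup step is needed on the happy path.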
Fix 2 — metrics forwarder dial scope (infra)
tart-kubelet’s host-side metrics reverse proxy returned 502 for every scrape of the VM’s PromEx endpoint. The VM serves the event metrics fine (curl 192.168.64.2:9091 → 200 with live process_xcresult series) and the host reaches the VM directly (curl from the host → 200, 6/6) — but the forwarder’s Go dialer got EHOSTUNREACH:
WARN metrics forwarder: upstream proxy error listen=100.66.118.128:9091
err="dial tcp 192.168.64.2:9091: connect: no route to host"
After the Tailscale migration the forwarder binds to the tailnet IP and the dialer issues an unscoped connect(). On macOS with scoped routing the VM’s vmnet route carries the IFSCOPE flag (bound to the bridge) while the host’s primary interface is the public WAN, so the unscoped connect resolves against the wrong scope and fails — even though the bridge is up and reachable.
The fix adds a Dialer.Control hook that pins the upstream socket to the interface owning the directly-connected route to the target (IP_BOUND_IF on darwin, no-op elsewhere via a build-tagged file). go mod tidy also promotes x/sys to direct and corrects a stale indirect marking on prometheus/client_golang.
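A minimal sketch of what such a hook could look like, not the code in this PR. It assumes the interface is chosen by matching the dial target against each interface's subnet, uses a runtime.GOOS check in place of the build-tagged file, and hardcodes the darwin IP_BOUND_IF value (25) instead of importing x/sys so the block stays self-contained on macOS and Linux. The names ifaceNet, parseCandidate, pickInterface, localCandidates, and scopedDialer are hypothetical.

```go
package main

import (
	"net"
	"runtime"
	"syscall"
)

// ipBoundIf is the darwin IP_BOUND_IF socket option from netinet/in.h,
// spelled out so the sketch also compiles on Linux.
const ipBoundIf = 25

// ifaceNet pairs an interface index with one directly-connected subnet.
type ifaceNet struct {
	Index  int
	Subnet *net.IPNet
}

// parseCandidate builds an ifaceNet from an interface index and a CIDR string.
func parseCandidate(index int, cidr string) (ifaceNet, error) {
	_, subnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return ifaceNet{}, err
	}
	return ifaceNet{Index: index, Subnet: subnet}, nil
}

// pickInterface returns the index of the interface whose subnet contains
// host, preferring the longest prefix. It returns 0 when nothing matches.
func pickInterface(cands []ifaceNet, host string) int {
	ip := net.ParseIP(host)
	if ip == nil {
		return 0
	}
	best, bestOnes := 0, -1
	for _, c := range cands {
		if c.Subnet == nil || !c.Subnet.Contains(ip) {
			continue
		}
		if ones, _ := c.Subnet.Mask.Size(); ones > bestOnes {
			best, bestOnes = c.Index, ones
		}
	}
	return best
}

// localCandidates lists the subnets configured on all local interfaces.
func localCandidates() ([]ifaceNet, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	var out []ifaceNet
	for _, ifc := range ifaces {
		addrs, err := ifc.Addrs()
		if err != nil {
			continue // skip interfaces we cannot inspect
		}
		for _, a := range addrs {
			if ipn, ok := a.(*net.IPNet); ok {
				out = append(out, ifaceNet{Index: ifc.Index, Subnet: ipn})
			}
		}
	}
	return out, nil
}

// scopedDialer returns a dialer whose IPv4 sockets are bound, on darwin only,
// to the interface that is directly connected to the dial target.
func scopedDialer() *net.Dialer {
	return &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			if runtime.GOOS != "darwin" || network != "tcp4" {
				return nil // no-op elsewhere
			}
			host, _, err := net.SplitHostPort(address)
			if err != nil {
				return err
			}
			cands, err := localCandidates()
			if err != nil {
				return err
			}
			idx := pickInterface(cands, host)
			if idx == 0 {
				return nil // no directly-connected subnet; leave the socket unscoped
			}
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_IP, ipBoundIf, idx)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
}
```

A reverse proxy would use it by setting its transport's DialContext to scopedDialer().DialContext. Keeping pickInterface free of system calls is what makes the interface-resolution step unit-testable off a Mac.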
Fix 3 — make a full guest disk observable (infra)
tart-kubelet only posted NodeReady, leaving DiskPressure/MemoryPressure/PIDPressure at Unknown and exporting nothing about the guest volume. Two complementary signals are added, both driven by one tart exec ... df probe per running VM each heartbeat:
- DiskPressure node condition: reported True when any guest root volume is at/above 90% (message names the offending VM), else False; seeded False at registration so it never sits at Unknown. Gives the scheduler a disk-pressure taint and is alertable via kube-state-metrics. Probes are time-bounded so an unresponsive guest agent can’t stall the heartbeat; per-VM errors are logged and skipped.
- tart_kubelet_guest_disk_usage_percent{vm=...} gauge: the gradient signal behind the condition, registered on the controller-runtime registry that Alloy already scrapes via the tuist-macos-tart-kubelet job (:8080). Reset each sweep so a departed VM stops reporting stale capacity.
Measuring the guest volume (not the host) is deliberate — the host stays near-empty even when a guest is full. (MemoryPressure/PIDPressure remain unreported — a separate gap, not the incident’s failure mode.)
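The probe-to-condition step might look roughly like the sketch below. It assumes the probe output is POSIX-format df (as from df -P -k /), which the description does not spell out, and it computes the percentage from the Used and Available columns. The names parseDfUsage and diskPressure and the message wording are hypothetical; only the 90% threshold comes from the text above.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// diskPressureThreshold is the usage percentage at which the condition flips
// to True, per the description above.
const diskPressureThreshold = 90.0

// parseDfUsage extracts the usage percentage of the first data row from
// POSIX-format df output, whose columns are:
//
//	Filesystem 1024-blocks Used Available Capacity Mounted on
//
// It assumes the filesystem name contains no spaces.
func parseDfUsage(out string) (float64, error) {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	if len(lines) < 2 {
		return 0, errors.New("df: no data row")
	}
	f := strings.Fields(lines[1])
	if len(f) < 6 {
		return 0, fmt.Errorf("df: unexpected row %q", lines[1])
	}
	used, err := strconv.ParseFloat(f[2], 64)
	if err != nil {
		return 0, fmt.Errorf("df: used column: %w", err)
	}
	avail, err := strconv.ParseFloat(f[3], 64)
	if err != nil {
		return 0, fmt.Errorf("df: available column: %w", err)
	}
	if used+avail <= 0 {
		return 0, errors.New("df: zero capacity")
	}
	return 100 * used / (used + avail), nil
}

// diskPressure reports whether any VM is at or above the threshold and, if
// so, returns a message naming the first offender in sorted order.
func diskPressure(usage map[string]float64) (bool, string) {
	names := make([]string, 0, len(usage))
	for name := range usage {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic message when several VMs are full
	for _, name := range names {
		if pct := usage[name]; pct >= diskPressureThreshold {
			return true, fmt.Sprintf("guest disk on %s at %.0f%%", name, pct)
		}
	}
	return false, ""
}
```

A heartbeat loop would call parseDfUsage once per running VM, skip VMs whose probe failed, feed the surviving values to diskPressure, and set the same values on the gauge.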
Alerting (Grafana-managed — wire these up, not in this repo)
Alerts here are managed in Grafana Cloud, not as code, so this PR ships the metric; the alert is created in Grafana. Recommended rules on the new gauge (this is what would have caught the 8-day climb long before ENOSPC):
- Warn (notify): max by (instance, vm) (tart_kubelet_guest_disk_usage_percent) >= 80 for 10m
- Page (matches the condition threshold): max by (instance, vm) (tart_kubelet_guest_disk_usage_percent) >= 90 for 5m
- Probe/target down (so a blind spot is itself an alert): up{job="tuist-macos-tart-kubelet"} == 0 for 10m
- Optional backstop on the condition: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
This PR also adds a Guest Disk Usage panel (tart_kubelet_guest_disk_usage_percent, threshold lines at 80/90%) to the “Tuist xcresult Processor” dashboard JSON (infra/grafana-dashboards/xcresult-processor.json), which is synced as code.
Validation
- swift test (xcresult_nif): 6/6 pass, incl. a new test asserting attachments land under the provided root.
- go test ./... (tart-kubelet): pass, incl. new tests for the dial interface-resolution helper, the df capacity parser, the DiskPressure condition logic (True/False/keep-prior-on-error/nil-noop), and the guest-disk gauge (set + reset-drops-stale-series).
- tart-kubelet builds for darwin/arm64 and linux/amd64; go vet / gofmt / swiftformat --lint clean.
- Incident mitigation already applied out-of-band: leaked dirs cleared on both production VMs to restore service.
⚠️ The forwarder fix, the guest-df probe, and the gauge can only be fully confirmed on a Mac mini host (real macOS scoped routing / the Tart guest agent). Post-deploy checks: curl <node-tailnet-ip>:9091/metrics returns 200 (not 502); kubectl describe node shows a real DiskPressure status; tart_kubelet_guest_disk_usage_percent appears in Grafana.
Deliberately deferred
kubectl logs/exec against the Tart VM nodes — not implemented (options outlined for a future decision)
This is not a config fix; it’s three independent, stacked failures:
- The apiserver flag at infra/k8s/clusters/clusterclass-tuist.yaml:425 is ExternalIP,Hostname,InternalDNS,ExternalDNS. Tart nodes have no ExternalIP, so it falls to an unresolvable Hostname (the no such host error).
- The nodes’ only routable address is their Tailscale CGNAT IP, which the control-plane netns can’t reach (those routes live only on the egress/connector proxies).
- tart-kubelet implements no kubelet streaming server on :10250, so there’s nowhere for /exec and /containerLogs to land even with DNS + routing solved.
Native kubectl logs/exec would require all three layers (none is independently useful), ~10–13d total:
- Streaming server on :10250 (~5–8d): serve /containerLogs (plain HTTP stream) + /exec (SPDY/WebSocket upgrade, stdio channel demux). TLS serving cert via the CSR API, validate the apiserver’s client cert against the cluster CA, SAR webhook for authz. This is the model the Linux kubelet config already specifies at clusterclass-tuist.yaml:471-489, and tart-kubelet already holds a cluster-CA kubeconfig to bootstrap it. Back exec with tart exec -i -t, logs with the guest service log file; k8s.io/kubelet/.../cri/streaming covers most of exec. Hardest part is the SPDY/WS channel demux, not the tart exec glue.
- Reachability (~2–4d): the control plane isn’t on the tailnet (only the egress ProxyGroup is). Recommended: join CP nodes to the tailnet with a scoped route-accept + an ACL grant to minis:10250.
- Flag (<1d): change the global arg to ExternalIP,InternalIP,Hostname,… so Linux nodes still match ExternalIP first (unchanged) and Tart nodes fall through to their now-routable InternalIP. Triggers a control-plane rollout; validate on canary.
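To make the flag layer concrete, here is a simplified model of preference-ordered address selection: walk the preference list and return the first node address of a matching type. It is a sketch of the behaviour, not the apiserver's code. The type nodeAddress, the function preferredAddress, and the hostnames and IPs in the usage note are made up for illustration; the proposed list is truncated to the three entries named above because the rest is elided in the source.

```go
package main

// nodeAddress mirrors the shape of a Kubernetes node address entry.
type nodeAddress struct {
	Type    string // e.g. "ExternalIP", "InternalIP", "Hostname"
	Address string
}

// preferredAddress returns the first address whose type appears earliest in
// prefs. The order of prefs wins over the order of addrs.
func preferredAddress(addrs []nodeAddress, prefs []string) (string, bool) {
	for _, want := range prefs {
		for _, a := range addrs {
			if a.Type == want {
				return a.Address, true
			}
		}
	}
	return "", false
}

// currentPrefs is the order quoted from clusterclass-tuist.yaml above.
var currentPrefs = []string{"ExternalIP", "Hostname", "InternalDNS", "ExternalDNS"}

// proposedPrefs holds only the entries the proposal names explicitly.
var proposedPrefs = []string{"ExternalIP", "InternalIP", "Hostname"}
```

For a hypothetical Tart node reporting InternalIP 100.66.118.128 and Hostname mac-mini-1, currentPrefs selects mac-mini-1 while proposedPrefs selects 100.66.118.128. A Linux node with an ExternalIP resolves to that ExternalIP under both lists, which is why the change is expected to leave Linux nodes unaffected.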
Cheaper alternatives covering ~90% of the value in ~3d (preferred unless tooling requires the standard kubectl UX):
- Ship the guest service logs to Loki via the existing observability/egress path: operators read logs in Grafana and kubectl logs becomes unnecessary. Best value/effort.
- A tart-kubelet exec <pod> host subcommand over the existing tailnet SSH path (tag:tuist-ops → mini:22) for occasional interactive debugging, with no apiserver/control-plane changes.
Until one of these is built, operate these nodes via host SSH + tart exec (as done in this incident).
Other
- The stuck production helm pipeline (prod is on xcresult-processor 0.6.2; chart is at 0.9.0 since 2026-05-20): none of these fixes reach prod until that’s unblocked.
🤖 Generated with Claude Code