
fix: stop xcresult-processor VM disk leak, restore its metrics, and surface guest DiskPressure

GitHub issue · Closed

Metadata
Source
tuist/tuist #10974
Updated
Jun 24, 2026
Domains
Compute
Details

Summary

Fixes from an incident where ProcessXcresultWorker jobs were failing with :enospc on the production macOS xcresult-processor VMs. They’re drafted together because they share one thread: a slow resource leak that ran unmonitored because the VMs’ observability had been silently broken since the Tailscale migration on 2026-05-18.

  1. fix(server) — stop the xcresult NIF from leaking attachment temp dirs (the actual disk filler).
  2. fix(infra) — restore the tart-kubelet metrics forwarder so the VMs’ Oban job metrics reach Prometheus again (why the leak was invisible on the dashboard).
  3. feat(infra) — make a full guest disk observable: publish a DiskPressure node condition and export the guest disk usage % as a Prometheus gauge to alert on, instead of it sitting at Unknown.

What happened

ProcessXcresultWorker runs only on the macOS xcresult-processor VMs (xcresulttool is Xcode-only). On the affected VM the guest APFS volume was at 100%: 130 GiB disk, ~58 GiB Xcode baseline, and ~47 GiB of leaked xcresult-attachments-* directories in the worker’s TMPDIR (14k+ dirs, one per job over ~8 days). Once the volume was full, downloads and extractions failed with :enospc; the one attempt that squeezed through then failed to parse a partially written bundle, and Oban discarded the job after 5 attempts. The run was persisted as failed_processing.

It stayed invisible because (a) every job-event Grafana panel showed No data — the VM metrics weren’t being scraped — and (b) the node’s DiskPressure condition was stuck Unknown and nothing exported the guest disk %, so neither the scheduler nor any alert could react. The host disk, meanwhile, had 270+ GiB free, so a host-level signal would never have fired.

Fix 1 — attachment temp-dir leak (server)

XCResultParser.attachmentsByTestIdentifiers exported attachments into a process-wide makeTemporaryDirectory(prefix: "xcresult-attachments") and never cleaned it up. It can’t use the auto-cleaning runInTemporaryDirectory block its sibling calls use, because the exported files are read by the Elixir worker for S3 upload after parse() returns.

The worker already creates a per-run scratch dir (root_dir) and removes it with File.rm_rf once processing finishes. The fix exports attachments into a subdirectory of that caller-provided rootDirectory, so the worker’s existing cleanup reclaims them. It falls back to a temp dir only when no root is provided.

Fix 2 — metrics forwarder dial scope (infra)

tart-kubelet’s host-side metrics reverse proxy returned 502 for every scrape of the VM’s PromEx endpoint. The VM serves the event metrics fine (curl 192.168.64.2:9091 → 200 with live process_xcresult series) and the host reaches the VM directly (curl from the host → 200, 6/6) — but the forwarder’s Go dialer got EHOSTUNREACH:

WARN metrics forwarder: upstream proxy error listen=100.66.118.128:9091
err="dial tcp 192.168.64.2:9091: connect: no route to host"

After the Tailscale migration the forwarder binds to the tailnet IP and the dialer issues an unscoped connect(). On macOS with scoped routing the VM’s vmnet route carries the IFSCOPE flag (bound to the bridge) while the host’s primary interface is the public WAN, so the unscoped connect resolves against the wrong scope and fails — even though the bridge is up and reachable.

The fix adds a Dialer.Control hook that pins the upstream socket to the interface owning the directly-connected route to the target (IP_BOUND_IF on darwin, no-op elsewhere via a build-tagged file). go mod tidy also promotes x/sys to direct and corrects a stale indirect marking on prometheus/client_golang.
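
The dial-scope fix above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `iface` type, `owningInterface`, `scopedDialer`, and the interface names and indexes are hypothetical, the option number 0x19 is assumed to be darwin's IP_BOUND_IF, and a runtime.GOOS check stands in for the build-tagged file so the sketch fits in one file (it assumes a Unix-like target, since it calls syscall.SetsockoptInt with an int file descriptor).

```go
package main

import (
	"fmt"
	"net"
	"runtime"
	"syscall"
)

// ipBoundIF is assumed to be darwin's IP_BOUND_IF option number (<netinet/in.h>).
const ipBoundIF = 0x19

// iface is a trimmed, hypothetical view of a network interface.
type iface struct {
	Name  string
	Index int
	CIDRs []string
}

// owningInterface returns the interface whose directly connected subnet
// contains target, preferring the longest matching prefix.
func owningInterface(ifaces []iface, target string) (iface, bool) {
	ip := net.ParseIP(target)
	if ip == nil {
		return iface{}, false
	}
	best, bestLen, found := iface{}, -1, false
	for _, ifc := range ifaces {
		for _, cidr := range ifc.CIDRs {
			_, subnet, err := net.ParseCIDR(cidr)
			if err != nil {
				continue // skip malformed entries
			}
			if ones, _ := subnet.Mask.Size(); subnet.Contains(ip) && ones > bestLen {
				best, bestLen, found = ifc, ones, true
			}
		}
	}
	return best, found
}

// scopedDialer returns a dialer whose Control hook pins IPv4 TCP sockets to
// the given interface index on darwin and does nothing on other systems.
func scopedDialer(ifIndex int) *net.Dialer {
	return &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			if runtime.GOOS != "darwin" || network != "tcp4" {
				return nil
			}
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_IP, ipBoundIF, ifIndex)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
}

func main() {
	// Hypothetical interface table: a public WAN interface and the vmnet bridge.
	ifaces := []iface{
		{Name: "en0", Index: 4, CIDRs: []string{"203.0.113.10/24"}},
		{Name: "bridge100", Index: 12, CIDRs: []string{"192.168.64.1/24"}},
	}
	ifc, ok := owningInterface(ifaces, "192.168.64.2")
	fmt.Println(ifc.Name, ifc.Index, ok)
	dialer := scopedDialer(ifc.Index)
	fmt.Println("control hook installed:", dialer.Control != nil)
}
```

Keeping the interface lookup as a pure function over a table makes it testable off-host, while the socket option itself can only be exercised on a real macOS machine.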

Fix 3 — make a full guest disk observable (infra)

tart-kubelet only posted NodeReady, leaving DiskPressure/MemoryPressure/PIDPressure at Unknown and exporting nothing about the guest volume. Two complementary signals are added, both driven by one tart exec ... df probe per running VM each heartbeat:

  • DiskPressure node condition — reported True when any guest root volume is at/above 90% (message names the offending VM), else False; seeded False at registration so it never sits at Unknown. Gives the scheduler a disk-pressure taint and is alertable via kube-state-metrics. Probes are time-bounded so an unresponsive guest agent can’t stall the heartbeat; per-VM errors are logged and skipped.
  • tart_kubelet_guest_disk_usage_percent{vm=...} gauge — the gradient signal behind the condition, registered on the controller-runtime registry that Alloy already scrapes via the tuist-macos-tart-kubelet job (:8080). Reset each sweep so a departed VM stops reporting stale capacity.
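
As a rough sketch of the parsing and condition logic described above: the code below assumes POSIX-format df output (as from `df -P -k /`), which the source does not confirm, and all names (`parseDfCapacity`, `evaluateDiskPressure`, `condition`) are hypothetical. The keep-prior behavior is modeled here as "no successful probes and at least one failure", which is one plausible reading of the tests listed under Validation.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// diskPressureThreshold mirrors the 90% threshold named in the description.
const diskPressureThreshold = 90

// parseDfCapacity extracts the Capacity column (5th field) from the last row
// of POSIX-format df output. The output format is an assumption.
func parseDfCapacity(out string) (int, error) {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	if len(lines) < 2 {
		return 0, fmt.Errorf("df output has no data row")
	}
	row := lines[len(lines)-1]
	fields := strings.Fields(row)
	if len(fields) < 6 {
		return 0, fmt.Errorf("unexpected df row: %q", row)
	}
	pct, err := strconv.Atoi(strings.TrimSuffix(fields[4], "%"))
	if err != nil {
		return 0, fmt.Errorf("parse capacity %q: %w", fields[4], err)
	}
	return pct, nil
}

// condition is a simplified stand-in for a Kubernetes node condition.
type condition struct{ Status, Message string }

// evaluateDiskPressure reports True when any probed guest is at or above the
// threshold, False when all are below it, and keeps the prior condition when
// every probe failed (so a blind sweep does not flip the status).
func evaluateDiskPressure(usage map[string]int, failedProbes int, prior condition) condition {
	if len(usage) == 0 && failedProbes > 0 {
		return prior
	}
	var full []string
	for vm, pct := range usage {
		if pct >= diskPressureThreshold {
			full = append(full, fmt.Sprintf("%s at %d%%", vm, pct))
		}
	}
	if len(full) == 0 {
		return condition{Status: "False", Message: "guest disks below threshold"}
	}
	sort.Strings(full) // deterministic message regardless of map order
	return condition{Status: "True", Message: "guest disk pressure: " + strings.Join(full, ", ")}
}

func main() {
	// Hypothetical sample output; the device name and sizes are illustrative.
	sample := "Filesystem     1024-blocks      Used Available Capacity Mounted on\n" +
		"/dev/disk3s1s1   136314880 136314880         0     100% /\n"
	pct, err := parseDfCapacity(sample)
	fmt.Println(pct, err)
	fmt.Println(evaluateDiskPressure(map[string]int{"vm-a": pct}, 0, condition{Status: "False"}))
}
```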

Measuring the guest volume (not the host) is deliberate — the host stays near-empty even when a guest is full. (MemoryPressure/PIDPressure remain unreported — a separate gap, not the incident’s failure mode.)
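
The time-bounded, skip-on-error sweep mentioned in the list above might look something like this. It is a sketch under assumptions: `probeFunc`, `sweep`, `fakeProbe`, and `demoSweep` are hypothetical names, and the probe is injected as a function because the exact tart exec invocation is not shown in this description. Returning a fresh map on every sweep is also what lets a departed VM drop out of the reported set.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

// probeFunc returns a guest's disk usage percent. A real implementation would
// shell out to the guest; the exact command is not given in the description.
type probeFunc func(ctx context.Context, vm string) (int, error)

// sweep probes each VM with its own deadline so one unresponsive guest cannot
// stall the whole heartbeat. Failed probes are logged, counted, and skipped.
func sweep(ctx context.Context, vms []string, timeout time.Duration, probe probeFunc) (map[string]int, int) {
	usage := make(map[string]int, len(vms))
	failed := 0
	for _, vm := range vms {
		pctx, cancel := context.WithTimeout(ctx, timeout)
		pct, err := probe(pctx, vm)
		cancel() // release the timer as soon as the probe returns
		if err != nil {
			log.Printf("guest disk probe failed for %s: %v", vm, err)
			failed++
			continue
		}
		usage[vm] = pct
	}
	return usage, failed
}

// fakeProbe simulates one healthy guest and one guest whose agent never answers.
func fakeProbe(ctx context.Context, vm string) (int, error) {
	if vm == "stuck-vm" {
		<-ctx.Done() // block until the per-probe deadline fires
		return 0, ctx.Err()
	}
	return 42, nil
}

// demoSweep runs one sweep over the two simulated guests.
func demoSweep() (map[string]int, int) {
	return sweep(context.Background(), []string{"healthy-vm", "stuck-vm"}, 50*time.Millisecond, fakeProbe)
}

func main() {
	usage, failed := demoSweep()
	fmt.Println(usage, failed)
}
```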

Alerting (Grafana-managed — wire these up, not in this repo)

Alerts here are managed in Grafana Cloud, not as code, so this PR ships the metric; the alert is created in Grafana. Recommended rules on the new gauge (this is what would have caught the 8-day climb long before ENOSPC):

  • Warn (notify): max by (instance, vm) (tart_kubelet_guest_disk_usage_percent) >= 80 for 10m
  • Page (matches the condition threshold): max by (instance, vm) (tart_kubelet_guest_disk_usage_percent) >= 90 for 5m
  • Probe/target down (so a blind spot is itself an alert): up{job="tuist-macos-tart-kubelet"} == 0 for 10m
  • Optional backstop on the condition: kube_node_status_condition{condition="DiskPressure",status="true"} == 1

This PR also adds a Guest Disk Usage panel (tart_kubelet_guest_disk_usage_percent, threshold lines at 80/90%) to the “Tuist xcresult Processor” dashboard JSON (infra/grafana-dashboards/xcresult-processor.json), which is synced as code.

Validation

  • swift test (xcresult_nif): 6/6 pass, incl. a new test asserting attachments land under the provided root.
  • go test ./... (tart-kubelet): pass, incl. new tests for the dial interface-resolution helper, the df capacity parser, the DiskPressure condition logic (True/False/keep-prior-on-error/nil-noop), and the guest-disk gauge (set + reset-drops-stale-series).
  • tart-kubelet builds for darwin/arm64 and linux/amd64; go vet/gofmt/swiftformat --lint clean.
  • Incident mitigation already applied out-of-band: leaked dirs cleared on both production VMs to restore service.

⚠️ The forwarder fix, the guest-df probe, and the gauge can only be fully confirmed on a Mac mini host (real macOS scoped routing / the Tart guest agent). Post-deploy checks: curl <node-tailnet-ip>:9091/metrics returns 200 (not 502); kubectl describe node shows a real DiskPressure status; tart_kubelet_guest_disk_usage_percent appears in Grafana.

Deliberately deferred

kubectl logs/exec against the Tart VM nodes — not implemented (options outlined for a future decision)

This is not a config fix; it’s three independent, stacked failures:

  1. The apiserver flag at infra/k8s/clusters/clusterclass-tuist.yaml:425 is ExternalIP,Hostname,InternalDNS,ExternalDNS — Tart nodes have no ExternalIP, so it falls to an unresolvable Hostname (the no such host error).
  2. The nodes’ only routable address is their Tailscale CGNAT IP, which the control-plane netns can’t reach (those routes live only on the egress/connector proxies).
  3. tart-kubelet implements no kubelet streaming server on :10250, so there’s nowhere for /exec or /containerLogs requests to land even with DNS and routing solved.

Native kubectl logs/exec would require all three layers (none is independently useful), ~10–13d total:

  • Streaming server on :10250 (~5–8d) — serve /containerLogs (plain HTTP stream) + /exec (SPDY/WebSocket upgrade, stdio channel demux). TLS serving cert via the CSR API, validate the apiserver’s client cert against the cluster CA, SAR webhook for authz — the model the Linux kubelet config already specifies at clusterclass-tuist.yaml:471-489, and tart-kubelet already holds a cluster-CA kubeconfig to bootstrap it. Back exec with tart exec -i -t, logs with the guest service log file; k8s.io/kubelet/.../cri/streaming covers most of exec. Hardest part is the SPDY/WS channel demux, not the tart exec glue.
  • Reachability (~2–4d) — the control plane isn’t on the tailnet (only the egress ProxyGroup is). Recommended: join CP nodes to the tailnet with a scoped route-accept + an ACL grant to minis:10250.
  • Flag (<1d) — change the global arg to ExternalIP,InternalIP,Hostname,… so Linux nodes still match ExternalIP first (unchanged) and Tart nodes fall through to their now-routable InternalIP. Triggers a control-plane rollout; validate on canary.
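
To illustrate why the flag order matters, here is a small model of first-match address selection. It is a simplified sketch rather than the apiserver's implementation: `nodeAddress` and `preferredAddress` are hypothetical, the hostname and the Linux ExternalIP are made-up examples, and it assumes a Tart node reports its tailnet IP as InternalIP. Only the three address types named in the proposed order are used, since the rest of that list is elided above.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeAddress is a simplified stand-in for a node status address entry.
type nodeAddress struct{ Type, Address string }

// preferredAddress walks the comma-separated preference list in order and
// returns the first node address whose type matches.
func preferredAddress(preference string, addrs []nodeAddress) (string, bool) {
	for _, want := range strings.Split(preference, ",") {
		want = strings.TrimSpace(want)
		for _, a := range addrs {
			if a.Type == want {
				return a.Address, true
			}
		}
	}
	return "", false
}

func main() {
	// Hypothetical nodes: a Tart node without ExternalIP and a Linux node with one.
	tart := []nodeAddress{
		{Type: "InternalIP", Address: "100.66.118.128"},
		{Type: "Hostname", Address: "mac-mini-1"},
	}
	linux := []nodeAddress{
		{Type: "ExternalIP", Address: "203.0.113.7"},
		{Type: "InternalIP", Address: "10.0.0.7"},
	}

	current := "ExternalIP,Hostname,InternalDNS,ExternalDNS"
	proposed := "ExternalIP,InternalIP,Hostname"

	fmt.Println(preferredAddress(current, tart))   // falls to the Hostname entry
	fmt.Println(preferredAddress(proposed, tart))  // falls to the InternalIP entry
	fmt.Println(preferredAddress(proposed, linux)) // still matches ExternalIP first
}
```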

Cheaper alternatives covering ~90% of the value in ~3d (preferred unless tooling requires the standard kubectl UX):

  • Ship the guest service logs to Loki via the existing observability/egress path — operators read logs in Grafana and kubectl logs becomes unnecessary. Best value/effort.
  • A tart-kubelet exec <pod> host subcommand over the existing tailnet SSH path (tag:tuist-ops → mini:22) for occasional interactive debugging — no apiserver/control-plane changes.

Until one of these is built, operate these nodes via host SSH + tart exec (as done in this incident).

Other

  • The production Helm pipeline is stuck (prod is on xcresult-processor 0.6.2; the chart has been at 0.9.0 since 2026-05-20), so none of these fixes reach prod until it is unblocked.

🤖 Generated with Claude Code

Comments
tuist-atlas[bot] May 29, 2026

The fixes for stopping the xcresult-processor VM disk leak, restoring the metrics forwarder, and surfacing guest DiskPressure from this pull request are now available in xcresult-processor-image@0.9.1. Please update to this version to receive these improvements.

tuist-atlas[bot] May 29, 2026

The changes from this pull request are now available in capi-scaleway@0.6.2. Update to this version to get the fix for the xcresult-processor VM disk leak, restored metrics, and the new guest DiskPressure surfacing.