
feat(infra): reclaim stranded Scaleway Apple-silicon hosts back to the pool

GitHub issue · Closed

Metadata
Source: tuist/tuist #11056
Updated: Jun 24, 2026
Domains: Compute

Details

What changed

Adds a convergent backstop and a source-level fix so Scaleway Apple-silicon Mac minis can no longer strand outside the adopt pool.

  • OrphanReclaimer (controllers/orphan_reclaimer.go): a leader-gated periodic sweep. Each cycle it lists Scaleway hosts per zone (the zones from --orphan-reclaim-zones unioned with every live CR’s zone), diffs them against the live ScalewayAppleSiliconMachine CRs, and treats a host as stranded when it is not in the pool, not mid-adoption, and matches no live CR by name or status.serverID. Stranded hosts are exported on a scaleway_orphan_servers gauge and, when active reclaim is enabled, returned to the pool via the existing ReleaseToPool (rename + reinstall). Before each release it re-checks ownership against the live API (uncached APIReader) so a host adopted since the scan is never reclaimed.
  • Controller-level DefaultAdoptPoolPrefix: reconcileDelete now falls back to it when a CR’s Spec.AdoptPoolPrefix is empty, so legacy CRs release on delete instead of skipping. This closes the leak at the source for new deletions.
  • Chart wiring is a single macosFleet.orphanReclaim.activeReclaim toggle (default false = report-only). Everything else is derived from existing values: the pool prefix from macosFleet.adoptPoolPrefix, the swept zones from the fleets’ machine.zone, and the claimed-name prefix from the chart name (tuist-tuist-). The binary keeps its flags as a generic provider, but the chart computes them rather than duplicating config.
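The stranded-host test described in the first bullet can be sketched as a pure function. This is a minimal illustration, not the code in controllers/orphan_reclaimer.go: the Server and MachineCR types, the Adopting field, and the findStranded name are all hypothetical stand-ins for the provider's real types and its pool/adoption checks.

```go
package main

import (
	"fmt"
	"strings"
)

// Server is a hypothetical, trimmed-down view of a Scaleway host.
type Server struct {
	ID       string
	Name     string
	Adopting bool // assumption: stands in for the provider's mid-adoption check
}

// MachineCR is a hypothetical stand-in for a live ScalewayAppleSiliconMachine.
type MachineCR struct {
	Name     string
	ServerID string // mirrors status.serverID; may be empty before it is written
}

// findStranded returns the hosts that are not in the pool, not mid-adoption,
// and match no live CR by name or by server ID.
func findStranded(servers []Server, crs []MachineCR, poolPrefix string) []Server {
	ownedNames := make(map[string]bool, len(crs))
	ownedIDs := make(map[string]bool, len(crs))
	for _, cr := range crs {
		ownedNames[cr.Name] = true
		if cr.ServerID != "" {
			ownedIDs[cr.ServerID] = true
		}
	}

	var stranded []Server
	for _, s := range servers {
		switch {
		case strings.HasPrefix(s.Name, poolPrefix):
			// Still in the pool: nothing to reclaim.
		case s.Adopting:
			// A claim is in flight: leave it alone.
		case ownedNames[s.Name], ownedIDs[s.ID]:
			// Owned by a live CR, matched by name or by server ID.
		default:
			stranded = append(stranded, s)
		}
	}
	return stranded
}

func main() {
	servers := []Server{
		{ID: "a", Name: "tuist-pool-1"},
		{ID: "b", Name: "tuist-tuist-builders-fleet-x", Adopting: true},
		{ID: "c", Name: "tuist-tuist-builders-fleet-y"},
		{ID: "d", Name: "renamed-by-hand"},
		{ID: "e", Name: "tuist-tuist-builders-fleet-z"},
	}
	crs := []MachineCR{
		{Name: "tuist-tuist-builders-fleet-y"},
		{Name: "other", ServerID: "d"},
	}
	for _, s := range findStranded(servers, crs, "tuist-pool-") {
		fmt.Println("stranded:", s.ID, s.Name)
	}
}
```

In this made-up fleet only host e is reported: a is in the pool, b is mid-adoption, c is owned by name, and d is owned by server ID.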

Why

The provider only releases a claimed host on the per-Machine reconcileDelete path. A host strands whenever that path doesn’t run to completion:

  1. A legacy CR with an empty AdoptPoolPrefix (created before the field was required) hitting the documented skip branch: no release, just a ReleaseSkipped warning event.
  2. A force-delete or owner-reference GC that bypasses the finalizer.
  3. A crash between claiming a pool host and writing its CR.
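The fix for the first cause is a fallback when choosing the release prefix. A minimal sketch of that decision follows; effectivePoolPrefix is a hypothetical helper name, and the real reconcileDelete may structure this differently.

```go
package main

import "fmt"

// effectivePoolPrefix picks the prefix used to release a host on delete.
// It prefers the CR's own Spec.AdoptPoolPrefix and falls back to the
// controller-level default. The bool reports whether a release can proceed;
// false corresponds to the skip branch that legacy CRs used to hit.
func effectivePoolPrefix(specPrefix, controllerDefault string) (string, bool) {
	if specPrefix != "" {
		return specPrefix, true
	}
	if controllerDefault != "" {
		return controllerDefault, true
	}
	return "", false
}

func main() {
	cases := []struct{ spec, def string }{
		{"custom-pool-", "tuist-pool-"}, // CR sets its own prefix
		{"", "tuist-pool-"},             // legacy CR, controller default set
		{"", ""},                        // legacy CR, no default: still skipped
	}
	for _, tc := range cases {
		prefix, ok := effectivePoolPrefix(tc.spec, tc.def)
		fmt.Printf("spec=%q default=%q -> prefix=%q release=%v\n", tc.spec, tc.def, prefix, ok)
	}
}
```

The third case is kept on purpose: without a controller default the skip behavior is unchanged.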

A stranded host carries a CR-style Scaleway name but is owned by no CR, so nothing ever looks at it again. It keeps billing under Apple’s 24h floor, and because the claim drained it from the tuist-pool-* pool, it silently shrinks the capacity AdoptFromPool draws from. That depletion later surfaced as NoAvailableHost: a builders-fleet MachineDeployment stuck in ScalingUp, which a server deploy’s helm --wait (kstatus watcher) was gated on, so the deploy wedged until it timed out.

A confirmed instance: tuist-tuist-builders-fleet-rg4h9-c295c (an M2-L that had been billing for 14 days with no CR), almost certainly leaked via the legacy-CR skip path.

Why this design over the obvious alternatives

  • Sweep, not just fix the delete path. Fixing reconcileDelete (the DefaultAdoptPoolPrefix change) only prevents future strands from the legacy-CR cause, and can’t touch what already leaked or what bypasses the finalizer entirely. A convergent sweep catches every strand cause regardless of how it happened, and self-heals existing leaks. Both are included; they’re complementary.
  • Name as the owned key. The claim renames a pool host to its CR’s name (AdoptFromPool(ctx, machine.Name, ...)), and the CR always exists before that rename. So a live CR name is an authoritative “owned” signal with no claimed-but-CR-missing window short of an actual delete. serverID is matched too, as a backstop.
  • Report-only by default; active reclaim gated. Two kinds of host must never be reclaimed even though they look unowned: one that an operator is mid-provisioning under a Scaleway-default name (e.g. apple-silicon-romantic-* before the rename to tuist-pool-*), and one owned by another cluster that shares the Scaleway project (its CR isn’t visible to this cluster’s API server). The claim-name prefix (tuist-tuist-) scopes active reclaim to this cluster’s claimed namespace; with it unset the sweep only reports. Per the provider’s AGENTS.md, each environment already runs its own Scaleway IAM application/project, so the sweep is project-scoped and the cross-env case is moot in practice, but the gate is kept as defense in depth.
  • Derive chart config, don’t duplicate it. The pool prefix, zones, and claimed-name prefix already live in the chart; the template computes the flags from them so there’s a single source of truth and no drift. The only chart knob is the one thing the chart can’t safely decide on its own: whether the sweep may mutate.
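The claim-name gate from the report-only bullet reduces to a small decision per stranded host. The sketch below is illustrative only: the Action type and decide function are hypothetical, and it assumes strays outside the claimed namespace are still reported rather than ignored, which the description does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// Action is what the sweep may do with a host already classified as stranded.
type Action string

const (
	ActionReport  Action = "report"  // count it, mutate nothing
	ActionReclaim Action = "reclaim" // eligible for release back to the pool
)

// decide applies the claim-name gate. An empty prefix means report-only mode.
// A non-empty prefix limits reclaim to hosts inside this cluster's claimed
// namespace, so operator-provisioned or foreign-cluster hosts are never touched.
func decide(hostName, claimNamePrefix string) Action {
	if claimNamePrefix == "" {
		return ActionReport
	}
	if strings.HasPrefix(hostName, claimNamePrefix) {
		return ActionReclaim
	}
	// Assumption: out-of-namespace strays are reported, not reclaimed.
	return ActionReport
}

func main() {
	hosts := []string{
		"tuist-tuist-builders-fleet-rg4h9-c295c",
		"apple-silicon-romantic-example", // hypothetical Scaleway-default name
	}
	for _, prefix := range []string{"", "tuist-tuist-"} {
		for _, h := range hosts {
			fmt.Printf("prefix=%q host=%s -> %s\n", prefix, h, decide(h, prefix))
		}
	}
}
```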

Concurrency safety

  • The pre-mutation re-check (uncached APIReader) closes a TOCTOU where a CR created and adopted after the cycle’s initial snapshot could otherwise have its live host reclaimed.
  • ReleaseToPool now holds the same adoptMu as AdoptFromPool across its rename + reinstall, so an adoption scan can’t claim a host in the window where it carries the pool prefix but the reinstall hasn’t been requested. (This also hardens the pre-existing delete path; the sweep just made the race more reachable.)

User / operator impact

  • Stranded hosts stop being invisible: the scaleway_orphan_servers gauge surfaces them, and a sustained non-zero value is the signal that the pool is leaking (worth an alert).
  • With activeReclaim on, strays return to the pool automatically (rename + reinstall), so the pool stops silently draining, which removes one recurring trigger of wedged deploys.
  • Default behavior is conservative: detection and the delete-path fallback are on; active reclaim is opted into per-env.

Validation

  • go build ./..., go vet ./..., and go test -count=1 -race ./... clean across the provider module.
  • New tests: OrphanReclaimer reclaims exactly the stranded in-namespace host while leaving pool, claim-pending, CR-owned-by-name, and CR-owned-by-serverID hosts untouched; a host adopted since the scan is not reclaimed (the re-check); report-only mode mutates nothing but still counts strays on the gauge; ListServers and IsPoolOrAdopting unit tests; reconcileDelete releases a legacy CR via the controller default (and the existing skip-when-no-default test still passes).
  • helm template with the production values renders --default-adopt-pool-prefix=tuist-pool- and --orphan-reclaim-zones=fr-par-1 (report-only); --set macosFleet.orphanReclaim.activeReclaim=true adds --orphan-reclaim-claim-name-prefix=tuist-tuist-.

Follow-up

  • The sweep is report-only by default. Set macosFleet.orphanReclaim.activeReclaim: true per env (after confirming the env has its own Scaleway project) to enable active reclaim.
  • A scaleway_orphan_servers > 0 alert and a sustained-NoAvailableHost alert are the natural companions; not included here.

🤖 Generated with Claude Code

Comments
fortmarek Jun 3, 2026

Both findings addressed in d13cc83498.

High #1 — stale ownership snapshot. reclaimOnce no longer mutates from the initial snapshot. It now collects stranded candidates during the per-zone scan, then re-checks ownership against the live API (mgr.GetAPIReader(), uncached) immediately before ReleaseToPool, skipping any host adopted since the scan (matched by name or serverID — the claim renames the pool host to the CR’s name, so a name match catches a just-adopted host even before its status.serverID is written). A failed re-check aborts reclaim for the cycle rather than acting on stale data. New test TestOrphanReclaimer_SkipsHostAdoptedSinceScan: detection client has no owner, the re-check reader does → no release.

High #2 — ReleaseToPool exposes an adoptable host mid-release. ReleaseToPool now holds the same adoptMu as AdoptFromPool across the rename + reinstall request, so an adoption scan can’t observe the host in the pool-named-but-still-ready window. This also fixes the pre-existing variant on the delete path (the sweeper just made it more reachable). No re-entrancy: neither delete nor the sweeper holds adoptMu when calling ReleaseToPool.

Both verified under go test -race. As you noted, both were latent in report-only mode and only bite once claimNamePrefix is set; the committed values keep it report-only by default.

tuist-atlas[bot] Jun 4, 2026

The feature to reclaim stranded Scaleway Apple-silicon hosts back to the pool is now available in capi-scaleway@0.8.0. Update to this version to get the OrphanReclaimer controller that detects and reclaims stranded hosts, plus the controller-level DefaultAdoptPoolPrefix fix for legacy CR deletions.