What changed
Adds a convergent backstop and a source-level fix so Scaleway Apple-silicon Mac minis can no longer strand outside the adopt pool.
- OrphanReclaimer (controllers/orphan_reclaimer.go): a leader-gated periodic sweep. Each cycle it lists Scaleway hosts per zone (the zones from --orphan-reclaim-zones unioned with every live CR’s zone), diffs them against the live ScalewayAppleSiliconMachine CRs, and treats a host as stranded when it is not in the pool, not mid-adoption, and matches no live CR by name or status.serverID. Stranded hosts are exported on a scaleway_orphan_servers gauge and, when active reclaim is enabled, returned to the pool via the existing ReleaseToPool (rename + reinstall). Before each release it re-checks ownership against the live API (uncached APIReader) so a host adopted since the scan is never reclaimed.
- Controller-level DefaultAdoptPoolPrefix: reconcileDelete now falls back to it when a CR’s Spec.AdoptPoolPrefix is empty, so legacy CRs release on delete instead of skipping. This closes the leak at the source for new deletions.
- Chart wiring is a single macosFleet.orphanReclaim.activeReclaim toggle (default false = report-only). Everything else is derived from existing values: the pool prefix from macosFleet.adoptPoolPrefix, the swept zones from the fleets’ machine.zone, and the claimed-name prefix from the chart name (tuist-tuist-). The binary keeps its flags as a generic provider, but the chart computes them rather than duplicating config.
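The strand check described above can be sketched as a pure function over one cycle's snapshot. This is a minimal illustration, not the code in controllers/orphan_reclaimer.go: the Host and Owner types, the findStranded name, and the adopting map are all hypothetical stand-ins for whatever the provider actually uses.

```go
package main

import (
	"fmt"
	"strings"
)

// Host is a hypothetical, trimmed-down view of a Scaleway server.
type Host struct {
	ID   string
	Name string
}

// Owner is a hypothetical view of a live CR: its name and status.serverID.
type Owner struct {
	Name     string
	ServerID string
}

// findStranded returns the hosts that are not in the pool, not mid-adoption,
// and match no live CR by name or by server ID.
func findStranded(hosts []Host, owners []Owner, poolPrefix string, adopting map[string]bool) []Host {
	names := make(map[string]bool, len(owners))
	ids := make(map[string]bool, len(owners))
	for _, o := range owners {
		names[o.Name] = true
		if o.ServerID != "" {
			ids[o.ServerID] = true
		}
	}

	var stranded []Host
	for _, h := range hosts {
		switch {
		case poolPrefix != "" && strings.HasPrefix(h.Name, poolPrefix):
			// Still in the adopt pool.
		case adopting[h.ID]:
			// An adoption is in flight for this host.
		case names[h.Name], ids[h.ID]:
			// Owned by a live CR (by name, or by serverID as a backstop).
		default:
			stranded = append(stranded, h)
		}
	}
	return stranded
}

func main() {
	hosts := []Host{
		{ID: "a", Name: "tuist-pool-1"},
		{ID: "b", Name: "tuist-tuist-fleet-x"},
		{ID: "c", Name: "tuist-tuist-fleet-y"},
	}
	owners := []Owner{{Name: "tuist-tuist-fleet-x"}}
	for _, h := range findStranded(hosts, owners, "tuist-pool-", nil) {
		fmt.Println("stranded:", h.Name)
	}
}
```

In this sketch the length of the returned slice is what a gauge like scaleway_orphan_servers would report; whether a stranded host is also mutated is a separate decision.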
Why
The provider only releases a claimed host on the per-Machine reconcileDelete path. A host strands whenever that path doesn’t run to completion:
- A legacy CR with an empty AdoptPoolPrefix (created before the field was required) hit the documented skip branch: no release, just a ReleaseSkipped warning event.
- A force-delete or owner-reference GC that bypasses the finalizer.
- A crash between claiming a pool host and writing its CR.
A stranded host carries a CR-style Scaleway name but is owned by no CR, so nothing ever looks at it again. It keeps billing under Apple’s 24h floor, and because the claim drained it from the tuist-pool-* pool, it silently shrinks the capacity AdoptFromPool draws from. That depletion is what later surfaced as NoAvailableHost: a builders-fleet MachineDeployment stuck ScalingUp, which a server deploy’s helm --wait (kstatus watcher) gated on, wedging the deploy until timeout.
A confirmed instance: tuist-tuist-builders-fleet-rg4h9-c295c (an M2-L billing for 14 days, no CR), almost certainly leaked via the legacy-CR skip path.
Why this design over the obvious alternatives
- Sweep, not just fix the delete path. Fixing reconcileDelete (the DefaultAdoptPoolPrefix change) only prevents future strands from the legacy-CR cause, and can’t touch what already leaked or what bypasses the finalizer entirely. A convergent sweep catches every strand cause regardless of how it happened, and self-heals existing leaks. Both are included; they’re complementary.
- Name as the owned key. The claim renames a pool host to its CR’s name (AdoptFromPool(ctx, machine.Name, ...)), and the CR always exists before that rename. So a live CR name is an authoritative “owned” signal with no claimed-but-CR-missing window short of an actual delete. serverID is matched too, as a backstop.
- Report-only by default; active reclaim gated. Two hosts must never be reclaimed even though they look unowned: one an operator is mid-provisioning under a Scaleway-default name (e.g. apple-silicon-romantic-* before the rename to tuist-pool-*), and one owned by another cluster that shares the Scaleway project (its CR isn’t visible to this cluster’s API server). The claim-name prefix (tuist-tuist-) scopes active reclaim to this cluster’s claimed namespace; with it unset the sweep only reports. Per the provider’s AGENTS.md, each environment already runs its own Scaleway IAM application/project, so the sweep is project-scoped and the cross-env case is moot in practice, but the gate is kept as defense in depth.
- Derive chart config, don’t duplicate it. The pool prefix, zones, and claimed-name prefix already live in the chart; the template computes the flags from them so there’s a single source of truth and no drift. The only chart knob is the one thing the chart can’t safely decide on its own: whether the sweep may mutate.
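Two of the decisions above reduce to small predicates: the delete-path fallback to the controller default, and the claim-name prefix gate on active reclaim. The sketch below shows one way they could look; effectivePoolPrefix and mayReclaim are hypothetical names, and the second host name in main is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// effectivePoolPrefix prefers the CR's own prefix and falls back to the
// controller-level default when the CR's field is empty (the legacy-CR case).
// An empty result means there is nothing to release to, so the caller skips.
func effectivePoolPrefix(specPrefix, defaultPrefix string) string {
	if specPrefix != "" {
		return specPrefix
	}
	return defaultPrefix
}

// mayReclaim reports whether a stranded host may be mutated. With no
// claim-name prefix configured the sweep is report-only; with one set, only
// hosts inside this cluster's claimed namespace qualify.
func mayReclaim(hostName, claimNamePrefix string) bool {
	return claimNamePrefix != "" && strings.HasPrefix(hostName, claimNamePrefix)
}

func main() {
	fmt.Println(effectivePoolPrefix("", "tuist-pool-"))

	for _, name := range []string{
		"tuist-tuist-builders-fleet-rg4h9-c295c",
		"apple-silicon-romantic-example", // hypothetical Scaleway-default name
	} {
		fmt.Println(name, mayReclaim(name, "tuist-tuist-"))
	}
}
```

Keeping the gate as a prefix test means a host under a Scaleway-default name is still counted as a stray but never renamed or reinstalled.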
Concurrency safety
- The pre-mutation re-check (uncached APIReader) closes a TOCTOU where a CR created and adopted after the cycle’s initial snapshot could otherwise have its live host reclaimed.
- ReleaseToPool now holds the same adoptMu as AdoptFromPool across its rename + reinstall, so an adoption scan can’t claim a host in the window where it carries the pool prefix but the reinstall hasn’t been requested. (This also hardens the pre-existing delete path; the sweep just made the race more reachable.)
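The two safeguards above can be illustrated with an in-memory model. This is only a sketch under stated assumptions: the pool type, the reinstalling map, and the ownedNow callback (standing in for the uncached re-check) are hypothetical, and the rule that adoption skips a host whose reinstall is pending is an assumption of the model, not something the provider is known to do.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

var errAdopted = errors.New("host was adopted since the scan")

// pool is a hypothetical in-memory model of the Scaleway side.
type pool struct {
	adoptMu      sync.Mutex
	names        map[string]string // server ID -> current name
	reinstalling map[string]bool   // server ID -> reinstall requested
}

func newPool() *pool {
	return &pool{names: map[string]string{}, reinstalling: map[string]bool{}}
}

// releaseToPool holds adoptMu across both steps, so no adoption scan can
// observe a host that already has the pool name but no reinstall request.
func (p *pool) releaseToPool(id, poolName string) {
	p.adoptMu.Lock()
	defer p.adoptMu.Unlock()
	p.names[id] = poolName
	p.reinstalling[id] = true
}

// adoptFromPool takes the same lock and, in this model, skips hosts whose
// reinstall is still pending.
func (p *pool) adoptFromPool(poolPrefix, newName string) (string, bool) {
	p.adoptMu.Lock()
	defer p.adoptMu.Unlock()
	for id, name := range p.names {
		if strings.HasPrefix(name, poolPrefix) && !p.reinstalling[id] {
			p.names[id] = newName
			return id, true
		}
	}
	return "", false
}

// reclaim re-checks ownership immediately before mutating.
func reclaim(p *pool, id, poolName string, ownedNow func(id string) bool) error {
	if ownedNow(id) {
		return errAdopted
	}
	p.releaseToPool(id, poolName)
	return nil
}

func main() {
	p := newPool()
	p.names["s1"] = "tuist-tuist-fleet-x"
	err := reclaim(p, "s1", "tuist-pool-s1", func(string) bool { return true })
	fmt.Println(err, p.names["s1"])
}
```

The re-check sits outside the lock here for brevity; it narrows the window left by the initial snapshot rather than eliminating every interleaving.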
User / operator impact
- Stranded hosts stop being invisible: the scaleway_orphan_servers gauge surfaces them, and a sustained non-zero value is the signal that the pool is leaking (worth an alert).
- With activeReclaim on, strays return to the pool automatically (rename + reinstall), so the pool stops silently draining, which removes one recurring trigger of wedged deploys.
- Default behavior is conservative: detection and the delete-path fallback are on; active reclaim is opted into per-env.
Validation
- go build ./..., go vet ./..., and go test -count=1 -race ./... clean across the provider module.
- New tests: OrphanReclaimer reclaims exactly the stranded in-namespace host while leaving pool, claim-pending, CR-owned-by-name, and CR-owned-by-serverID hosts untouched; a host adopted since the scan is not reclaimed (the re-check); report-only mode mutates nothing but still counts strays on the gauge; ListServers and IsPoolOrAdopting unit tests; reconcileDelete releases a legacy CR via the controller default (and the existing skip-when-no-default test still passes).
- helm template with the production values renders --default-adopt-pool-prefix=tuist-pool- and --orphan-reclaim-zones=fr-par-1 (report-only); --set macosFleet.orphanReclaim.activeReclaim=true adds --orphan-reclaim-claim-name-prefix=tuist-tuist-.
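For readers tracing how the rendered flags feed the sweep, here is a rough sketch of parsing them with the standard flag package and unioning the flag zones with the zones of live CRs. The flag names are the ones rendered above; the config struct, the helper names, the comma-separated zone format, and the second zone in main are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"sort"
	"strings"
)

// config is a hypothetical holder for the three chart-derived flags.
type config struct {
	DefaultAdoptPoolPrefix string
	Zones                  []string
	ClaimNamePrefix        string
}

// parseConfig parses the flags; zones are assumed to be comma-separated.
func parseConfig(args []string) (config, error) {
	fs := flag.NewFlagSet("provider", flag.ContinueOnError)
	var c config
	var zones string
	fs.StringVar(&c.DefaultAdoptPoolPrefix, "default-adopt-pool-prefix", "", "pool prefix used when a CR sets none")
	fs.StringVar(&zones, "orphan-reclaim-zones", "", "comma-separated zones to sweep")
	fs.StringVar(&c.ClaimNamePrefix, "orphan-reclaim-claim-name-prefix", "", "enables active reclaim for names with this prefix")
	if err := fs.Parse(args); err != nil {
		return config{}, err
	}
	for _, z := range strings.Split(zones, ",") {
		if z = strings.TrimSpace(z); z != "" {
			c.Zones = append(c.Zones, z)
		}
	}
	return c, nil
}

// sweepZones returns the sorted union of flag zones and live-CR zones.
func sweepZones(flagZones, crZones []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, z := range append(append([]string{}, flagZones...), crZones...) {
		if z != "" && !seen[z] {
			seen[z] = true
			out = append(out, z)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	c, err := parseConfig([]string{
		"--default-adopt-pool-prefix=tuist-pool-",
		"--orphan-reclaim-zones=fr-par-1",
	})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// "fr-par-3" stands in for a zone that only a live CR mentions.
	fmt.Println(sweepZones(c.Zones, []string{"fr-par-1", "fr-par-3"}))
	fmt.Println("active reclaim:", c.ClaimNamePrefix != "")
}
```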
Follow-up
- The sweep is report-only by default. Set macosFleet.orphanReclaim.activeReclaim: true per env (after confirming the env has its own Scaleway project) to enable active reclaim.
- A scaleway_orphan_servers > 0 alert and a sustained-NoAvailableHost alert are the natural companions; not included here.
🤖 Generated with Claude Code