Hive Hive
Sign in

feat(capi-scaleway): pool-release semantics + fix the broken billing-floor fallback

GitHub issue · Closed

Metadata
Source
tuist/tuist #10922
Updated
Jun 24, 2026
Domains
Compute
Details

What was broken

Run 26219893038 timed out after 35 minutes with all three CAPI MachineDeployments stuck at 0/1 available replicas. The cluster events show why:

scalewayapplesiliconmachine/tuist-tuist-macos-fleet-0
Scaleway DeleteServer: scaleway-sdk-go: precondition failed:
unknown precondition, this server cannot be deleted before 2026-06-12 07:06:59 (will retry)

CAPI was mid-roll. Scaleway refused DeleteServer because the Mac mini was inside its multi-week Apple-licensing billing floor. With strategy: OnDelete CAPI can’t bring up the replacement until the old Machine finalizes, the controller retried every 60s, and helm upgrade --atomic --wait hung against the 0/1 MachineDeployment for the full 30-min helm timeout. The pipeline wedged on a stuck fleet reconciliation that has nothing to do with the server change being shipped.

Two independent things were wrong:

  1. The controller’s billing-floor fallback was unreachable. DeleteServer already has a deliberate UpdateServer(scheduleDeletion=true) fallback for HTTP 412, but the predicate isPreconditionFailed only checked *scw.ResponseError, and scaleway-sdk-go’s hasResponseError returns the more specific *scw.PreconditionFailedError for parsed standard errors. The fallback rotted unreachable with zero test coverage.
  2. The lifecycle model was wrong. Cluster-driven Machine churn (rolls, scaling, replacement) shouldn’t trigger physical-host destruction — the operator already owns host pre-ordering through the tuist-pool- workflow, so they should own host destruction too. Coupling the two lifecycles put the 24h Apple licensing floor directly into the deploy critical path.

Fixes

Fix 1 — controller predicate

isPreconditionFailed now matches both *scw.PreconditionFailedError (the typed return after hasResponseError’s unmarshalStandardError) and the generic *scw.ResponseError with StatusCode == 412 (the fallback for unparsed responses). The schedule-deletion fallback was always meant to be the safety net for the terminate path; this makes it reachable. New tests pin both error shapes plus the two interesting fallback edges. The first one fails verbatim against the old code with the exact error string from the failed canary run.

Fix 2 — lifecycle model

New ReleaseToPool(id, zone, poolPrefix) on the Scaleway client renames a server back into <poolPrefix><uuid> and triggers Scaleway’s ReinstallServer (disk wipe + reimage with the server type’s default OS, ~15-20 min async on M2-L). reconcileDelete branches: pool-adopted Machines (the default) go through ReleaseToPool; auto-order Machines and any Machine annotated tuist.dev/release-policy: terminate keep the legacy DeleteServer path.

Cluster Machine churn no longer destroys physical hosts. The Mac mini stays alive, comes back to factory-default state, and is re-eligible for the next AdoptByPrefix once Scaleway flips it to Delivered + Ready. The 24h billing floor doesn’t enter cluster reconciliation at all. Bootstrap doesn’t need to be re-entrant across multiple adoptions because every adopt sees a freshly reinstalled host.

ReleaseToPool is idempotent across controller crashes: rename to a fresh UUID is safe to repeat (both pool names are valid), a second reinstall while one is in flight returns a TransientStateError we swallow, and 404 on either step means an operator-deleted host so we let the Machine finalizer drop cleanly. Reinstall itself is fire-and-forget — AdoptByPrefix‘s Delivered + Status == Ready filter naturally excludes a host while it’s reinstalling, so we don’t hold up Machine finalization waiting for Scaleway to finish the wipe.

Operator opt-out for broken hosts that shouldn’t re-adopt (kernel panic loop, hardware fault, retired SKU):

kubectl -n "$NS" annotate scalewayapplesiliconmachine <name> \
tuist.dev/release-policy=terminate --overwrite
kubectl -n "$NS" delete machine <machine-name>

That branch uses DeleteServer, which is exactly where Fix 1’s schedule-deletion fallback now reliably fires.

What I considered and pulled out

I had a Layer 3 in here that switched the deploy workflow to helm 4’s --wait=hookOnly plus an explicit kubectl rollout status loop, as defense-in-depth against future async-reconciliation hiccups stalling deploys. Pulled it out — the root cause is fixed at the controller and there’s no concrete second failure mode worth defending against speculatively.

While I was in there, server-deployment.yml already carries a fair amount of ad-hoc bash (recover-interrupted-helm, clean-stale-jobs, Kura runtime tag resolution, watcher process, manual rollback) and adding more crowds out a cleaner refactor. Worth a separate conversation: extract the deploy logic into a versioned shell script (infra/mise/tasks/k8s/deploy-server.sh) so it gets shellcheck, IDE highlighting, and local-run support, or rebuild the rollout around something more elaborate than bash-in-YAML (helmfile, ArgoCD/Flux, or a small Go binary). Not for this PR — flagging it as a follow-up to discuss separately.

Test plan

  • go test ./... in the controller module passes; new tests cover both the predicate fix and the pool-release semantics + branching matrix (17 new tests total: 4 TestDeleteServer_*, 6 TestReleaseToPool_*, 4 TestShouldReleaseToPool_*, 3 TestReconcileDelete_*).
  • PR merges to main and the next server-production-deployment workflow runs canary → acceptance tests → production end-to-end. With the controller image rolled out, cluster Machine churn renames + reinstalls hosts instead of calling DeleteServer, so MachineDeployments converge within minutes and helm upgrade --atomic --wait no longer hangs on the billing floor.
  • Confirm a routine kubectl delete machine against any pool-adopted Machine renames the Scaleway server into the pool namespace and triggers a reinstall (no DeleteServer call).
  • Confirm the stuck tuist-tuist-macos-fleet-0 Machine in canary clears either via the schedule-deletion fallback (Fix 1) or, more likely, because the new release path renamed it back to pool and triggered a reinstall — either way the MachineDeployment converges and the operator can decide whether to physically release the underlying host via the Scaleway console at their leisure.
Comments

No GitHub comments yet.