fix(infra): handle sudo password + add tiered BootstrapFailed recovery

GitHub issue · Closed

Open on GitHub

Metadata

Source

tuist/tuist #11013

Updated

Jun 24, 2026

Domains

Compute

Details

Two related provider-robustness fixes surfaced by a production fleet bring-up incident this week (see follow-up #5 in the original incident PR).

(1) `<sealed>` sudo password handling

The Scaleway Apple Silicon API can return the literal placeholder <sealed> in both sudo_password and vnc_url while macOS Tahoe’s OS-level seal of the auto-login credential is in effect — observed on a fresh adopt and during reboot windows. The codebase already has a fallback that re-fetches the password from vnc_url when the top-level field is empty, but the check was a bare == \"\" that accepted <sealed> as a non-empty (and therefore "valid") password.

Bootstrap then handed <sealed> to sudo verbatim, sudo rejected it as Sorry, try again, and after a handful of retries pam_opendirectory locked the m1 account — blocking the host until a manual reboot.

Treat the marker as missing in both surfaces:

scalewayServerToServer falls through to vnc_url when the top-level field is <sealed> (with <sealed> from vnc_url’s embedded password component also rejected).
The controller’s recovery gate also treats <sealed> in the staged Secret as missing, so a Secret persisted from a pre-fix reconcile self-heals on the next loop.

(2) Tiered host recovery on persistent BootstrapFailed

On bootstrap.Run error the reconciler previously only requeued after 60s, which meant a host stuck in an unrecoverable state (stale authorized_keys from a previous tenant, OS corruption, wedged sshd) got SSH-hammered indefinitely against the same broken server. New tiered recovery via handleBootstrapFailure:

Tier 1 (--bootstrap-reboot-after, default 3) — the controller asks Scaleway to reboot the host once. Clears volatile state (PAM lockouts, sshd connection throttling, half-open sessions) without paying for the 5-15 min disk reinstall. Gated on Status.BootstrapRebootIssued so the reboot fires once per host even across a long retry tail.
Tier 2 (--bootstrap-max-attempts, default 8) — the controller returns the host to the adopt pool via ReleaseToPool. Scaleway’s ReinstallServer then wipes the disk and the next reconcile claims a different mini via AdoptFromPool. ServerID + counters reset because they describe the now-discarded host.

Each tier degrades gracefully on Scaleway API errors:

The RebootServer 5xx path does not consume the one-shot flag, so the next reconcile retries the cheap recovery rather than skipping straight to the expensive reinstall.
A release-tier API error keeps Status.ServerID populated so we never start adopting a new host while the old one is still wedged in our account.

Both thresholds can be set to 0 to disable the corresponding tier (operator escape hatch for fleet-wide disruption).

What this PR doesn’t change

ReleaseToPool already calls ReinstallServer (full disk wipe + factory image) — clean-slate guarantee for the next pool consumer is unchanged.
Tier 1’s reboot is purely a recovery escalation, not a step in the normal release path. Releasing a working mini at MachineDeployment scale-down still goes straight to the reinstall flow.

I considered making Tier 1’s reboot fire on every release-to-pool (as a fast clean-state hop before the slow reinstall), but kept it scoped to the recovery path because the reinstall already disk-wipes — a pre-release reboot would just add ~2 min latency to every scale-down without changing what arrives in the pool.

Happy to revisit if you’d rather see the reboot tier integrated into the release path or removed entirely (Tier 2 alone covers the original incident).

Test plan

go build ./... clean in the provider module
go vet ./... clean
go test ./... — all pre-existing tests still pass
New unit tests:
- TestScalewayServerToServer_FallsBackToVncURLWhenSudoPasswordIsSealed
- TestScalewayServerToServer_RejectsSealedMarkerFromVncURL (both "vnc_url is the marker" and "vnc_url embeds the marker as password" cases)
- TestClientRebootServer_{SwallowsNotFound,SwallowsTransientState,HappyPathRecordsCall}
- TestHandleBootstrapFailure_IncrementsAttempts
- TestHandleBootstrapFailure_RebootsAtThresholdOnce (one-shot guard)
- TestHandleBootstrapFailure_ReleasesToPoolAtMax
- TestHandleBootstrapFailure_ReleaseAPIErrorKeepsState
- TestHandleBootstrapFailure_RebootAPIErrorDoesNotConsumeOneShot
- TestHandleBootstrapFailure_LegacyCRWithoutPoolPrefixNeverReleases
- TestHandleBootstrapFailure_DisabledThresholdsSkipBothTiers

CRD update applied to infra/helm/tuist/crds/infrastructure.cluster.x-k8s.io_scalewayapplesiliconmachines.yaml so the new status fields land in the chart at deploy time.

🤖 Generated with Claude Code

Comments

No GitHub comments yet.

fix(infra): handle sudo password + add tiered BootstrapFailed recovery

(1) <sealed> sudo password handling

(2) Tiered host recovery on persistent BootstrapFailed

What this PR doesn’t change

Test plan

(1) `<sealed>` sudo password handling