fix(infra): handle sudo password + add tiered BootstrapFailed recovery
GitHub issue · Closed
Two related provider-robustness fixes surfaced by a production fleet bring-up incident this week (see follow-up #5 in the original incident PR).
(1) <sealed> sudo password handling
The Scaleway Apple Silicon API can return the literal placeholder <sealed> in both sudo_password and vnc_url while macOS Tahoe’s OS-level seal of the auto-login credential is in effect (observed on a fresh adopt and during reboot windows). The codebase already has a fallback that re-fetches the password from vnc_url when the top-level field is empty, but the check was a bare == "" comparison, so <sealed> passed as a non-empty (and therefore "valid") password.
Bootstrap then handed <sealed> to sudo verbatim, sudo rejected it with "Sorry, try again", and after a handful of retries pam_opendirectory locked the m1 account, blocking the host until a manual reboot.
Treat the marker as missing in both surfaces:
- scalewayServerToServer falls through to vnc_url when the top-level field is <sealed> (with <sealed> from vnc_url’s embedded password component also rejected).
- The controller’s recovery gate also treats <sealed> in the staged Secret as missing, so a Secret persisted from a pre-fix reconcile self-heals on the next loop.
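The sealed-marker check described above can be sketched roughly as follows. This is an illustration, not the code in this PR: the helper names (isMissing, passwordFromVNCURL, resolveSudoPassword) are hypothetical, and it assumes vnc_url has the shape vnc://user:password@host:port.

```go
package main

import "net/url"

// sealedMarker is the literal placeholder the API can return while the
// credential is sealed.
const sealedMarker = "<sealed>"

// isMissing treats both the empty string and the sealed placeholder as
// "no usable credential".
func isMissing(s string) bool {
	return s == "" || s == sealedMarker
}

// passwordFromVNCURL extracts the password component from a URL of the
// ASSUMED form vnc://user:password@host:port. It returns "" when the URL is
// missing, is itself the marker, cannot be parsed, or embeds the marker as
// its password.
func passwordFromVNCURL(raw string) string {
	if isMissing(raw) {
		return ""
	}
	u, err := url.Parse(raw)
	if err != nil || u.User == nil {
		return ""
	}
	pw, ok := u.User.Password()
	if !ok || isMissing(pw) {
		return ""
	}
	return pw
}

// resolveSudoPassword prefers the top-level field and falls back to vnc_url.
// An empty result means "not available yet", so the caller should requeue
// instead of handing anything to sudo.
func resolveSudoPassword(sudoPassword, vncURL string) string {
	if !isMissing(sudoPassword) {
		return sudoPassword
	}
	return passwordFromVNCURL(vncURL)
}
```

Under these assumptions, resolveSudoPassword("<sealed>", "vnc://m1:hunter2@198.51.100.7:5900") yields "hunter2", while a sealed or marker-carrying vnc_url yields the empty string. The same isMissing predicate could back the controller's check on the staged Secret.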
(2) Tiered host recovery on persistent BootstrapFailed
On bootstrap.Run error the reconciler previously only requeued after 60s, which meant a host stuck in an unrecoverable state (stale authorized_keys from a previous tenant, OS corruption, wedged sshd) got SSH-hammered indefinitely against the same broken server. New tiered recovery via handleBootstrapFailure:
- Tier 1 (--bootstrap-reboot-after, default 3): the controller asks Scaleway to reboot the host once. Clears volatile state (PAM lockouts, sshd connection throttling, half-open sessions) without paying for the 5-15 min disk reinstall. Gated on Status.BootstrapRebootIssued so the reboot fires once per host even across a long retry tail.
- Tier 2 (--bootstrap-max-attempts, default 8): the controller returns the host to the adopt pool via ReleaseToPool. Scaleway’s ReinstallServer then wipes the disk and the next reconcile claims a different mini via AdoptFromPool. ServerID + counters reset because they describe the now-discarded host.
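The tier selection above can be read as a small pure function. The sketch below is illustrative only: the type and function names are hypothetical, and it assumes a tier triggers once the failure count is greater than or equal to its threshold, with 0 meaning the tier is disabled.

```go
package main

// recoveryAction is what the reconciler does after a failed bootstrap.
type recoveryAction int

const (
	actionRequeue recoveryAction = iota // plain retry after the requeue delay
	actionReboot                        // Tier 1: one-shot reboot
	actionRelease                       // Tier 2: release to pool (reinstall)
)

// thresholds mirrors --bootstrap-reboot-after and --bootstrap-max-attempts.
// A value of 0 disables the corresponding tier.
type thresholds struct {
	RebootAfter int
	MaxAttempts int
}

// decide picks the recovery action for a host that has failed bootstrap
// `attempts` times. rebootIssued corresponds to Status.BootstrapRebootIssued.
// ASSUMPTION: thresholds are compared with >=.
func decide(attempts int, rebootIssued bool, th thresholds) recoveryAction {
	// Tier 2 is checked first so a host past the maximum is released even
	// if the reboot tier never ran.
	if th.MaxAttempts > 0 && attempts >= th.MaxAttempts {
		return actionRelease
	}
	// Tier 1 fires at most once per host.
	if th.RebootAfter > 0 && attempts >= th.RebootAfter && !rebootIssued {
		return actionReboot
	}
	return actionRequeue
}
```

With the defaults (3 and 8), a host would requeue on failures 1 and 2, reboot on failure 3, requeue on failures 4 through 7, and be released on failure 8.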
Each tier degrades gracefully on Scaleway API errors:
- The RebootServer 5xx path does not consume the one-shot flag, so the next reconcile retries the cheap recovery rather than skipping straight to the expensive reinstall.
- A release-tier API error keeps Status.ServerID populated so we never start adopting a new host while the old one is still wedged in our account.
Both thresholds can be set to 0 to disable the corresponding tier (operator escape hatch for fleet-wide disruption).
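The error-handling rule for each tier comes down to mutating status only after the API call succeeds. A minimal sketch under stated assumptions: the hostAPI interface, its method signatures, the BootstrapAttempts field, and the flakyAPI fake are all hypothetical stand-ins, not the provider's real client or CRD types.

```go
package main

import "errors"

// hostAPI is a HYPOTHETICAL stand-in for the provider client.
type hostAPI interface {
	RebootServer(serverID string) error
	ReleaseToPool(serverID string) error
}

// machineStatus holds the fields relevant to recovery. ServerID and
// BootstrapRebootIssued are named in the PR; BootstrapAttempts is assumed.
type machineStatus struct {
	ServerID              string
	BootstrapAttempts     int
	BootstrapRebootIssued bool
}

// errUnavailable simulates a transient 5xx from the API.
var errUnavailable = errors.New("scaleway: 503 service unavailable")

// tryReboot sets the one-shot flag only after the reboot call succeeds, so
// an API error leaves the cheap tier available on the next reconcile.
func tryReboot(api hostAPI, st *machineStatus) error {
	if err := api.RebootServer(st.ServerID); err != nil {
		return err // flag untouched
	}
	st.BootstrapRebootIssued = true
	return nil
}

// tryRelease clears ServerID and the counters only after the release call
// succeeds, so a failed release never leads to adopting a second host.
func tryRelease(api hostAPI, st *machineStatus) error {
	if err := api.ReleaseToPool(st.ServerID); err != nil {
		return err // ServerID and counters untouched
	}
	*st = machineStatus{}
	return nil
}

// flakyAPI is a test fake that returns the configured errors.
type flakyAPI struct{ rebootErr, releaseErr error }

func (f flakyAPI) RebootServer(string) error  { return f.rebootErr }
func (f flakyAPI) ReleaseToPool(string) error { return f.releaseErr }
```

Calling tryReboot with flakyAPI{rebootErr: errUnavailable} leaves BootstrapRebootIssued false, and calling tryRelease with flakyAPI{releaseErr: errUnavailable} leaves ServerID in place. Both match the degradation behaviour listed above.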
What this PR doesn’t change
- ReleaseToPool already calls ReinstallServer (full disk wipe + factory image), so the clean-slate guarantee for the next pool consumer is unchanged.
- Tier 1’s reboot is purely a recovery escalation, not a step in the normal release path. Releasing a working mini at MachineDeployment scale-down still goes straight to the reinstall flow.
I considered making Tier 1’s reboot fire on every release-to-pool (as a fast clean-state hop before the slow reinstall), but kept it scoped to the recovery path because the reinstall already disk-wipes — a pre-release reboot would just add ~2 min latency to every scale-down without changing what arrives in the pool.
Happy to revisit if you’d rather see the reboot tier integrated into the release path or removed entirely (Tier 2 alone covers the original incident).
Test plan
- go build ./... clean in the provider module
- go vet ./... clean
- go test ./... (all pre-existing tests still pass)
- New unit tests:
  - TestScalewayServerToServer_FallsBackToVncURLWhenSudoPasswordIsSealed
  - TestScalewayServerToServer_RejectsSealedMarkerFromVncURL (both "vnc_url is the marker" and "vnc_url embeds the marker as password" cases)
  - TestClientRebootServer_{SwallowsNotFound,SwallowsTransientState,HappyPathRecordsCall}
  - TestHandleBootstrapFailure_IncrementsAttempts
  - TestHandleBootstrapFailure_RebootsAtThresholdOnce (one-shot guard)
  - TestHandleBootstrapFailure_ReleasesToPoolAtMax
  - TestHandleBootstrapFailure_ReleaseAPIErrorKeepsState
  - TestHandleBootstrapFailure_RebootAPIErrorDoesNotConsumeOneShot
  - TestHandleBootstrapFailure_LegacyCRWithoutPoolPrefixNeverReleases
  - TestHandleBootstrapFailure_DisabledThresholdsSkipBothTiers
CRD update applied to infra/helm/tuist/crds/infrastructure.cluster.x-k8s.io_scalewayapplesiliconmachines.yaml so the new status fields land in the chart at deploy time.
🤖 Generated with Claude Code