Hive
feat(infra): drift macOS host config on a fleet-wide config hash
GitHub issue · Closed
What
The Scaleway Apple Silicon CAPI provider’s drift loop re-pushes host config to already-Ready Mac minis only when the tart-kubelet binary SHA changes. This PR broadens the trigger to a fleet-wide host-config hash so a change to anything the operator pushes propagates on the next reconcile:
- New bootstrap.HostConfigHash(cfg): a sha256 over every rendered install script (firewall + vmnat, PN interface, launchd job + plist, tailscale, node_exporter, tart-kubelet install) plus the SHA of each embedded binary (tart-kubelet, tailscale, node_exporter).
- The manager computes it once at startup (like the binary SHA) from a Config carrying only operator-image + fleet-config inputs, with every per-host field zeroed (NodeName, IP, kubeconfig, VLAN, auth key, DisableVMGC, …), so the fingerprint is identical across the fleet.
- New Status.HostConfigHash on the CR (Go type and the helm-chart CRD schema; see below); the reconciler stamps it on a successful re-push.
- The drift gate moves from binaryDrift (tart-kubelet SHA only) to configDrift (the host-config hash). TartKubeletBinarySHA is still computed and stamped for observability.
- The install functions’ inlined script := strings are extracted into pure render*(cfg) string helpers shared by the installer and the hash; what gets pushed to hosts is byte-for-byte unchanged.
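For readers who want the shape of the fingerprint, here is a minimal sketch of hashing an ordered set of rendered scripts and binary SHAs into one sha256. It is not the code from this PR: the part type, the hashParts helper, and the placeholder inputs are all hypothetical, and the real HostConfigHash may frame its inputs differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// part is one named input to the fingerprint: a rendered script body or the
// SHA of an embedded binary. (Hypothetical type, for illustration only.)
type part struct {
	name string
	data string
}

// hashParts returns a hex sha256 over the parts in the order given.
// Each name and payload is length-prefixed so that moving bytes between
// adjacent parts cannot produce the same digest.
func hashParts(parts []part) string {
	h := sha256.New()
	for _, p := range parts {
		fmt.Fprintf(h, "%d:%s\n%d:", len(p.name), p.name, len(p.data))
		h.Write([]byte(p.data))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	parts := []part{
		{name: "firewall.sh", data: "#!/bin/sh\necho placeholder\n"}, // placeholder script body
		{name: "tart-kubelet.sha256", data: "abc123"},                // placeholder binary SHA
	}
	fmt.Println(hashParts(parts))
}
```

The order of the parts is part of the digest, so a real implementation would want a fixed, deterministic ordering of scripts and binaries.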
Why
Last week’s macOS runner cache outage needed three host-config fixes (a NAT vlan-detection fix, a pfctl direct-anchor-load fix, an EM pn0 hardening). Each landed in the operator image, but the drift loop ignored them: it only re-pushes on a tart-kubelet binary change, so a firewall/script-only or fleet-config-only fix never reached the already-provisioned hosts. The fixes had to be hand-applied over SSH to all 9 hosts and the CRs force-patched to re-trigger the loop. This PR closes that gap: a script tweak, a fleet CIDR/tag/accept-routes change, or a re-baked tailscale/node_exporter binary now rolls to existing hosts automatically.
Why a fleet-wide canonical hash (over per-host or operator-binary)
- Per-host hash would have to exclude volatile fields (kubeconfig token, auth key) by hand or it re-pushes on every token rotation. Zeroing all per-host fields and computing one fleet-wide hash sidesteps that entirely — there are no per-host inputs left to churn.
- Hashing the operator binary would fire on every unrelated provider code change (controller bugfixes, base-image bumps), re-pushing for no host-config reason. Hashing the rendered scripts + pushed binaries is the right granularity.
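The canonicalization argument can be illustrated with a small sketch. The Config struct below is a hypothetical stand-in: NodeName, IP, Kubeconfig, AuthKey and DisableVMGC mirror names used in this description, while FleetCIDR, AcceptRoutes, the field types, and the fleetOnly helper are assumptions. Building the canonical value from an allow-list of fleet fields means a per-host field added later is excluded by default.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the bootstrap config.
type Config struct {
	// Per-host inputs: must not influence the fleet-wide fingerprint.
	NodeName    string
	IP          string
	Kubeconfig  string
	AuthKey     string
	DisableVMGC bool

	// Fleet-wide inputs: the only fields the fingerprint should see.
	FleetCIDR    string
	AcceptRoutes bool
}

// fleetOnly returns a fresh Config carrying only the fleet-wide fields.
// Every per-host field is left at its zero value.
func fleetOnly(c Config) Config {
	return Config{
		FleetCIDR:    c.FleetCIDR,
		AcceptRoutes: c.AcceptRoutes,
	}
}

func main() {
	a := Config{NodeName: "mac-1", IP: "10.0.0.11", AuthKey: "key-a", FleetCIDR: "10.0.0.0/24"}
	b := Config{NodeName: "mac-2", IP: "10.0.0.12", AuthKey: "key-b", FleetCIDR: "10.0.0.0/24"}

	// Different hosts, rotated keys: the canonical form is still identical.
	fmt.Println(fleetOnly(a) == fleetOnly(b)) // true

	// A fleet-level change is visible in the canonical form.
	b.FleetCIDR = "10.1.0.0/24"
	fmt.Println(fleetOnly(a) == fleetOnly(b)) // false
}
```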
TartTarball is deliberately excluded from the binary set: Tart is bootstrap-only (the hypervisor can’t be swapped under running VMs, so UpdateTartKubelet never re-installs it); a Tart bump rolls via Machine replacement, not config drift.
CRD registration (important)
The ScalewayAppleSiliconMachine CRD status schema is structural and enumerates its properties with no x-kubernetes-preserve-unknown-fields, so the API server prunes any status field not in the schema. The new hostConfigHash is therefore added to the checked-in CRD (infra/helm/tuist/crds/...scalewayapplesiliconmachines.yaml, auto-applied per-env by the deploy pipeline). Without it, Status.HostConfigHash would be silently dropped on write → read back empty every reconcile → configDrift stays true → the loop re-pushes the host config on every reconcile. (The helm CRD is a hand-maintained artifact — it carries chart labels beyond raw controller-gen output — so the field is added surgically rather than by a lossy raw regen.)
Migration
Existing Machines have an empty Status.HostConfigHash, so the first reconcile after this operator image rolls out drifts once and re-pushes — the intended one-time migration. The re-push is zero-downtime: running Tart VMs survive UpdateTartKubelet (nohup-detached, re-bound by the kubelet’s startup state recovery). Terminal-failed CRs stay excluded until Status.FailureReason is cleared.
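The gate behaviour described above, including the one-time migration, can be sketched as a pure function. The name configDrift comes from this description, but the signature and body here are assumptions rather than the actual reconciler code.

```go
package main

import "fmt"

// configDrift reports whether a host needs its config re-pushed.
// machineHash is the value stamped on the CR status; operatorHash is the
// fingerprint the operator computed at startup. (Assumed signature.)
func configDrift(machineHash, operatorHash string) bool {
	if operatorHash == "" {
		// The operator has no fingerprint to compare against: never drift.
		return false
	}
	// An empty machineHash differs from any non-empty operatorHash, so
	// pre-existing Machines drift exactly once, until the hash is stamped.
	return machineHash != operatorHash
}

func main() {
	fmt.Println(configDrift("abc", "abc")) // false: hashes match
	fmt.Println(configDrift("old", "new")) // true: config changed
	fmt.Println(configDrift("", "new"))    // true: migration case
	fmt.Println(configDrift("abc", ""))    // false: empty operator hash
}
```

Treating an empty operator hash as "no drift" keeps a misconfigured or partially started operator from re-pushing to the whole fleet.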
Validation
- Both Go modules (macos-host-bootstrap, cluster-api-provider-scaleway-applesilicon): go build ./..., go vet ./..., gofmt -l . (clean), go test ./... all pass.
- New bootstrap tests: hash is stable for the same config, independent of per-host fields (incl. VMCachePNVLAN, DisableVMGC), changes on a fleet-config field (CIDR/tags/accept-routes) and on any embedded-binary change.
- New controller test TestHostConfigDrift covers match / mismatch / the empty-machine-hash migration case / empty-operator-hash (never drifts).
- CRD YAML re-parsed: status.properties.hostConfigHash present as a string alongside tartKubeletBinarySHA.
- Verified the render* extraction is byte-faithful via git diff (only script := assignments and RunCommand* wrappers moved; no heredoc body changed).
🤖 Generated with Claude Code