Hive Hive
Sign in

feat(infra): drift macOS host config on a fleet-wide config hash

GitHub issue · Closed

Metadata
Source
tuist/tuist #11439
Updated
Jun 24, 2026
Domains
Compute
Details

What

The Scaleway Apple Silicon CAPI provider’s drift loop re-pushes host config to already-Ready Mac minis only when the tart-kubelet binary SHA changes. This PR broadens the trigger to a fleet-wide host-config hash so a change to anything the operator pushes propagates on the next reconcile:

  • New bootstrap.HostConfigHash(cfg) — a sha256 over every rendered install script (firewall + vmnat, PN interface, launchd job + plist, tailscale, node_exporter, tart-kubelet install) plus the SHA of each embedded binary (tart-kubelet, tailscale, node_exporter).
  • The manager computes it once at startup (like the binary SHA) from a Config carrying only operator-image + fleet-config inputs, with every per-host field zeroed (NodeName, IP, kubeconfig, VLAN, auth key, DisableVMGC, …), so the fingerprint is identical across the fleet.
  • New Status.HostConfigHash on the CR (Go type and the helm-chart CRD schema — see below); the reconciler stamps it on a successful re-push.
  • The drift gate moves from binaryDrift (tart-kubelet SHA only) to configDrift (the host-config hash). TartKubeletBinarySHA is still computed and stamped for observability.
  • The install functions’ inlined script := strings are extracted into pure render*(cfg) string helpers shared by the installer and the hash — what gets pushed to hosts is byte-for-byte unchanged.

Why

Last week’s macOS runner cache outage needed three host-config fixes (a NAT vlan-detection fix, a pfctl direct-anchor-load fix, an EM pn0 hardening). Each landed in the operator image, but the drift loop ignored them: it only re-pushes on a tart-kubelet binary change, so a firewall/script-only or fleet-config-only fix never reached the already-provisioned hosts. The fix had to be hand-applied over SSH to all 9 hosts and the CRs force-patched to re-trigger the loop. This closes that gap: a script tweak, a fleet CIDR/tag/accept-routes change, or a re-baked tailscale/node_exporter binary now rolls to existing hosts automatically.

Why a fleet-wide canonical hash (over per-host or operator-binary)

  • Per-host hash would have to exclude volatile fields (kubeconfig token, auth key) by hand or it re-pushes on every token rotation. Zeroing all per-host fields and computing one fleet-wide hash sidesteps that entirely — there are no per-host inputs left to churn.
  • Hashing the operator binary would fire on every unrelated provider code change (controller bugfixes, base-image bumps), re-pushing for no host-config reason. Hashing the rendered scripts + pushed binaries is the right granularity.

TartTarball is deliberately excluded from the binary set: Tart is bootstrap-only (the hypervisor can’t be swapped under running VMs, so UpdateTartKubelet never re-installs it); a Tart bump rolls via Machine replacement, not config drift.

CRD registration (important)

The ScalewayAppleSiliconMachine CRD status schema is structural and enumerates its properties with no x-kubernetes-preserve-unknown-fields, so the API server prunes any status field not in the schema. The new hostConfigHash is therefore added to the checked-in CRD (infra/helm/tuist/crds/...scalewayapplesiliconmachines.yaml, auto-applied per-env by the deploy pipeline). Without it, Status.HostConfigHash would be silently dropped on write → read back empty every reconcile → configDrift stays true → the loop re-pushes the host config on every reconcile. (The helm CRD is a hand-maintained artifact — it carries chart labels beyond raw controller-gen output — so the field is added surgically rather than by a lossy raw regen.)

Migration

Existing Machines have an empty Status.HostConfigHash, so the first reconcile after this operator image rolls out drifts once and re-pushes — the intended one-time migration. The re-push is zero-downtime: running Tart VMs survive UpdateTartKubelet (nohup-detached, re-bound by the kubelet’s startup state recovery). Terminal-failed CRs stay excluded until Status.FailureReason is cleared.

Validation

  • Both Go modules (macos-host-bootstrap, cluster-api-provider-scaleway-applesilicon): go build ./..., go vet ./..., gofmt -l . (clean), go test ./... all pass.
  • New bootstrap tests: hash is stable for the same config, independent of per-host fields (incl. VMCachePNVLAN, DisableVMGC), changes on a fleet-config field (CIDR/tags/accept-routes) and on any embedded-binary change.
  • New controller test TestHostConfigDrift covers match / mismatch / the empty-machine-hash migration case / empty-operator-hash (never drifts).
  • CRD YAML re-parsed: status.properties.hostConfigHash present as a string alongside tartKubeletBinarySHA.
  • Verified the render* extraction is byte-faithful via git diff (only script := assignments and RunCommand* wrappers moved; no heredoc body changed).

🤖 Generated with Claude Code

Comments

No GitHub comments yet.