feat(infra): wire CAPI core’s workload connection for the Mac-mini fleets
GitHub issue · Closed
What changed
The in-cluster CAPI that manages the Mac-mini fleets (macos / runners / builders) as Machine/MachineDeployment objects had several gaps that made every server deploy time out and roll back. This PR closes them.
1. CAPI workload connection (the wire-up)
Why a self-managed CAPI needs a kubeconfig to its own cluster. The Mac minis are real Nodes of this app cluster (the #10499 design — so macOS work like xcresult-processor and runners schedules as ordinary Deployments alongside everything else, rather than through a central fleet manager that would be a SPOF), and the CAPI that manages them runs inside this same cluster. That placement is deliberate, not incidental: the provider mints each mini’s join-credentials — API URL, CA, a bootstrap.kubernetes.io/token — from its own in-cluster ServiceAccount context, so a mini joins the cluster the provider runs in with no stored, cross-cluster credentials (see the provider’s AGENTS.md: “No 1Password entry, no manual rotation”). Running it from the Hetzner management cluster instead would either land the minis in mgmt (wrong cluster) or require injecting the app cluster’s API/CA and creating bootstrap tokens cross-cluster — exactly the plumbing this design avoids. The mgmt cluster is scoped to caph provisioning the Hetzner clusters themselves, not to bolting node pools onto already-running ones.
The residual cost of self-managing is that CAPI core’s generic cluster-cache still insists on connecting via a <cluster>-kubeconfig Secret — even when “the workload cluster” is itself. That Secret was missing, so CAPI fell back to the stub controlPlaneEndpoint (127.0.0.1) in capi-cluster.yaml and could never reach the cluster to watch Nodes / set Machine.status.nodeRef. This PR supplies it — the one part of the design that needs a stored credential (the bootstrap path above deliberately needs none):
- `capi-remote` SA + scoped ClusterRole (`nodes` get/list/watch/patch/delete; `pods` get/list; `pods/eviction` create for drain; `nonResourceURLs: ["/"]` get for CAPI’s cluster-cache health probe) + a non-expiring SA-token Secret, in `capi-remote-rbac.yaml`.
- An ExternalSecret syncing the kubeconfig from 1Password into `<release>-capi-kubeconfig`, in `capi-remote-kubeconfig-external-secret.yaml`. It’s a `pre-upgrade` hook (lands before `helm --wait` evaluates the fleet MDs) with `creationPolicy: Orphan`: the hook’s `before-hook-creation` policy deletes the prior ExternalSecret on every upgrade, and an Owner-created Secret would be garbage-collected with it (its ownerReference points at the ExternalSecret), disconnecting CAPI mid-deploy. Orphan keeps the Secret unowned; together with `deletionPolicy: Retain` the synced kubeconfig survives hook churn and an `--atomic` rollback.
- Enabled for managed envs in `values-managed-common.yaml`; off by default so self-hosters without the 1P item are unaffected.
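To make the scope of the `capi-remote` grants concrete, here is a minimal Go sketch that encodes the rules listed above and checks a few requests against them. The `rule` type and `allows` helper are hypothetical and much simpler than real Kubernetes RBAC evaluation (no apiGroups, no wildcards); the authoritative definition is the manifest in `capi-remote-rbac.yaml`.

```go
package main

import "fmt"

// rule is a simplified, hypothetical stand-in for an RBAC PolicyRule.
type rule struct {
	resources       []string // e.g. "nodes", "pods/eviction"
	nonResourceURLs []string // e.g. "/"
	verbs           []string
}

// capiRemoteRules mirrors the grants listed in the PR description.
var capiRemoteRules = []rule{
	{resources: []string{"nodes"}, verbs: []string{"get", "list", "watch", "patch", "delete"}},
	{resources: []string{"pods"}, verbs: []string{"get", "list"}},
	{resources: []string{"pods/eviction"}, verbs: []string{"create"}},
	{nonResourceURLs: []string{"/"}, verbs: []string{"get"}},
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// allows reports whether any rule grants verb on target, where target is
// either a resource name or a non-resource URL.
func allows(rules []rule, verb, target string) bool {
	for _, r := range rules {
		if !contains(r.verbs, verb) {
			continue
		}
		if contains(r.resources, target) || contains(r.nonResourceURLs, target) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(allows(capiRemoteRules, "patch", "nodes"))  // needed to manage Nodes
	fmt.Println(allows(capiRemoteRules, "delete", "pods")) // drain goes through pods/eviction instead
}
```

The point of the sketch is what is absent: nothing grants access to Secrets or pod deletion, so the stored credential covers only what CAPI needs to watch and drain Nodes.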
2. tart-kubelet reload robustness
The CAPI provider’s drift-loop reload ran bootout+bootstrap unconditionally and returned success on shell exit. On a headless Mac that pair races and re-registers the plist with Background Task Management, whose “legacy daemon” notification cap then stops honouring KeepAlive’s automatic respawn — so a clean exit never restarts and the Node goes NotReady. Because the reconciler recorded the SHA roll as done regardless, it never retried and the Node stayed stranded (observed flapping canary’s builders/runners fleets and blocking the deploy’s helm --wait).
loadTartKubeletLaunchd now:
- Restarts in place with `kickstart -k` when the rendered plist is byte-identical to what’s installed (the common binary-only roll), which avoids BTM re-registration churn; only rewrites + bootstraps on an actual arg change.
- Requires the new PID to be stable (same pid a few seconds later), not merely to appear, with a kickstart fallback, and exits non-zero otherwise so the reconciler keeps the drift set and retries instead of recording a roll that never took. (A no-op kickstart leaves the old process; a code-signing crash-loop briefly shows a transient pid on each respawn. The stability check rejects both.)
And installTartKubelet now runs `codesign --force --sign -` on the binary right after upload. Overwriting /usr/local/bin/tart-kubelet in place (same inode) leaves macOS’s AMFI validating the new pages against the previous binary’s cached cdhash, so the kernel kills the new process with OS_REASON_CODESIGNING and the Node goes NotReady, even though the Go linker’s ad-hoc signature is valid. A forced re-sign refreshes the signature and invalidates the stale cache so the rolled binary runs. (Surfaced by the canary providerID smoke-test, in which 2 of 3 minis were codesigning-killed on the roll.)
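For reference, the re-sign step amounts to a single `codesign` invocation. The Go sketch below only builds the argument vector; `resignArgv` is a hypothetical helper, and how the provider actually runs the command on the mini is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// resignArgv builds an ad-hoc re-sign of the binary at path.
// "--sign -" requests an ad-hoc signature; "--force" replaces any existing one.
func resignArgv(path string) []string {
	return []string{"codesign", "--force", "--sign", "-", path}
}

func main() {
	fmt.Println(strings.Join(resignArgv("/usr/local/bin/tart-kubelet"), " "))
}
```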
3. Node providerID — CAPI Node↔Machine binding
CAPI core binds a Machine to its Node by matching Node.spec.providerID == Machine.spec.providerID. tart-kubelet never set Node.spec.providerID, so every fleet node had to be patched by hand (kubectl patch node … providerID) before its MachineDeployment would report available — and any re-created/re-rolled node silently broke binding again. Threaded end to end:
- The provider already computes `scw-applesilicon://<zone>/<id>` and stores it on `Machine.spec.providerID`; it’s now passed into `bootstrap.Config.ProviderID` for both first bootstrap (`Run`) and the drift roll (`UpdateTartKubelet`).
- `renderLaunchdPlist` emits a `--provider-id=<id>` flag when set.
- tart-kubelet’s `nodeagent` sets `Node.spec.providerID` at registration and, for nodes that registered before this change, patches it when empty (`spec.providerID` is immutable once set, so it’s never overwritten; the status-only heartbeat path can’t write spec, so the patch happens in `ensureNode`).
This removes the manual-patch step from onboarding, and existing empty-providerID nodes (e.g. production’s 11) get patched on the deploy’s kubelet re-roll — so the cascade binds the fleet without a manual N-node patch.
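The providerID flow described in this section can be summarised with a small Go sketch. All function names here are hypothetical, the zone and server id in `main` are placeholders, and the real logic lives in the provider’s `renderLaunchdPlist` and tart-kubelet’s `ensureNode`; the sketch only captures the set-on-create / patch-empty / never-overwrite rule and the equality match CAPI core relies on.

```go
package main

import "fmt"

// providerID formats the scheme quoted above: scw-applesilicon://<zone>/<id>.
func providerID(zone, id string) string {
	return fmt.Sprintf("scw-applesilicon://%s/%s", zone, id)
}

// providerIDArgs returns the extra kubelet argument for the launchd plist,
// or nothing when no providerID is configured.
func providerIDArgs(id string) []string {
	if id == "" {
		return nil
	}
	return []string{"--provider-id=" + id}
}

// reconcileProviderID returns the value the Node should end up with and
// whether a patch is needed. An existing value always wins.
func reconcileProviderID(current, desired string) (result string, patch bool) {
	if current != "" || desired == "" {
		return current, false
	}
	return desired, true
}

// bound mirrors the match used to pair a Machine with its Node.
func bound(nodeProviderID, machineProviderID string) bool {
	return nodeProviderID != "" && nodeProviderID == machineProviderID
}

func main() {
	id := providerID("zone-a", "server-1") // placeholder zone and server id

	fmt.Println(providerIDArgs(id))
	fmt.Println(reconcileProviderID("", id))                                // empty Node: patch it
	fmt.Println(reconcileProviderID("scw-applesilicon://zone-a/other", id)) // already set: leave it
	fmt.Println(bound(id, id))
}
```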
Why (root cause of the canary deploy failures)
Every fleet MachineDeployment was stuck 0 available because CAPI couldn’t bind Machines to Nodes (no workload connection, and no Node providerID to match), so helm --atomic --wait timed out at the 30m mark and rolled back — reverting good workloads (e.g. runners-controller 0.4.0 → 0.3.0). CAPI fleet management was only added to the chart ~3 weeks ago (#10499, #10653) and the kubeconfig + providerID wiring was never completed. The reload fragility then turned each provider-image change into a fleet-wide flap that re-failed deploys.
We chose to complete the CAPI wiring (rather than de-gate helm --wait from the fleets) so MD “available” status becomes a real fleet-health signal the deploy can legitimately gate on.
Validation
- Full green gated canary deploy from this branch: run `26589479994` → helm rev 185 `deployed` (“Upgrade complete”); all three fleet MDs `READY 1 / UNAVAILABLE 0 / Running`; server / processor / xcresult-processor ready. The provider image reverting to the chart-pinned tag caused no drift roll (its baked kubelet SHA matches what the minis run), so the fleet stayed stable through the deploy.
- providerID + codesign validated by an out-of-band provider roll on canary: rolling a provider image built from this branch re-rolled all three minis to the new kubelet, the `--provider-id=scw-applesilicon://…` flag rendered and was honored, and every node’s providerID stayed unchanged (no-overwrite). The roll initially codesigning-killed 2 of 3 minis, which surfaced the `installTartKubelet` re-sign fix; a forced re-roll through the fixed path recovered all three (including a stranded macos-fleet, with no manual SSH), `fail=none`, providerIDs intact, all MDs `READY 1`.
- The reload fix validated live on the canary runners-fleet mini: the unchanged-plist path takes `kickstart -k`, restarts the daemon, and the stable-PID check correctly distinguishes a real restart from a no-op.
- `go build` / `go vet` / `go test` pass across `tart-kubelet`, `macos-host-bootstrap`, and `cluster-api-provider-scaleway-applesilicon`. New unit tests cover the `--provider-id` plist rendering and the `ensureNode` set-on-create / patch-empty / never-overwrite behavior.
- `helm template` renders the RBAC resources + the ExternalSecret (creationPolicy Orphan, deletionPolicy Retain, pre-upgrade hook) correctly and is gated off by default.
One-time per cluster (not automatable — needs a runtime token)
The kubeconfig embeds the capi-remote SA token, which Helm can’t template, so it’s stored in 1Password like MASTER_KEY and friends. Steps are in onboarding.md §5b. Populated for staging / canary / production.
Follow-ups (separate)
- `--atomic` rollback trap. Rolling back to a pre-`adoptPoolPrefix` revision is rejected by CRD validation, so a failed deploy can wedge the release in `pending-rollback`. Tracked separately (drop `--atomic`, or prune the pre-`adoptPoolPrefix` revisions).
- Deletion-time mini leak. Machine deletion can strand the Scaleway mini (server left running while the k8s object hangs on its finalizer). The fix is to return minis to the `tuist-pool-` pool on deletion rather than orphaning them. (Production’s stranded macos mini was returned to the pool by hand this session; staging’s stranded runners mini and an unnamed orphan are still pending.)
Test plan
- Populate `capi-workload-kubeconfig` in the canary 1P vault (per onboarding §5b).
- Deploy to canary; CAPI binds the fleet Machines (`Running` with `nodeRef`) and the MachineDeployments go available; `helm --wait` passes (rev 185 deployed).
- Populate the 1P item for staging + production.
- Build a provider image from this branch and roll it on canary; confirm fleet nodes get `spec.providerID` set automatically and an already-patched node is left unchanged.
- Confirm `helm --wait` passes on production and the cascade completes green (the kubelet re-roll patches production’s 11 empty-providerID nodes).