feat(infra): wire CAPI core’s workload connection for the Mac-mini fleets

GitHub issue · Closed

Metadata
Source: tuist/tuist #10981
Updated: Jun 24, 2026
Domains: Compute

Details

What changed

The in-cluster CAPI that manages the Mac-mini fleets (macos / runners / builders) as Machine/MachineDeployment objects had several gaps that made every server deploy time out and roll back. This PR closes them.

1. CAPI workload connection (the wire-up)

Why a self-managed CAPI needs a kubeconfig to its own cluster. The Mac minis are real Nodes of this app cluster (the #10499 design — so macOS work like xcresult-processor and runners schedules as ordinary Deployments alongside everything else, rather than through a central fleet manager that would be a SPOF), and the CAPI that manages them runs inside this same cluster. That placement is deliberate, not incidental: the provider mints each mini’s join-credentials — API URL, CA, a bootstrap.kubernetes.io/token — from its own in-cluster ServiceAccount context, so a mini joins the cluster the provider runs in with no stored, cross-cluster credentials (see the provider’s AGENTS.md: “No 1Password entry, no manual rotation”). Running it from the Hetzner management cluster instead would either land the minis in mgmt (wrong cluster) or require injecting the app cluster’s API/CA and creating bootstrap tokens cross-cluster — exactly the plumbing this design avoids. The mgmt cluster is scoped to caph provisioning the Hetzner clusters themselves, not to bolting node pools onto already-running ones.

The residual cost of self-managing is that CAPI core’s generic cluster-cache still insists on connecting via a <cluster>-kubeconfig Secret — even when “the workload cluster” is itself. That Secret was missing, so CAPI fell back to the stub controlPlaneEndpoint (127.0.0.1) in capi-cluster.yaml and could never reach the cluster to watch Nodes / set Machine.status.nodeRef. This PR supplies it — the one part of the design that needs a stored credential (the bootstrap path above deliberately needs none):

  • capi-remote SA + scoped ClusterRole (nodes get/list/watch/patch/delete; pods get/list; pods/eviction create for drain; nonResourceURLs: ["/"] get for CAPI’s cluster-cache health probe) + a non-expiring SA-token Secret — capi-remote-rbac.yaml.
  • An ExternalSecret syncing the kubeconfig from 1Password into <release>-capi-kubeconfig (capi-remote-kubeconfig-external-secret.yaml). It’s a pre-upgrade hook (lands before helm --wait evaluates the fleet MDs) with creationPolicy: Orphan, because the hook’s before-hook-creation delete policy removes the prior ExternalSecret on every upgrade, and a Secret created under creationPolicy: Owner would be garbage-collected with it (its ownerReference points at the ExternalSecret), disconnecting CAPI mid-deploy. Orphan keeps the Secret unowned; together with deletionPolicy: Retain, the synced kubeconfig survives hook churn and an --atomic rollback.
  • Enabled for managed envs in values-managed-common.yaml; off by default so self-hosters without the 1P item are unaffected.
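
For orientation, the kubeconfig stored in 1Password is an ordinary token-based kubeconfig that points at the app cluster's own API server. A minimal Go sketch of its shape follows; the function name `renderKubeconfig`, the cluster/user/context names, and the example URL are illustrative assumptions, not taken from the chart or from onboarding.md.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// renderKubeconfig builds a minimal token-based kubeconfig.
// Hypothetical helper: the entry names ("workload", "capi-remote") are
// placeholders chosen for this sketch, not the names the chart uses.
func renderKubeconfig(server string, caPEM []byte, token string) string {
	// kubeconfig expects the CA bundle base64-encoded in certificate-authority-data.
	ca := base64.StdEncoding.EncodeToString(caPEM)
	return fmt.Sprintf(`apiVersion: v1
kind: Config
clusters:
- name: workload
  cluster:
    server: %s
    certificate-authority-data: %s
users:
- name: capi-remote
  user:
    token: %s
contexts:
- name: capi-remote@workload
  context:
    cluster: workload
    user: capi-remote
current-context: capi-remote@workload
`, server, ca, token)
}

func main() {
	// Placeholder values only; the real token comes from the capi-remote SA-token Secret.
	fmt.Print(renderKubeconfig("https://api.example.invalid:6443", []byte("<CA PEM>"), "<SA token>"))
}
```

Because the token is only known at runtime, a document like this cannot be produced by Helm templating, which is why it is synced from 1Password instead.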

2. tart-kubelet reload robustness

The CAPI provider’s drift-loop reload ran bootout+bootstrap unconditionally and returned success on shell exit. On a headless Mac that pair races and re-registers the plist with Background Task Management, whose “legacy daemon” notification cap then stops honouring KeepAlive’s automatic respawn — so a clean exit never restarts and the Node goes NotReady. Because the reconciler recorded the SHA roll as done regardless, it never retried and the Node stayed stranded (observed flapping canary’s builders/runners fleets and blocking the deploy’s helm --wait).

loadTartKubeletLaunchd now:

  • Restarts in place with kickstart -k when the rendered plist is byte-identical to what’s installed (the common binary-only roll) — no BTM re-registration churn; only rewrites + bootstraps on an actual arg change.
  • Requires the new PID to be stable (same pid a few seconds later), not merely to appear, with a kickstart fallback, and exits non-zero otherwise so the reconciler keeps the drift set and retries instead of recording a roll that never took. (A no-op kickstart leaves the old process; a code-signing crash-loop briefly shows a transient pid on each respawn — the stability check rejects both.)

And installTartKubelet now runs codesign --force --sign - on the binary right after upload. Overwriting /usr/local/bin/tart-kubelet in place (same inode) leaves macOS’s AMFI validating the new pages against the previous binary’s cached cdhash, so the kernel kills the new process with OS_REASON_CODESIGNING and the Node goes NotReady, even though the Go linker’s ad-hoc signature is valid. A forced re-sign refreshes the signature and invalidates the stale cache so the rolled binary runs. (Surfaced by the canary providerID smoke-test, where the roll codesigning-killed 2 of 3 minis.)
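
The re-sign step is an ad-hoc signature (identity `-`) forced over the existing one. A small Go sketch of the argument list follows; `resignArgs` is a hypothetical helper, and how the provider actually executes the command on the mini is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// resignArgs returns the argv for an ad-hoc forced re-sign of the binary at path.
// "--sign -" selects ad-hoc signing; "--force" replaces any existing signature.
// Hypothetical helper for illustration; the transport to the mini is out of scope.
func resignArgs(path string) []string {
	return []string{"codesign", "--force", "--sign", "-", path}
}

func main() {
	fmt.Println(strings.Join(resignArgs("/usr/local/bin/tart-kubelet"), " "))
}
```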

3. Node providerID — CAPI Node↔Machine binding

CAPI core binds a Machine to its Node by matching Node.spec.providerID == Machine.spec.providerID. tart-kubelet never set Node.spec.providerID, so every fleet node had to be patched by hand (kubectl patch node … providerID) before its MachineDeployment would report available — and any re-created/re-rolled node silently broke binding again. Threaded end to end:

  • The provider already computes scw-applesilicon://<zone>/<id> and stores it on Machine.spec.providerID; it’s now passed into bootstrap.Config.ProviderID for both first bootstrap (Run) and the drift roll (UpdateTartKubelet).
  • renderLaunchdPlist emits a --provider-id=<id> flag when set.
  • tart-kubelet’s nodeagent sets Node.spec.providerID at registration and, for nodes that registered before this change, patches it when empty (spec.providerID is immutable once set, so it’s never overwritten; the status-only heartbeat path can’t write spec, so the patch happens in ensureNode).

This removes the manual-patch step from onboarding, and existing empty-providerID nodes (e.g. production’s 11) get patched on the deploy’s kubelet re-roll — so the cascade binds the fleet without a manual N-node patch.
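
The set-on-create / patch-empty / never-overwrite rule is small enough to sketch in Go. The helper names (`providerID`, `providerIDArgs`, `nextProviderID`) and the example zone and server ID are assumptions for illustration; the real logic lives in renderLaunchdPlist and ensureNode and works against Kubernetes Node objects rather than bare strings.

```go
package main

import "fmt"

// providerID formats the ID in the scheme described above:
// scw-applesilicon://<zone>/<id>.
func providerID(zone, serverID string) string {
	return fmt.Sprintf("scw-applesilicon://%s/%s", zone, serverID)
}

// providerIDArgs returns the extra kubelet argument, or nothing when no
// provider ID is configured (so older configs render an unchanged plist).
func providerIDArgs(id string) []string {
	if id == "" {
		return nil
	}
	return []string{"--provider-id=" + id}
}

// nextProviderID decides whether Node.spec.providerID needs a write.
// It only ever fills an empty value: spec.providerID is immutable once set,
// so an existing value is returned untouched.
func nextProviderID(current, desired string) (value string, write bool) {
	if desired == "" || current != "" {
		return current, false
	}
	return desired, true
}

func main() {
	id := providerID("zone-1", "1234") // placeholder zone and server ID
	fmt.Println(providerIDArgs(id))

	// A node that registered before this change has an empty providerID.
	v, write := nextProviderID("", id)
	fmt.Println(v, write)

	// A node patched by hand earlier keeps its value.
	v, write = nextProviderID("scw-applesilicon://zone-1/9999", id)
	fmt.Println(v, write)
}
```

Keeping the decision a pure function of the current and desired values makes the three cases easy to cover in unit tests without a cluster.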

Why (root cause of the canary deploy failures)

Every fleet MachineDeployment was stuck 0 available because CAPI couldn’t bind Machines to Nodes (no workload connection, and no Node providerID to match), so helm --atomic --wait timed out at the 30m mark and rolled back — reverting good workloads (e.g. runners-controller 0.4.0 → 0.3.0). CAPI fleet management was only added to the chart ~3 weeks ago (#10499, #10653) and the kubeconfig + providerID wiring was never completed. The reload fragility then turned each provider-image change into a fleet-wide flap that re-failed deploys.

We chose to complete the CAPI wiring (rather than de-gate helm --wait from the fleets) so MD “available” status becomes a real fleet-health signal the deploy can legitimately gate on.

Validation

  • Full green gated canary deploy from this branch: run 26589479994, helm rev 185 deployed (“Upgrade complete”); all three fleet MDs READY 1 / UNAVAILABLE 0 / Running; server / processor / xcresult-processor ready. The provider image reverting to the chart-pinned tag caused no drift roll (its baked kubelet SHA matches what the minis run), so the fleet stayed stable through the deploy.
  • providerID + codesign validated by an out-of-band provider roll on canary: rolling a provider image built from this branch re-rolled all three minis to the new kubelet, the --provider-id=scw-applesilicon://… flag rendered and was honored, and every node’s providerID stayed unchanged (no-overwrite). The roll initially codesigning-killed 2 of 3 minis — which surfaced the installTartKubelet re-sign fix; a forced re-roll through the fixed path recovered all three (including a stranded macos-fleet, with no manual SSH), fail=none, providerIDs intact, all MDs READY 1.
  • The reload fix validated live on the canary runners-fleet mini: the unchanged-plist path takes kickstart -k, restarts the daemon, and the stable-PID check correctly distinguishes a real restart from a no-op.
  • go build / go vet / go test pass across tart-kubelet, macos-host-bootstrap, and cluster-api-provider-scaleway-applesilicon. New unit tests cover the --provider-id plist rendering and the ensureNode set-on-create / patch-empty / never-overwrite behavior.
  • helm template renders the RBAC resources + the ExternalSecret (creationPolicy Orphan, deletionPolicy Retain, pre-upgrade hook) correctly and is gated off by default.

One-time per cluster (not automatable — needs a runtime token)

The kubeconfig embeds the capi-remote SA token, which Helm can’t template, so it’s stored in 1Password like MASTER_KEY and friends. Steps are in onboarding.md §5b. Populated for staging / canary / production.

Follow-ups (separate)

  • --atomic rollback trap. Rolling back to a pre-adoptPoolPrefix revision is rejected by CRD validation, so a failed deploy can wedge the release in pending-rollback. Tracked separately (drop --atomic, or prune the pre-adoptPoolPrefix revisions).
  • Deletion-time mini leak. Machine deletion can strand the Scaleway mini (server left running while the k8s object hangs on its finalizer). The fix is to return minis to the tuist-pool- pool on deletion rather than orphaning them. (Production’s stranded macos mini was returned to the pool by hand this session; staging’s stranded runners mini and an unnamed orphan are still pending.)

Test plan

  • Populate capi-workload-kubeconfig in the canary 1P vault (per onboarding §5b).
  • Deploy to canary; CAPI binds the fleet Machines (Running with nodeRef) and the MachineDeployments go available; helm --wait passes (rev 185 deployed).
  • Populate the 1P item for staging + production.
  • Build a provider image from this branch and roll it on canary; confirm fleet nodes get spec.providerID set automatically and an already-patched node is left unchanged.
  • Confirm helm --wait passes on production and the cascade completes green (the kubelet re-roll patches production’s 11 empty-providerID nodes).