feat(infra): cluster-managed vm-image-builder fleet
GitHub issue · Closed
Summary
Move the bare-metal vm-image-builder fleet from one-off hand-bootstrapping into the same CAPI-managed lifecycle the rest of the macOS fleet already uses. Scale with kubectl scale machinedeployment (or helm upgrade --set buildersFleet.replicas=N) instead of ordering a Mac mini + running a bespoke command per host.
Builder hosts register as regular Kubernetes Nodes via tart-kubelet — identical to the existing macosFleet (xcresult-processor) and runnersFleet (customer-runner) hosts — but no Pod selects on tuist.dev/fleet=<buildersFleet>, so tart-kubelet stays idle. On top of the standard Node bootstrap, the reconciler installs an additional LaunchAgent: a GitHub Actions self-hosted runner agent that picks up image-bake workflow jobs and runs packer build directly against the host’s own Tart daemon.
Why a builder host can’t be a Pod: Apple Silicon Virtualization.framework refuses to nest macOS guests inside macOS guests, so Packer’s tart run fails from inside a macOS Pod-VM. The agent therefore has to live on bare-metal Tart, with kubelet idle. The host can still be a Node, though, which gives us drift reconciliation, observability via kubectl get nodes, and the existing EnsureRealGUISession preflight that tart-kubelet runs on every VM start, all for free.
What changed
- ScalewayAppleSiliconMachineSpec.GHActionsRunner: new optional sub-spec on the existing CR. Carries the runner org / labels / version + the name of a K8s Secret holding GitHub App credentials. Pure Node hosts leave the sub-spec nil and behave as before, so the change is additive with no breaking surface.
- internal/runner/ package + Resolver interface: the seam between the CR’s declared GHActionsRunner spec and a runtime-ready bootstrap.GHActionsRunnerConfig with a fresh registration token. The Scaleway machine reconciler depends only on the interface; the production-wired *GitHubAppResolver reads a three-field K8s Secret (app-id, installation-id, PEM private-key) and exchanges those long-lived credentials for a short-lived registration token via the public GitHub API (JWT → installation token → registration token). PAT / GitLab / Forgejo equivalents are new structs implementing the same interface, with no edits to the Machine controller.
- internal/githubapp/ package: stdlib-only RS256 JWT signing + the two-step GitHub API exchange. Accepts both PKCS#1 and PKCS#8 PEM private keys so operators don’t have to know which format their tooling produced.
- infra/macos-host-bootstrap/actions_runner.go: runActionsRunnerInstall composes: Homebrew install of Packer (hashicorp/tap since the BSL relicense) and crane (for GHCR auth before tart push), Xcode verify, /etc/zshenv with TUIST_MIX_BUILD_ROOT=/opt/tuist-build-cache, ./config.sh registration with the resolver-minted token (fed via stdin so a config.sh failure can’t leak it through K8s events), and launchctl bootstrap gui/<UID> to load the LaunchAgent. Tart is installed by the upstream Node bootstrap path from the operator-image-baked tart.app tarball (the same install path the other two fleets use), not via Homebrew. Packer is brew pin’d so subsequent brew upgrade calls are no-ops (macOS Tahoe’s Local Network access TCC grant is keyed on binary code-signature, see Tahoe gotchas below).
- Scaleway IAM iam:ListSSHKeys / iam:CreateSSHKey calls now pass ProjectID: Scaleway tightened enforcement on project-scoped SSHKeysFullAccess policies; the SDK’s fallback to org-wide listing without an explicit ProjectID filter started returning insufficient permissions: list ssh_key. Surfaced when adding a new fleet to an existing cluster.
- LaunchAgent load via launchctl bootstrap gui/<UID>: replaces the upstream actions-runner svc.sh start, which uses the legacy launchctl load API. On macOS Tahoe over SSH the legacy API can’t reach the user’s Aqua session and dies with Failed: failed to load …. The gui/<UID> domain targets the Aqua session explicitly.
- Reclaim Tart disk workflow step filters on tart list … Source=local instead of hardcoding VM names: every locally-cloned VM gets cleaned regardless of which workflow created it, so future image-bake workflows added on the same host don’t need this step touched.
- tart login ghcr.io step removed from runner-image.yml, xcresult-processor-image.yml, and both occurrences in release.yml. tart login writes to the macOS keychain, which on Tahoe-host LaunchAgents fails with User interaction is not allowed. The Push step uses TART_REGISTRY_USERNAME / TART_REGISTRY_PASSWORD env-vars (which bypass keychain entirely), so the explicit login was redundant.
- brew upgrade tart step dropped from the same three workflows. Tart upgrades are now an explicit operator action (re-baking the operator image) so the Local Network TCC grant stays stable across workflow dispatches. Tart is installed once at bootstrap time from the operator-baked tarball.
- infra/helm/tuist/templates/builders-fleet{,-external-secrets}.yaml: peer of runners-fleet.yaml and macos-fleet.yaml. ScalewayAppleSiliconMachineTemplate + MachineDeployment + ExternalSecret syncing the three GitHub App credential fields from one 1Password item (BUILDERS_FLEET_GITHUB_APP) into a K8s Secret. Shares the tuist-pool- adoption prefix with the other two fleets so the existing pre-ordered pool absorbs builder demand.
- infra/helm/tuist/values.yaml: buildersFleet block (enabled: false by default; flipped on per env in values-managed-{staging,canary,production}.yaml).
- Chart-default macOS image bumped to macos-tahoe-26.3 across all three fleet templates + the matching CR +kubebuilder:default so a fresh helm install with no env overrides produces a single-OS fleet shape.
- infra/vm-image-builder.md: operator runbook rewritten around helm upgrade + kubectl scale. Documents the one-time per-env GitHub App creation, the three-field 1Password stash, the tuist-pool- pre-order workflow, the Tahoe Local Network access permission step, the Packer pinning rationale, the manual seeding required when adding a new fleet to a cluster with pre-existing pool hosts, and the migration off any hand-bootstrapped hosts.
- infra/{runner-image,xcresult-processor-image}/*.pkr.hcl: ssh_timeout bumped 120s → 15m to absorb cold-clone first-boot, and packer now installs from hashicorp/tap since core Homebrew dropped it.
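The stdlib-only signing step described for internal/githubapp can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the names parseRSAPrivateKey, signAppJWT, verifyAppJWT and roundTrip are hypothetical, the app ID "12345" is a placeholder, and the claim layout (iss, backdated iat, exp under ten minutes) assumes GitHub's documented App JWT format. The two API calls that follow the JWT (installation token, then registration token) are left out.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"crypto/x509"
	"encoding/base64"
	"encoding/json"
	"encoding/pem"
	"fmt"
	"strings"
	"time"
)

// parseRSAPrivateKey accepts a PEM-encoded RSA key, trying PKCS#1 first and
// falling back to PKCS#8.
func parseRSAPrivateKey(pemBytes []byte) (*rsa.PrivateKey, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return nil, fmt.Errorf("no PEM block found")
	}
	if key, err := x509.ParsePKCS1PrivateKey(block.Bytes); err == nil {
		return key, nil
	}
	parsed, err := x509.ParsePKCS8PrivateKey(block.Bytes)
	if key, ok := parsed.(*rsa.PrivateKey); err == nil && ok {
		return key, nil
	}
	return nil, fmt.Errorf("not an RSA key in PKCS#1 or PKCS#8 form (PKCS#8 parse error: %v)", err)
}

// signAppJWT returns a compact RS256 JWT whose issuer is the GitHub App ID.
func signAppJWT(appID string, key *rsa.PrivateKey, now time.Time) (string, error) {
	iat := now.Add(-time.Minute).Unix()    // backdated to tolerate clock skew
	exp := now.Add(9 * time.Minute).Unix() // stays under GitHub's 10-minute cap
	claims, err := json.Marshal(map[string]any{"iss": appID, "iat": iat, "exp": exp})
	if err != nil {
		return "", err
	}
	enc := base64.RawURLEncoding
	signingInput := enc.EncodeToString([]byte(`{"alg":"RS256","typ":"JWT"}`)) + "." + enc.EncodeToString(claims)
	digest := sha256.Sum256([]byte(signingInput))
	sig, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
	if err != nil {
		return "", err
	}
	return signingInput + "." + enc.EncodeToString(sig), nil
}

// verifyAppJWT checks the token's RS256 signature against the public key.
func verifyAppJWT(token string, pub *rsa.PublicKey) error {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return fmt.Errorf("malformed JWT")
	}
	sig, err := base64.RawURLEncoding.DecodeString(token[i+1:])
	if err != nil {
		return err
	}
	digest := sha256.Sum256([]byte(token[:i]))
	return rsa.VerifyPKCS1v15(pub, crypto.SHA256, digest[:], sig)
}

// roundTrip encodes one throwaway key in both PEM forms, signs a JWT with
// each parsed copy, and verifies every token with the public key.
func roundTrip() error {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return err
	}
	pkcs8DER, err := x509.MarshalPKCS8PrivateKey(key)
	if err != nil {
		return err
	}
	blocks := []*pem.Block{
		{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)},
		{Type: "PRIVATE KEY", Bytes: pkcs8DER},
	}
	for _, b := range blocks {
		parsed, err := parseRSAPrivateKey(pem.EncodeToMemory(b))
		if err != nil {
			return err
		}
		token, err := signAppJWT("12345", parsed, time.Now())
		if err != nil {
			return err
		}
		if err := verifyAppJWT(token, &key.PublicKey); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println("round trip error:", roundTrip())
}
```

Trying PKCS#1 before PKCS#8 keeps the caller free of any format flag, which matches the stated goal that operators need not know which format their tooling produced.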
What got removed
- infra/vm-image-builder-bootstrap/: the standalone Go CLI we started with. Logic moved into macos-host-bootstrap/actions_runner.go and is invoked by the CAPI reconciler.
- mise/tasks/vm-image-builder/bootstrap.sh: the mise run wrapper.
macOS Tahoe gotchas baked into this PR
Discovered while validating end-to-end on -01 (the existing hand-bootstrapped host) and on the first cluster-managed staging builder, all of which apply to every new host too:
- GHCR keychain wall. tart login requires a keychain UI prompt that LaunchAgents can’t surface. Worked around by removing the redundant login step (env-var auth on Push is sufficient).
- Local Network access TCC grant. Tahoe gates 192.168.64.0/22 (Tart’s vmnet subnet) behind a per-binary-signature “wants to access devices on your local network” permission. SIP-protected, can’t be pre-granted programmatically. Operator must VNC into each new host once and click Allow, as documented in the runbook.
- Local Network grant invalidated by binary upgrades. Any brew upgrade packer replaces the binary with a new code-signature → revokes the grant → next build hangs on Waiting for SSH... until someone clicks Allow again. PR pins Packer in the bootstrap and drops the implicit-upgrade steps so the grant stays stable; Tart upgrades go through operator-image bumps.
- LaunchAgent load over SSH. The upstream svc.sh start uses legacy launchctl load, which can’t reach the user’s Aqua session over SSH. Replaced with launchctl bootstrap gui/<UID> targeting the GUI domain explicitly.
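As a rough illustration of the last item, the following sketch builds the launchctl bootstrap invocation for a user's GUI domain. The function names, the UID 501 and the plist path are hypothetical and used only for illustration; the bootstrap code in the PR may compose the command differently.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// bootstrapArgs returns the launchctl arguments that load a LaunchAgent
// plist into the per-user GUI (Aqua) domain.
func bootstrapArgs(uid int, plistPath string) []string {
	return []string{"bootstrap", "gui/" + strconv.Itoa(uid), plistPath}
}

// loadLaunchAgent runs launchctl and folds its combined output into the
// returned error so a failed load is diagnosable from the caller's logs.
func loadLaunchAgent(uid int, plistPath string) error {
	out, err := exec.Command("launchctl", bootstrapArgs(uid, plistPath)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("launchctl bootstrap gui/%d: %w: %s", uid, err, out)
	}
	return nil
}

func main() {
	// Hypothetical UID and plist path; this only prints the arguments and
	// does not invoke launchctl.
	fmt.Println(bootstrapArgs(501, "/Users/runner/Library/LaunchAgents/actions.runner.example.plist"))
}
```

Keeping argument construction separate from execution makes the domain target testable on machines that have no launchctl.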
Future work: MDM-managed TCC profiles
The Local Network Allow click is the only remaining manual step in onboarding a new host. The official Apple-supported path to eliminate it is MDM Configuration Profiles (PPPC profiles) deployed via an Apple Business Manager-enrolled MDM (Jamf, Kandji, Mosyle, etc.). A signed .mobileconfig profile pre-grants TCC permissions by binary path or code-signature requirement, so a freshly-provisioned host comes up with packer already authorised. Requires enrolling Scaleway Mac minis in ABM and paying for an MDM seat per host. Worth revisiting once the fleet grows past a handful of hosts; for now the one-time Allow click per host is the pragmatic shape.
Operator workflow after this lands
One-time per env (staging / canary / production):
1. Create a “Tuist Builders Fleet [Env]” GitHub App on the tuist org with Organization → Self-hosted runners: Read and write. Install on tuist org, note App ID + installation ID, generate + download a private key.
2. Stash all three in 1Password: op://tuist-k8s-<env>/BUILDERS_FLEET_GITHUB_APP/{app-id,installation-id,private-key}. Never rotated thereafter.
3. Pre-order Mac minis on Scaleway with the tuist-pool- prefix (shared pool with the other two fleets); bump capacity by 1 to absorb the new fleet’s demand.
Per scale-up:
4. kubectl scale machinedeployment <release>-builders-fleet --replicas=N or edit replicas in the env values file.
5. VNC into each new host once and click Allow on the Local Network prompt (runbook calls this out). The runner agent registers with GitHub immediately; the first image-bake workflow dispatched at it surfaces the prompt.
Test plan
- All six infra Go modules build + test clean
- helm template and helm lint clean across permutations: default off, ESO-driven, self-host-managed Secret, production-shape with adoption
- CR types + CRD manifests regenerated for the renamed GHAppSecretName field; Resolver interface fronts the GitHub-specific resolution
- internal/githubapp unit-tested (happy path, error propagation, missing-credential validation, JWT roundtrip with public-key verification, PKCS#8 PEM acceptance)
- internal/runner unit-tested (nil-spec short-circuit, happy path, whitespace trimming on numeric fields, missing-field error messages, minter-error propagation)
- Tahoe gotchas surfaced + fixed: GHCR keychain bypass, packer pinned, Local Network Allow documented, LaunchAgent loaded via gui/<UID>
- End-to-end credential exchange validated against real GitHub for staging + canary + production Apps (go test -tags e2e -run TestE2E_MintAgainstRealGitHub …)
- First cluster-managed staging builder host through the reconciler: tuist-pool-01 adopted → bootstrap completed → runner tuist-tuist-builders-fleet-bs9kw-h4np6 (id 54982) online + accepting jobs (smoke workflow dispatched against it, 5s pickup, all tooling present)
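The resolver behaviours listed for internal/runner (nil-spec short-circuit, whitespace trimming on numeric fields, missing-field errors, minter-error propagation) might look roughly like the sketch below. All types here (RunnerSpec, RunnerConfig, Minter, Resolver, fakeMinter) are simplified, hypothetical stand-ins rather than the PR's actual definitions, and a plain map stands in for the K8s Secret.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// RunnerSpec and RunnerConfig are hypothetical, simplified stand-ins for the
// CR sub-spec and the bootstrap config.
type RunnerSpec struct {
	Org    string
	Labels []string
}

type RunnerConfig struct {
	Org    string
	Labels []string
	Token  string
}

// Minter exchanges long-lived App credentials for a registration token.
type Minter interface {
	Mint(appID, installationID int64, privateKeyPEM []byte) (string, error)
}

// Resolver turns a declared spec into a runtime-ready config.
type Resolver struct {
	Secret map[string]string // stands in for the K8s Secret's decoded data
	Minter Minter
}

func (r Resolver) Resolve(spec *RunnerSpec) (*RunnerConfig, error) {
	if spec == nil {
		return nil, nil // pure Node host: nothing to resolve
	}
	appID, err := numericField(r.Secret, "app-id")
	if err != nil {
		return nil, err
	}
	installationID, err := numericField(r.Secret, "installation-id")
	if err != nil {
		return nil, err
	}
	key := strings.TrimSpace(r.Secret["private-key"])
	if key == "" {
		return nil, fmt.Errorf("secret field %q is missing or empty", "private-key")
	}
	token, err := r.Minter.Mint(appID, installationID, []byte(key))
	if err != nil {
		return nil, fmt.Errorf("mint registration token: %w", err)
	}
	return &RunnerConfig{Org: spec.Org, Labels: spec.Labels, Token: token}, nil
}

// numericField trims surrounding whitespace (for example a trailing newline
// from a pasted secret value) before parsing the field as an integer.
func numericField(secret map[string]string, name string) (int64, error) {
	raw := strings.TrimSpace(secret[name])
	if raw == "" {
		return 0, fmt.Errorf("secret field %q is missing or empty", name)
	}
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("secret field %q is not numeric: %w", name, err)
	}
	return n, nil
}

// fakeMinter is a test double that returns a canned token or error.
type fakeMinter struct {
	token string
	err   error
}

func (f fakeMinter) Mint(int64, int64, []byte) (string, error) { return f.token, f.err }

func main() {
	r := Resolver{
		Secret: map[string]string{"app-id": " 42\n", "installation-id": "7", "private-key": "pem"},
		Minter: fakeMinter{token: "example-token"},
	}
	cfg, err := r.Resolve(&RunnerSpec{Org: "tuist", Labels: []string{"builder"}})
	fmt.Println(cfg, err)
}
```

Because the reconciler would depend only on an interface like this, a PAT-backed or non-GitHub resolver could be swapped in by providing a different Minter, without touching the caller.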
🤖 Generated with Claude Code