feat(infra): cluster-managed vm-image-builder fleet

GitHub issue · Closed

Metadata
Source: tuist/tuist #10825
Updated: Jun 24, 2026
Domains: Compute

Summary

Move the bare-metal vm-image-builder fleet from one-off hand-bootstrapping into the same CAPI-managed lifecycle the rest of the macOS fleet already uses. Scale with kubectl scale machinedeployment (or helm upgrade --set buildersFleet.replicas=N) instead of ordering a Mac mini + running a bespoke command per host.

Builder hosts register as regular Kubernetes Nodes via tart-kubelet — identical to the existing macosFleet (xcresult-processor) and runnersFleet (customer-runner) hosts — but no Pod selects on tuist.dev/fleet=<buildersFleet>, so tart-kubelet stays idle. On top of the standard Node bootstrap, the reconciler installs an additional LaunchAgent: a GitHub Actions self-hosted runner agent that picks up image-bake workflow jobs and runs packer build directly against the host’s own Tart daemon.

Why a builder host can’t be a Pod: Apple Silicon Virtualization.framework refuses to nest macOS guests inside macOS guests, so Packer’s tart run from inside a macOS Pod-VM fails. The agent has to live on bare-metal Tart, with kubelet idle. The host can still be a Node, though, which gives us drift reconciliation, observability via kubectl get nodes, and the existing EnsureRealGUISession preflight that tart-kubelet runs on every VM start, all for free.

What changed

  • ScalewayAppleSiliconMachineSpec.GHActionsRunner — new optional sub-spec on the existing CR. Carries the runner org / labels / version + the name of a K8s Secret holding GitHub App credentials. Pure Node hosts leave the sub-spec nil and behave as before — additive change, no breaking surface.
  • internal/runner/ package + Resolver interface — the seam between the CR’s declared GHActionsRunner spec and a runtime-ready bootstrap.GHActionsRunnerConfig with a fresh registration token. The Scaleway machine reconciler depends only on the interface; the production-wired *GitHubAppResolver reads a three-field K8s Secret (app-id, installation-id, PEM private-key) and exchanges those long-lived credentials for a short-lived registration token via the public GitHub API (JWT → installation token → registration token). PAT / GitLab / Forgejo equivalents are new structs implementing the same interface, no edits to the Machine controller.
  • internal/githubapp/ package — stdlib-only RS256 JWT signing + the two-step GitHub API exchange. Accepts both PKCS#1 and PKCS#8 PEM private keys so operators don’t have to know which format their tooling produced.
  • infra/macos-host-bootstrap/actions_runner.go: runActionsRunnerInstall composes a Homebrew install of Packer (hashicorp/tap since the BSL relicense) and crane (for GHCR auth before tart push), Xcode verify, /etc/zshenv with TUIST_MIX_BUILD_ROOT=/opt/tuist-build-cache, ./config.sh registration with the resolver-minted token (fed via stdin so a config.sh failure can’t leak it through K8s events), and launchctl bootstrap gui/<UID> to load the LaunchAgent. Tart is not installed via Homebrew: the upstream Node bootstrap path installs it from the operator-image-baked tart.app tarball, the same install path the other two fleets use. Packer is brew pin’d so subsequent brew upgrade calls are no-ops (macOS Tahoe’s Local Network access TCC grant is keyed on binary code-signature, see Tahoe gotchas below).
  • Scaleway IAM iam:ListSSHKeys / iam:CreateSSHKey calls now pass ProjectID — Scaleway tightened enforcement on project-scoped SSHKeysFullAccess policies; the SDK’s fallback to org-wide listing without an explicit ProjectID filter started returning insufficient permissions: list ssh_key. Surfaced when adding a new fleet to an existing cluster.
  • LaunchAgent load via launchctl bootstrap gui/<UID> — replaces the upstream actions-runner svc.sh start which uses the legacy launchctl load API. On macOS Tahoe over SSH the legacy API can’t reach the user’s Aqua session and dies with Failed: failed to load …. The gui/<UID> domain targets the Aqua session explicitly.
  • Reclaim Tart disk workflow step filters on tart list … Source=local instead of hardcoding VM names — every locally-cloned VM gets cleaned regardless of which workflow created it, so future image-bake workflows added on the same host don’t need this step touched.
  • tart login ghcr.io step removed from runner-image.yml, xcresult-processor-image.yml, and both occurrences in release.yml. tart login writes to the macOS keychain, which on Tahoe-host LaunchAgents fails with User interaction is not allowed. The Push step uses TART_REGISTRY_USERNAME/TART_REGISTRY_PASSWORD env-vars (which bypass keychain entirely) — the explicit login was redundant.
  • brew upgrade tart step dropped from the same three workflows. Tart upgrades are now an explicit operator action (re-baking the operator image) so the Local Network TCC grant stays stable across workflow dispatches. Tart is installed once at bootstrap time from the operator-baked tarball.
  • infra/helm/tuist/templates/builders-fleet{,-external-secrets}.yaml — peer of runners-fleet.yaml and macos-fleet.yaml. ScalewayAppleSiliconMachineTemplate + MachineDeployment + ExternalSecret syncing the three GitHub App credential fields from one 1Password item (BUILDERS_FLEET_GITHUB_APP) into a K8s Secret. Shares the tuist-pool- adoption prefix with the other two fleets so the existing pre-ordered pool absorbs builder demand.
  • infra/helm/tuist/values.yaml: buildersFleet block (enabled: false by default; flipped on per env in values-managed-{staging,canary,production}.yaml).
  • Chart-default macOS image bumped to macos-tahoe-26.3 across all three fleet templates + the matching CR +kubebuilder:default so a fresh helm install with no env overrides produces a single-OS fleet shape.
  • infra/vm-image-builder.md — operator runbook rewritten around helm upgrade + kubectl scale. Documents the one-time per-env GitHub App creation, the three-field 1Password stash, the tuist-pool- pre-order workflow, the Tahoe Local Network access permission step, the Packer pinning rationale, the manual seeding required when adding a new fleet to a cluster with pre-existing pool hosts, and the migration off any hand-bootstrapped hosts.
  • infra/{runner-image,xcresult-processor-image}/*.pkr.hcl: ssh_timeout bumped 120s → 15m to absorb cold-clone first-boot, and packer now installs from hashicorp/tap since core Homebrew dropped it.

What got removed

  • infra/vm-image-builder-bootstrap/ — the standalone Go CLI we started with. Logic moved into macos-host-bootstrap/actions_runner.go and is invoked by the CAPI reconciler.
  • mise/tasks/vm-image-builder/bootstrap.sh — the mise run wrapper.

macOS Tahoe gotchas baked into this PR

Discovered while validating end-to-end on -01 (the existing hand-bootstrapped host) and on the first cluster-managed staging builder, all of which apply to every new host too:

  1. GHCR keychain wall. tart login requires a keychain UI prompt that LaunchAgents can’t surface. Worked around by removing the redundant login step (env-var auth on Push is sufficient).
  2. Local Network access TCC grant. Tahoe gates 192.168.64.0/22 (Tart’s vmnet subnet) behind a per-binary-signature “wants to access devices on your local network” permission. SIP-protected, can’t be pre-granted programmatically. Operator must VNC into each new host once and click Allow — documented in the runbook.
  3. Local Network grant invalidated by binary upgrades. Any brew upgrade packer replaces the binary with a new code-signature → revokes the grant → next build hangs on Waiting for SSH... until someone clicks Allow again. PR pins Packer in the bootstrap and drops the implicit-upgrade steps so the grant stays stable; Tart upgrades go through operator-image bumps.
  4. LaunchAgent load over SSH. The upstream svc.sh start uses legacy launchctl load which can’t reach the user’s Aqua session over SSH. Replaced with launchctl bootstrap gui/<UID> targeting the GUI domain explicitly.
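
As a rough illustration of gotcha 4, the sketch below builds the launchctl bootstrap gui/<UID> invocation from Go. The helper names and the plist path are hypothetical, and the real bootstrap presumably runs the command on the remote host over SSH rather than locally; only the launchctl bootstrap <domain-target> <service-path> shape is taken as given.

```go
package main

import (
	"fmt"
	"os/exec"
	"os/user"
)

// bootstrapArgs (hypothetical helper) returns the argv that loads a
// LaunchAgent plist into the per-user GUI (Aqua) domain, gui/<UID>.
func bootstrapArgs(uid, plistPath string) []string {
	return []string{"launchctl", "bootstrap", "gui/" + uid, plistPath}
}

// loadLaunchAgent runs the command for the current user. It only makes
// sense on macOS; the PR's reconciler would run the equivalent remotely.
func loadLaunchAgent(plistPath string) error {
	u, err := user.Current()
	if err != nil {
		return fmt.Errorf("look up current user: %w", err)
	}
	args := bootstrapArgs(u.Uid, plistPath)
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("launchctl bootstrap: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Assumed plist path, for illustration only. Printing the argv keeps
	// the example runnable on non-macOS machines; call loadLaunchAgent
	// on a real host instead.
	fmt.Println(bootstrapArgs("501", "/Users/runner/Library/LaunchAgents/actions.runner.example.plist"))
}
```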

Future work: MDM-managed TCC profiles

The Local Network Allow click is the only remaining manual step in onboarding a new host. The official Apple-supported path to eliminate it is MDM Configuration Profiles (PPPC profiles) deployed via an Apple Business Manager-enrolled MDM (Jamf, Kandji, Mosyle, etc.). A signed .mobileconfig profile pre-grants TCC permissions by binary path or code-signature requirement, so a freshly-provisioned host comes up with packer already authorised. Requires enrolling Scaleway Mac minis in ABM and paying for an MDM seat per host. Worth revisiting once the fleet grows past a handful of hosts; for now the one-time Allow click per host is the pragmatic shape.

Operator workflow after this lands

One-time per env (staging / canary / production):

  1. Create a “Tuist Builders Fleet [Env]” GitHub App on the tuist org with Organization → Self-hosted runners: Read and write. Install on tuist org, note App ID + installation ID, generate + download a private key.
  2. Stash all three in 1Password: op://tuist-k8s-<env>/BUILDERS_FLEET_GITHUB_APP/{app-id,installation-id,private-key}. Never rotated thereafter.
  3. Pre-order Mac minis on Scaleway with the tuist-pool- prefix (shared pool with the other two fleets) — bump capacity by 1 to absorb the new fleet’s demand.

Per scale-up:

  4. kubectl scale machinedeployment <release>-builders-fleet --replicas=N or edit replicas in the env values file.
  5. VNC into each new host once and click Allow on the Local Network prompt (runbook calls this out). The runner agent registers with GitHub immediately; the first image-bake workflow dispatched at it surfaces the prompt.

Test plan

  • All six infra Go modules build + test clean
  • helm template and helm lint clean across permutations: default off, ESO-driven, self-host-managed Secret, production-shape with adoption
  • CR types + CRD manifests regenerated for the renamed GHAppSecretName field; Resolver interface fronts the GitHub-specific resolution
  • internal/githubapp unit-tested (happy path, error propagation, missing-credential validation, JWT roundtrip with public-key verification, PKCS#8 PEM acceptance)
  • internal/runner unit-tested (nil-spec short-circuit, happy path, whitespace trimming on numeric fields, missing-field error messages, minter-error propagation)
  • Tahoe gotchas surfaced + fixed: GHCR keychain bypass, packer pinned, Local Network Allow documented, LaunchAgent loaded via gui/<UID>
  • End-to-end credential exchange validated against real GitHub for staging + canary + production Apps (go test -tags e2e -run TestE2E_MintAgainstRealGitHub …)
  • First cluster-managed staging builder host through the reconciler: tuist-pool-01 adopted → bootstrap completed → runner tuist-tuist-builders-fleet-bs9kw-h4np6 (id 54982) online + accepting jobs (smoke workflow dispatched against it, 5s pickup, all tooling present)

🤖 Generated with Claude Code
