feat(server): account-scoped Runner Profiles (Linux v1)

GitHub issue · Closed

Source: tuist/tuist #10970
Updated: Jun 24, 2026
Domains: Compute

Customers can create per-account Runner Profiles that bundle vCPU and RAM behind a name and reference them from GitHub Actions workflows as runs-on: tuist-<name>. Changing a profile’s resources upgrades every workflow that uses it with no workflow-file edits. Linux only for v1; macOS follows.

The PR spans four areas: the customer-facing dashboard surface, the server dispatch/persistence layer, the Helm shape catalog the profiles resolve to, and a fleet-capacity-aware autoscaler so heterogeneous shapes pack onto the shared bare-metal fleet without one shape’s idle warm pods starving another’s real work. It also adds capacity-fit observability to the Grafana dashboard.

Architecture

Three layers, increasingly customer-facing:

  1. Shape pools: Helm-rendered RunnerPool CRs, one per (vcpus, memory_gb) entry in the new runnersFleetLinux.shapes catalog. Deterministic names (runner-pool-linux-<vcpus>vcpu-<gb>gb) and internal dispatch labels (shape-linux-…); customers never reference these directly.
  2. Catalog: Tuist.Runners.Catalog reads the shape list from app config (:runner_linux_shapes). Helm injects that list (the same one it renders the pools from) as TUIST_RUNNER_LINUX_SHAPES, parsed in runtime.exs. This is a single source of truth, so the catalog and the rendered pools can’t drift. It backs profile validation and the resources picker.
  3. Profiles: account-scoped Postgres rows (runner_profiles) carrying (name, vcpus, memory_gb). Profiles.match_for_dispatch/2 resolves (account, requested-label) through the profile to a shape pool.
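The three layers above can be sketched as a small lookup. This is a minimal illustration in Go, not the server code (which is Elixir): the Shape and Profile types, the in-memory slices, and the 8 vCPU / 32 GB entry are assumptions made for the example. Only the pool-name pattern and the tuist-<name> label form come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// Shape is one (vcpus, memory_gb) entry in the curated catalog.
type Shape struct {
	VCPUs    int
	MemoryGB int
}

// Profile is a hypothetical in-memory stand-in for a runner_profiles row.
type Profile struct {
	Account string
	Name    string
	Shape   Shape
}

// PoolName renders the deterministic pool name for a shape, following the
// runner-pool-linux-<vcpus>vcpu-<gb>gb pattern.
func PoolName(s Shape) string {
	return fmt.Sprintf("runner-pool-linux-%dvcpu-%dgb", s.VCPUs, s.MemoryGB)
}

// InCatalog reports whether a shape is one of the curated catalog entries.
func InCatalog(catalog []Shape, s Shape) bool {
	for _, c := range catalog {
		if c == s {
			return true
		}
	}
	return false
}

// MatchForDispatch resolves (account, requested label) through a profile to
// a shape pool name. It returns false when the label is not a tuist-<name>
// label, when the account has no such profile, or when the profile's shape
// is not in the catalog.
func MatchForDispatch(profiles []Profile, catalog []Shape, account, label string) (string, bool) {
	if !strings.HasPrefix(label, "tuist-") {
		return "", false
	}
	name := strings.TrimPrefix(label, "tuist-")
	for _, p := range profiles {
		if p.Account == account && p.Name == name && InCatalog(catalog, p.Shape) {
			return PoolName(p.Shape), true
		}
	}
	return "", false
}
```

Because the workflow only ever names the profile, pointing a profile at a different catalog shape changes the resolved pool without touching the runs-on: line.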

Curated catalog over arbitrary specs: it keeps capacity planning tractable, removes on-demand pool creation, and matches what the rest of the market converged on. Named bundle over specs-in-label (tuist-4cpu-16gb): the named bundle is the only pattern that preserves “change specs without editing workflows.”

The shape catalog is the whole Linux fleet: the old single per-env pool (tuist-{env}-linux, 1 vCPU / 4 GB) is removed. Its labels are aliased in dispatch to the default shape (4 vCPU / 16 GB), so existing workflows keep running without edits. The default shape matches GitHub’s standard ubuntu-latest runner (4 vCPU / 16 GB), so migrating jobs see no resource regression. It also inherits the warm floor the legacy pool carried (prod 30, staging/canary 2); the long-tail shapes scale to zero.

Server

  • Schema: runner_profiles (Postgres). runner_jobs (ClickHouse) gains requested_dispatch_label so the customer-facing label rides through the queue. No data backfill — runners are pre-GA with a single internal consumer, so the one default profile is created by hand; new accounts create their own.
  • Dispatch: Dispatch.resolve_dispatch_target/2 resolves in three steps: (1) account profile (tuist-<name> → shape pool); (2) legacy-Linux alias — the env’s own retired tuist-{env}-linux label maps to the catalog default shape, with the original label preserved on the runner so GitHub still binds; (3) match_pool/1 for macOS and other non-shape Helm pools. The legacy alias is scoped to the current Environment.env(), so each env only handles the label that used to address its own pool — the GitHub App installation fans workflow_job events out to every env’s server, and an unfiltered alias would have each env enqueueing the others’ jobs and starving its own pool after any webhook redelivery flood.
  • JIT-mint: serve_claim stamps the customer-requested label from the runner_jobs row (not the pool’s internal dispatchLabel) so GitHub binds correctly; legacy rows with an empty value fall back to the pool label.
  • Profiles context: pure Postgres CRUD. Name is immutable post-create (it is the runs-on: identity; renaming would silently break workflows). Per-account cap of 10. Catalog-validated (vcpus, memory_gb).
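The three-step dispatch resolution described above can be sketched as follows. This is a Go transliteration for illustration only: the real code is the Elixir function Dispatch.resolve_dispatch_target/2, and the Resolver type, its map fields, and the "account/label" key format are hypothetical stand-ins for the Postgres lookup and the Helm pool list.

```go
package main

// Target is the pool a job is enqueued against, plus the label the runner
// must carry so GitHub binds the job to it.
type Target struct {
	Pool        string
	RunnerLabel string
}

// Resolver holds hypothetical lookup tables for this sketch.
type Resolver struct {
	Env          string            // current environment, e.g. "staging"
	DefaultPool  string            // pool of the catalog default shape
	ProfilePools map[string]string // "account/label" -> shape pool
	HelmPools    map[string]string // dispatch label -> non-shape pool (e.g. macOS)
}

// Resolve applies the three steps in order and preserves the
// customer-requested label on the returned target.
func (r Resolver) Resolve(account, label string) (Target, bool) {
	// Step 1: account profile (tuist-<name> -> shape pool).
	if pool, ok := r.ProfilePools[account+"/"+label]; ok {
		return Target{Pool: pool, RunnerLabel: label}, true
	}
	// Step 2: legacy alias, scoped to this env's own retired label only,
	// so one env never enqueues another env's jobs.
	if label == "tuist-"+r.Env+"-linux" {
		return Target{Pool: r.DefaultPool, RunnerLabel: label}, true
	}
	// Step 3: macOS and other non-shape Helm pools.
	if pool, ok := r.HelmPools[label]; ok {
		return Target{Pool: pool, RunnerLabel: label}, true
	}
	return Target{}, false
}
```

In this sketch a staging resolver accepts tuist-staging-linux but ignores tuist-canary-linux, which is the env scoping that keeps a fanned-out webhook from being enqueued by every environment.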

Customer-facing UI

A single LiveView at /:account/runners/profiles under the Runners sidebar group (alongside Workflows and Jobs), built to match the shipped runners dashboard conventions:

  • Noora card header + table with Name, Label (tuist-<name>), Platform (badge), vCPU, RAM, Last used columns. Kebab-menu row actions for Edit / Delete.
  • Create and Edit run through one Noora modal: an inline tuist- prefix on the name field, a Platform picker (Linux today, structured for macOS next), and a Resources picker populated from the catalog. Both pickers checkmark the selected item.
  • Last-used is derived from runner_jobs.requested_dispatch_label.
  • Seeds enable runners and add default / large / xlarge profiles for both dev accounts; the seed also re-stamps the test user’s password each run so “Log in as test user” survives secret rotation.

Fleet-capacity-aware autoscaling (cross-pool rebalancer)

Linux shape pools share one bare-metal node pool, so their speculative warm headroom competes for the same memory. Left per-pool, the autoscaler could ask for more pods than the fleet can host, and an idle shape’s warm buffer could sit pinned while another shape’s real queued work went Pending.

os: linux pools now run their per-pool target through internal/scaling/allocate.go’s AllocateFleet, a three-tier waterfall over pools sharing a FleetSelector:

  1. Floor: every pool’s minWarmPoolFloor (warm guarantee).
  2. Load: claimed + queued real work above the floor.
  3. Headroom: the speculative p95 buffer, from whatever memory is left.

Tiers 1 and 2 are always honored in full; any excess goes Pending, which is the operator’s add-a-host signal. Tier 3 is squeezed proportionally under contention, which is the cross-pool reclaim: an idle shape’s warm pods fall back toward its floor so a starved shape’s real work fits. The allocator budgets memory only, because kata pins memory per microVM while CPU is oversubscribed. Fleet allocatable memory is summed from the runners-linux node pool (a cluster-scoped nodes read was added to the ClusterRole for this) and scaled by a reserve fraction (default 0.9). Any failure gathering the fleet view falls back to the per-pool target, so a node-read blip never triggers a mass scale-down. macOS pools (one VM per host, no bin-packing) keep the plain per-pool path.
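A minimal sketch of the waterfall, under stated assumptions: this is not the AllocateFleet code from internal/scaling/allocate.go. The PoolDemand type, the function name, whole-GB integer arithmetic, and expressing the reserve fraction as a percentage (90 for 0.9) are choices made for the example. It also assumes headroom is added on top of max(floor, load).

```go
package main

// PoolDemand is one pool's demand in pods, with the memory each pod pins.
type PoolDemand struct {
	Name     string
	PodMemGB int // memory pinned per microVM
	Floor    int // warm guarantee (tier 1)
	Load     int // claimed + queued real work (tier 2)
	Headroom int // speculative buffer (tier 3)
}

// base is tiers 1 and 2 combined: the floor, or the real load if higher.
func base(p PoolDemand) int {
	if p.Load > p.Floor {
		return p.Load
	}
	return p.Floor
}

// allocateSketch returns target replicas per pool. Tiers 1 and 2 are always
// granted in full, even past the budget. Tier 3 is granted from whatever
// memory is left and scaled down proportionally when it does not all fit.
func allocateSketch(pools []PoolDemand, allocatableGB, reservePercent int) map[string]int {
	budget := allocatableGB * reservePercent / 100
	targets := make(map[string]int, len(pools))
	committed, wanted := 0, 0
	for _, p := range pools {
		targets[p.Name] = base(p)
		committed += base(p) * p.PodMemGB
		wanted += p.Headroom * p.PodMemGB
	}
	remaining := budget - committed
	if remaining <= 0 || wanted == 0 {
		return targets // no room (or no request) for speculative headroom
	}
	for _, p := range pools {
		grant := p.Headroom
		if wanted > remaining {
			// Proportional squeeze, rounded down so the total still fits.
			grant = p.Headroom * remaining / wanted
		}
		targets[p.Name] += grant
	}
	return targets
}
```

With a 160 GB budget, an idle 16 GB pool (floor 2, headroom 6) and a busy 16 GB pool (load 8) use the whole budget on tiers 1 and 2, so the idle pool is held at its floor of 2 while the busy pool keeps all 8 replicas.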

One loop still owns spec.replicas (no competing controllers), and it reuses the existing safety primitive: RunnerPoolReconciler only ever reaps idle pods (no tuist.dev/runner-pool-owner label), so shrinking an over-provisioned shape never kills an in-flight build.

The PriorityClass + preemption alternative was rejected: pod priority is immutable post-admission, and preemption keys off priority value, not claim state, so it cannot distinguish warm from claimed pods and would kill running builds.

Observability

The runners Grafana dashboard gains a Capacity fit (bin-packing) row plus an Overview Unscheduled stat:

  • Unscheduled replicas (desired − observed) by shape: the per-shape fragmentation signal, built from existing PromEx gauges, so it works today.
  • Pending runner pods and Linux fleet memory (requested vs allocatable): kube-state-metrics ground truth; both degrade to no-data if KSM is not scraped.

runners.json was also normalized to the fully-expanded formatting the other six dashboards use.

Capacity planning / known tradeoff

The multi-shape catalog is harder to bin-pack than the prior single-shape pool: heterogeneous pod sizes sharing hosts can fragment, so a large shape may go Pending even when aggregate free capacity exists. The fleet-aware autoscaler avoids over-committing the fleet and reclaims idle headroom for real work, but the hard fit guarantee still comes from the kube-scheduler, which never overcommits a node. Operators size the fleet against a shape mix rather than a flat slot count and watch the new Pending-by-shape panel as the provisioning signal. Lighter config-only levers if fragmentation bites: dedicate a node pool per shape class, choose cleanly-tiling shape sizes, or set floor: 0 for rare shapes.
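The fragmentation case can be shown with a small worked example. The node sizes and the 32 GB shape below are illustrative assumptions, not values from the catalog, and the per-node fit check is a simplification of what the scheduler does.

```go
package main

// TotalFree sums the free memory across all nodes.
func TotalFree(freeGB []int) int {
	total := 0
	for _, f := range freeGB {
		total += f
	}
	return total
}

// FitsOnSomeNode reports whether a pod needing podGB fits on at least one
// node. A pod cannot be split across nodes, so aggregate free memory alone
// does not decide this.
func FitsOnSomeNode(freeGB []int, podGB int) bool {
	for _, f := range freeGB {
		if f >= podGB {
			return true
		}
	}
	return false
}
```

Three nodes with 24 GB free each hold 72 GB in aggregate, yet a 32 GB pod fits on none of them and goes Pending, while a 16 GB pod schedules. This is why a fleet-level memory budget bounds over-commit but cannot replace the per-node fit check.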

Out of scope (v1.1+)

  • Ubuntu version / arch / custom image
  • macOS profiles (the platform picker is structured for it)
  • CLI commands and a public REST API
  • Per-profile concurrency caps (account-wide runner_max_concurrent still gates dispatch)

Test plan

  • Server: mix test test/tuist/runners/ test/tuist_web/live/runner_profiles_live_test.exs test/tuist_web/live/runners_live_test.exs test/tuist_web/live/runner_workflows_live_test.exs test/tuist_web/live/runner_jobs_live_test.exs — all pass (12 profiles-context, 6 LiveView, 3 new dispatch cases, plus the pre-existing runners suites)
  • Controller (Go): go test ./... in infra/runners-controller — all pass, including AllocateFleet table tests (7 cases incl. the reclaim scenario) and a controller test proving an idle shape is held at floor while a busy shape keeps its real load
  • mix compile --warnings-as-errors, mix format --check-formatted, mix credo clean; prettier clean on touched CSS; gofmt/go vet clean
  • helm template renders the shape pools and the new nodes RBAC rule cleanly
  • mix gettext.extract — only .pot files changed
  • Browser-verified locally end to end: create / edit / delete profile via the modal, platform + resources pickers with checkmarks, badge alignment, the list columns, and the seeded data across both dev accounts
  • Staging end-to-end: deployed the PR HEAD to staging.tuist.dev, created a staging-linux profile in the dashboard (label tuist-staging-linux, 4 vCPU / 16 GB), and dispatched linux-runners-staging-smoke.yml against the real GitHub App installation. Run 26783125271 completed green: webhook → runners: enqueued fleet=tuist-tuist-runner-pool-linux-4vcpu-16gb dispatch_label=tuist-staging-linux → polling Pod claimed → JIT minted carrying the original label → GitHub bound and the job’s nproc / MemTotal assertions ran inside the kata microVM

How to test locally

  1. mise run db:reset then mise run dev, log in as the seeded test user
  2. Visit /<account>/runners/profiles — three seeded profiles per account
  3. New profile: type a name (note the tuist- prefix), pick platform + resources (checkmarks on the selected items), save; edit one (name disabled, resources mutable); delete one
  4. In a workflow, set runs-on: tuist-<name> and trigger a job; the webhook enqueues against the matching shape pool

🤖 Generated with Claude Code

Comments
tuist-atlas[bot] Jun 3, 2026

Account-scoped Runner Profiles (Linux v1) are now available in server@1.204.0. Update to use this feature.

tuist-atlas[bot] Jun 3, 2026

The account-scoped Runner Profiles (Linux v1) feature is now available in runners-controller@0.5.0. Please update to use it.

tuist-atlas[bot] Jun 5, 2026

The changes from this PR are now available in release xcresult-processor-image@0.11.0. Account-scoped Runner Profiles (Linux v1) are now available.