Hive
feat(server): account-scoped Runner Profiles (Linux v1)
GitHub issue · Closed
Customers can create per-account Runner Profiles that bundle vCPU and RAM behind a name and reference them from GitHub Actions workflows as runs-on: tuist-<name>. Changing a profile’s resources upgrades every workflow that uses it with no workflow-file edits. Linux only for v1; macOS follows.
The PR spans four areas: the customer-facing dashboard surface, the server dispatch/persistence layer, the Helm shape catalog the profiles resolve to, and a fleet-capacity-aware autoscaler so heterogeneous shapes pack onto the shared bare-metal fleet without one shape’s idle warm pods starving another’s real work. It also adds capacity-fit observability to the Grafana dashboard.
Architecture
Three layers, increasingly customer-facing:
- Shape pools: Helm-rendered RunnerPool CRs, one per (vcpus, memory_gb) entry in the new runnersFleetLinux.shapes catalog. Deterministic names (runner-pool-linux-<vcpus>vcpu-<gb>gb) and internal dispatch labels (shape-linux-…); customers never reference these directly.
- Catalog: Tuist.Runners.Catalog reads the shape list from app config (:runner_linux_shapes). Helm injects that list (the same one it renders the pools from) as TUIST_RUNNER_LINUX_SHAPES, parsed in runtime.exs. This is the single source of truth, so the catalog and the rendered pools can’t drift. Backs profile validation and the resources picker.
- Profiles: account-scoped Postgres rows (runner_profiles) carrying (name, vcpus, memory_gb). Profiles.match_for_dispatch/2 resolves (account, requested-label) through the profile to a shape pool.
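The resolution chain above (label, then profile, then shape, then pool name) can be sketched in a few lines. This is an illustrative Go sketch, not the server code (which is Elixir): the Shape and Profile types, the ResolvePool function, the "acme" account, and the 8 vCPU / 32 GB catalog entry are all assumptions made for the example. Only the pool-name pattern and the tuist-<name> label form come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// Shape is one (vcpus, memory_gb) entry in the catalog.
type Shape struct {
	VCPUs    int
	MemoryGB int
}

// PoolName renders the deterministic pool name for a shape.
func PoolName(s Shape) string {
	return fmt.Sprintf("runner-pool-linux-%dvcpu-%dgb", s.VCPUs, s.MemoryGB)
}

// Profile is a hypothetical in-memory stand-in for a runner_profiles row.
type Profile struct {
	Account string
	Name    string
	Shape   Shape
}

// ResolvePool maps (account, requested label) to a shape pool name.
// It returns false when the label is not tuist-<name>, when no profile
// matches, or when the profile's shape is not in the catalog.
func ResolvePool(profiles []Profile, catalog []Shape, account, label string) (string, bool) {
	if !strings.HasPrefix(label, "tuist-") {
		return "", false
	}
	name := strings.TrimPrefix(label, "tuist-")
	for _, p := range profiles {
		if p.Account != account || p.Name != name {
			continue
		}
		for _, s := range catalog {
			if s == p.Shape {
				return PoolName(s), true
			}
		}
		return "", false // profile exists but its shape left the catalog
	}
	return "", false
}

func main() {
	catalog := []Shape{{4, 16}, {8, 32}} // 8/32 is an assumed entry
	profiles := []Profile{{Account: "acme", Name: "large", Shape: Shape{8, 32}}}

	pool, ok := ResolvePool(profiles, catalog, "acme", "tuist-large")
	fmt.Println(pool, ok) // runner-pool-linux-8vcpu-32gb true
}
```

Because the workflow only ever names the profile, pointing "large" at a different catalog shape changes the resolved pool without touching the workflow file.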
Curated catalog over arbitrary specs: it keeps capacity planning tractable, removes on-demand pool creation, and matches what the rest of the market converged on. Named bundle over specs-in-label (tuist-4cpu-16gb): the named bundle is the only pattern that preserves “change specs without editing workflows.”
The shape catalog is the whole Linux fleet: the old single per-env pool (tuist-{env}-linux, 1 vCPU / 4 GB) is removed. Its labels are aliased in dispatch to the default shape, so existing workflows keep running without edits. The default shape carries the warm floor the legacy pool used to (prod 30, staging/canary 2); the long-tail shapes scale to zero. The default size matches GitHub’s standard ubuntu-latest runner (4 vCPU / 16 GB), so migrating jobs see no resource regression.
Server
- Schema: runner_profiles (Postgres). runner_jobs (ClickHouse) gains requested_dispatch_label so the customer-facing label rides through the queue. No data backfill: runners are pre-GA with a single internal consumer, so the one default profile is created by hand; new accounts create their own.
- Dispatch: Dispatch.resolve_dispatch_target/2 resolves in three steps: (1) account profile (tuist-<name> → shape pool); (2) legacy-Linux alias, where the env’s own retired tuist-{env}-linux label maps to the catalog default shape, with the original label preserved on the runner so GitHub still binds; (3) match_pool/1 for macOS and other non-shape Helm pools. The legacy alias is scoped to the current Environment.env(), so each env only handles the label that used to address its own pool. The GitHub App installation fans workflow_job events out to every env’s server, and an unfiltered alias would have each env enqueueing the others’ jobs and starving its own pool after any webhook redelivery flood.
- JIT-mint: serve_claim stamps the customer-requested label from the runner_jobs row (not the pool’s internal dispatchLabel) so GitHub binds correctly; legacy rows with an empty value fall back to the pool label.
- Profiles context: pure Postgres CRUD. Name is immutable post-create (it is the runs-on: identity; renaming would silently break workflows). Per-account cap of 10. Catalog-validated (vcpus, memory_gb).
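The three-step dispatch order, including the env-scoped legacy alias, can be illustrated as follows. This is a hedged Go sketch of the logic only. The real implementation is the Elixir function named above, and the Resolver and Target types, the map-based lookups, the "staging" and "production" env names, and the example macOS label are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Target is where a job is enqueued and which label the runner carries.
type Target struct {
	Pool        string // pool that will run the job
	RunnerLabel string // customer-requested label, kept so GitHub binds
}

// Resolver holds per-environment dispatch state (hypothetical shape).
type Resolver struct {
	Env         string            // current environment
	DefaultPool string            // pool of the catalog default shape
	HelmPools   map[string]string // label -> pool for non-shape pools
}

// Resolve applies the three steps in order and stops at the first match.
// accountProfiles maps a profile name to its shape pool for one account.
func (r Resolver) Resolve(accountProfiles map[string]string, label string) (Target, bool) {
	// Step 1: account profile, tuist-<name> -> shape pool.
	if name := strings.TrimPrefix(label, "tuist-"); name != label {
		if pool, ok := accountProfiles[name]; ok {
			return Target{Pool: pool, RunnerLabel: label}, true
		}
	}
	// Step 2: legacy alias, only for this env's own retired label.
	if label == "tuist-"+r.Env+"-linux" {
		return Target{Pool: r.DefaultPool, RunnerLabel: label}, true
	}
	// Step 3: macOS and other non-shape Helm pools.
	if pool, ok := r.HelmPools[label]; ok {
		return Target{Pool: pool, RunnerLabel: label}, true
	}
	return Target{}, false
}

func main() {
	r := Resolver{
		Env:         "staging",
		DefaultPool: "runner-pool-linux-4vcpu-16gb",
		HelmPools:   map[string]string{"example-macos": "example-macos-pool"},
	}
	profiles := map[string]string{"large": "runner-pool-linux-8vcpu-32gb"}

	labels := []string{"tuist-large", "tuist-staging-linux", "tuist-production-linux", "example-macos"}
	for _, label := range labels {
		t, ok := r.Resolve(profiles, label)
		fmt.Printf("%s -> pool=%q matched=%v\n", label, t.Pool, ok)
	}
}
```

In this sketch the staging resolver matches its own legacy label but leaves tuist-production-linux unmatched, which is the filtering that keeps one env from enqueueing another env's jobs when the webhook is fanned out to all of them.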
Customer-facing UI
A single LiveView at /:account/runners/profiles under the Runners sidebar group (alongside Workflows and Jobs), built to match the shipped runners dashboard conventions:
- Noora card header + table with Name, Label (tuist-<name>), Platform (badge), vCPU, RAM, Last used columns. Kebab-menu row actions for Edit / Delete.
- Create and Edit run through one Noora modal: an inline tuist- prefix on the name field, a Platform picker (Linux today, structured for macOS next), and a Resources picker populated from the catalog. Both pickers checkmark the selected item.
- Last-used is derived from runner_jobs.requested_dispatch_label.
- Seeds enable runners and add default / large / xlarge profiles for both dev accounts; the seed also re-stamps the test user’s password each run so “Log in as test user” survives secret rotation.
Fleet-capacity-aware autoscaling (cross-pool rebalancer)
Linux shape pools share one bare-metal node pool, so their speculative warm headroom competes for the same memory. Left per-pool, the autoscaler could ask for more pods than the fleet can host, and an idle shape’s warm buffer could sit pinned while another shape’s real queued work went Pending.
os: linux pools now run their per-pool target through internal/scaling/allocate.go’s AllocateFleet, a three-tier waterfall over pools sharing a FleetSelector:
- Floor: every pool’s minWarmPoolFloor (warm guarantee).
- Load: claimed + queued real work above the floor.
- Headroom: the speculative p95 buffer, from whatever memory is left.
Tiers 1+2 are always honored in full (excess goes Pending, the operator’s add-a-host signal). Tier 3 is squeezed proportionally under contention, which is the cross-pool reclaim: an idle shape’s warm pods fall back toward its floor so a starved shape’s real work fits. Memory-only (kata pins it per microVM; CPU is oversubscribed). Fleet allocatable is summed from the runners-linux node pool (cluster-scoped nodes read added to the ClusterRole), scaled by a reserve fraction (default 0.9). Any failure gathering the fleet view falls back to the per-pool target, so a node-read blip never triggers a mass scale-down. macOS pools (one VM per host, no bin-packing) keep the plain per-pool path.
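A compact way to see the waterfall is a sketch like the one below. It is an assumption-laden illustration, not the contents of allocate.go: the Pool fields, the Allocate signature, rounding headroom grants down, and the example numbers are all invented for the example. It follows the description in honoring floor and load in full, then scaling headroom by the ratio of remaining memory to requested headroom memory.

```go
package main

import (
	"fmt"
	"math"
)

// Pool is a hypothetical per-shape input to the allocator.
type Pool struct {
	Name        string
	PodMemoryGB float64 // memory pinned per pod (per microVM)
	Floor       int     // tier 1: warm guarantee, in pods
	Load        int     // tier 2 input: claimed + queued pods
	Headroom    int     // tier 3: speculative buffer, in pods
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// Allocate returns a replica target per pool. Tiers 1 and 2 are always
// granted, even if they exceed the budget (the excess would go Pending).
// Tier 3 is scaled down proportionally when memory is short.
func Allocate(pools []Pool, allocatableGB, reserve float64) map[string]int {
	budget := allocatableGB * reserve
	targets := make(map[string]int, len(pools))

	var committed, wanted float64
	for _, p := range pools {
		guaranteed := maxInt(p.Floor, p.Load) // floor plus load above the floor
		targets[p.Name] = guaranteed
		committed += float64(guaranteed) * p.PodMemoryGB
		wanted += float64(p.Headroom) * p.PodMemoryGB
	}

	remaining := budget - committed
	scale := 1.0
	switch {
	case remaining <= 0:
		scale = 0 // nothing left for speculative pods
	case wanted > remaining:
		scale = remaining / wanted // squeeze every pool by the same ratio
	}
	for _, p := range pools {
		targets[p.Name] += int(math.Floor(float64(p.Headroom) * scale))
	}
	return targets
}

func main() {
	pools := []Pool{
		{Name: "default", PodMemoryGB: 16, Floor: 2, Load: 0, Headroom: 6}, // idle shape
		{Name: "large", PodMemoryGB: 32, Floor: 0, Load: 4, Headroom: 2},   // busy shape
	}
	t := Allocate(pools, 200, 0.9)
	fmt.Println(t["default"], t["large"]) // 2 4
}
```

With 180 GB of budget, floor and load consume 160 GB, so the 20 GB left cannot cover a whole headroom pod at the shared 0.125 ratio: the idle shape falls back to its floor of 2 while the busy shape keeps all 4 pods of real load. A caller would skip this function and use the per-pool target whenever the fleet view cannot be gathered, as described above.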
One loop still owns spec.replicas (no competing controllers), and it reuses the existing safety primitive: RunnerPoolReconciler only ever reaps idle pods (no tuist.dev/runner-pool-owner label), so shrinking an over-provisioned shape never kills an in-flight build.
The PriorityClass + preemption alternative was rejected: pod priority is immutable post-admission, and preemption keys off priority value, not claim state, so it cannot distinguish warm from claimed pods and would kill running builds.
Observability
The runners Grafana dashboard gains a Capacity fit (bin-packing) row plus an Overview Unscheduled stat:
- Unscheduled replicas (desired − observed) by shape — built from existing PromEx gauges, works today, the per-shape fragmentation signal.
- Pending runner pods and Linux fleet memory (requested vs allocatable) — kube-state-metrics ground truth; degrade to no-data if KSM is not scraped.
runners.json was also normalized to the fully-expanded formatting the other six dashboards use.
Capacity planning / known tradeoff
The multi-shape catalog is harder to bin-pack than the prior single-shape pool: heterogeneous pod sizes sharing hosts can fragment (a large shape may go Pending even when aggregate free capacity exists). Fleet-aware allocation keeps the autoscaler from over-committing and reclaims idle headroom for real work, but the hard fit guarantee remains the kube-scheduler, which never overcommits a node. Operators size the fleet against a shape mix rather than a flat slot count and watch the new Pending-by-shape panel as the provisioning signal. Lighter config-only levers if fragmentation bites: dedicate a node pool per shape class, choose cleanly-tiling shape sizes, or set floor: 0 for rare shapes.
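The fragmentation case is easy to reproduce with small numbers. The sketch below is purely illustrative: the node sizes, the helper names, and the 32 GB shape are assumptions, and the per-node check is a simplification of what the kube-scheduler does.

```go
package main

import "fmt"

// TotalFree sums free memory across nodes, in GB.
func TotalFree(freeGB []int) int {
	total := 0
	for _, f := range freeGB {
		total += f
	}
	return total
}

// FitsOnSomeNode reports whether one pod of podGB fits on a single node.
// A pod cannot be split across nodes, so aggregate free memory is not enough.
func FitsOnSomeNode(freeGB []int, podGB int) bool {
	for _, f := range freeGB {
		if f >= podGB {
			return true
		}
	}
	return false
}

func main() {
	free := []int{24, 24, 24} // free GB left on three hypothetical hosts

	fmt.Println(TotalFree(free))          // 72
	fmt.Println(FitsOnSomeNode(free, 32)) // false
	fmt.Println(FitsOnSomeNode(free, 16)) // true
}
```

Here 72 GB is free in aggregate, yet a 32 GB pod stays Pending because no single host has room, while a 16 GB pod schedules. Shape sizes that tile host memory evenly make this kind of leftover less likely.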
Out of scope (v1.1+)
- Ubuntu version / arch / custom image
- macOS profiles (the platform picker is structured for it)
- CLI commands and a public REST API
- Per-profile concurrency caps (account-wide runner_max_concurrent still gates dispatch)
Test plan
- Server: mix test test/tuist/runners/ test/tuist_web/live/runner_profiles_live_test.exs test/tuist_web/live/runners_live_test.exs test/tuist_web/live/runner_workflows_live_test.exs test/tuist_web/live/runner_jobs_live_test.exs; all pass (12 profiles-context, 6 LiveView, 3 new dispatch cases, plus the pre-existing runners suites)
- Controller (Go): go test ./... in infra/runners-controller; all pass, including AllocateFleet table tests (7 cases incl. the reclaim scenario) and a controller test proving an idle shape is held at floor while a busy shape keeps its real load
- mix compile --warnings-as-errors, mix format --check-formatted, mix credo clean; prettier clean on touched CSS; gofmt / go vet clean
- helm template renders the shape pools and the new nodes RBAC rule cleanly
- mix gettext.extract: only .pot files changed
- Browser-verified locally end to end: create / edit / delete profile via the modal, platform + resources pickers with checkmarks, badge alignment, the list columns, and the seeded data across both dev accounts
- Staging end-to-end: deployed the PR HEAD to staging.tuist.dev, created a staging-linux profile in the dashboard (label tuist-staging-linux, 4 vCPU / 16 GB), and dispatched linux-runners-staging-smoke.yml against the real GitHub App installation. Run 26783125271 completed green: webhook → runners: enqueued fleet=tuist-tuist-runner-pool-linux-4vcpu-16gb dispatch_label=tuist-staging-linux → polling Pod claimed → JIT minted carrying the original label → GitHub bound and the job’s nproc / MemTotal assertions ran inside the kata microVM
How to test locally
- mise run db:reset then mise run dev, log in as the seeded test user
- Visit /<account>/runners/profiles: three seeded profiles per account
- New profile: type a name (note the tuist- prefix), pick platform + resources (checkmarks on the selected items), save; edit one (name disabled, resources mutable); delete one
- In a workflow, set runs-on: tuist-<name> and trigger a job; the webhook enqueues against the matching shape pool
🤖 Generated with Claude Code