Why
Tuist already runs its own compute fleet on Kubernetes. Today it serves a single workload: CI runners. The fleet has two pools, both schedulable through the standard Kubernetes API: a Linux pool of standard container nodes, and a macOS pool of Mac mini nodes where Pods boot as full macOS VMs with Xcode preinstalled.
Both pools already produce shapes through Tuist.Runners.Catalog. The same substrate (Catalog, image bases, scheduler primitives, Postgres+ClickHouse lifecycle) can host a second workload: long-lived, account-scoped sandbox environments. In v1 sandboxes run on a dedicated subset of nodes within each pool, isolated from CI capacity.
This spec proposes exposing that substrate as a sandbox API. Customers create sandboxes through REST and the CLI; AI agents drive them through MCP. The first consumer is Hive (see below), running under the tuist account.
The conceptual delta from runners to sandboxes is small. Runners are single-shot JIT-claimed GitHub workers. Sandboxes are interactive, alive until deadline or explicit destroy. The per-account profile model (Tuist.Runners.Profiles), the claim primitive (Tuist.Runners.Claims), and the append-only session ledger (Tuist.Runners.RunnerSessions) translate one-to-one.
The macOS+Xcode pool is also a differentiator for outbound-driven workers (e.g. Anthropic Managed Agents, where every external provider is Linux-only); see Layers.
First use case: agentic coding workflows in Hive
The sandbox API is the execution substrate for Hive’s agentic coding workflows. Hive orchestrates product specs and hosts connected agents (Claude Code, in-process workers later). It cannot yet run the code that agents produce against a real toolchain; the loop closes when the agent can invoke the build, observe output, iterate, and check in. Android and Elixir work maps to the Linux pool; iOS/macOS work, including customer Apple projects, maps to the macOS+Xcode pool.
Hive itself is the customer. It calls the sandbox endpoints under the tuist account (a Tuist-internal service, so its usage is billed and quota-capped against Tuist itself, not any end-user account). Concretely:
Hive picks up a task, selects a profile by platform, POSTs /api/v1/sandboxes with a Tuist-owned account agent token. The pod boots on the matching pool with the repo cloned; the agent runs a tight POST /exec loop (install, build, parse, edit via PUT /files/*, retry). On convergence the agent opens a PR, Hive DELETEs, and sandbox_sessions records the billable interval under tuist.
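The flow above could be sketched roughly as follows. This is an illustration only: `FakeSandboxClient`, its method names, the `make build` command, and the retry limit are hypothetical stand-ins for the REST calls, not part of the spec.

```python
from dataclasses import dataclass, field


@dataclass
class ExecResult:
    exit_code: int
    stdout: str = ""
    stderr: str = ""


@dataclass
class FakeSandboxClient:
    """In-memory stand-in for the REST surface (hypothetical, for illustration)."""

    # Exit codes the fake returns for successive exec calls.
    scripted_exit_codes: list
    calls: list = field(default_factory=list)

    def create(self, profile: str) -> str:
        self.calls.append(("POST /sandboxes", profile))
        return "sbx-1"

    def exec(self, sandbox_id: str, command: str, args: list) -> ExecResult:
        self.calls.append(("POST /exec", command))
        return ExecResult(exit_code=self.scripted_exit_codes.pop(0))

    def destroy(self, sandbox_id: str) -> None:
        self.calls.append(("DELETE /sandboxes/:id", sandbox_id))


def run_build_loop(client, profile: str, max_attempts: int = 5) -> bool:
    """Create a sandbox, retry the build until it passes, always destroy."""
    sandbox_id = client.create(profile)
    try:
        for _ in range(max_attempts):
            result = client.exec(sandbox_id, "make", ["build"])
            if result.exit_code == 0:
                return True  # converged; the agent would open a PR here
            # A real agent would parse result.stderr and edit files
            # via PUT /files/* before retrying.
        return False
    finally:
        # Destroy on every path so the billable interval is always closed.
        client.destroy(sandbox_id)
```

Putting the destroy call in `finally` keeps the billable interval bounded even when the agent gives up or raises.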
Reference: programmatic APIs we learn from
Surveyed E2B, Daytona, Cloudflare, Vercel. Convergent verbs: create, exec, file I/O, expose port, snapshot, destroy, shell. Distinct capabilities tracked as roadmap items: code-interpreter abstractions, persistent volumes, fs watching, outbound credential controls.
Goals
- A REST surface (/api/v1/sandboxes/*) to create, exec into, file-sync, port-expose, and destroy ephemeral sandboxes on our own fleet.
- Linux and macOS as first-class platforms, scheduling on a dedicated sandbox pool in v1 (a cordoned subset of nodes within the existing fleet, isolated from CI capacity). Share with CI later when sandboxes are proven stable and external demand justifies the risk.
- Capacity-planning signal: piggy-back on the existing runner autoscaler. It already computes desired = max(claimed+queued, p95_last_hour) + warm-pool floor. Mechanism: an alerter watches for sustained desired > host slots or NoAvailableHost events on ScalewayAppleSiliconMachine CRs and posts to #ops Slack with shortfall + p95 + proposed host quantity. Operator orders manually (Scaleway console + bump runnersFleet.hostCount); no autonomous procurement. (No admission control in v1.)
- Warm-image provisioning on macOS: a small operator-maintained pool of pre-warmed VM images per Catalog shape (OS + Xcode + base deps resident) so POST /sandboxes resolves in seconds. Cold boot is the fallback when the warm pool is empty. Invisible to customers; gating for macOS viability.
- macOS long-lived guest stability is gating. Today production CI traffic is the only validator. Mechanism: a synthetic harness on a dedicated Mac mini runs M concurrent Tart VMs for the target deadline window (initial 24h) on a representative workload (idle, xcodebuild bursts, file I/O, preview-port traffic); monitors kernel-panic logs (/Library/Logs/DiagnosticReports/Kernel*), vm_stat memory pressure, guest recovery time. Pass = zero panics, peak memory under ceiling, recovery under T seconds, K consecutive runs. Report posts to the PR. Don’t bypass with a feature-flag rollout.
- Account-scoped profiles reusing Tuist.Runners.Catalog shapes and the runner_profiles table; profile names imply the platform.
- Append-only session ledger mirroring runner_sessions so billing math is one query away across both platforms.
- Hive integration under the tuist account as the inaugural client, proving the API end to end against build loops on both Linux and macOS.
- MCP tools mirroring the REST surface so third-party agents can drive sandboxes the same way Hive does.
- CLI commands (tuist sandbox create / exec / files / ssh / destroy) generated from OpenAPI; ssh is hand-written and supports IDE Remote-SSH (VS Code, Cursor).
- Idle offload (Linux in v1): when a ready sandbox sees no POST /exec / SSH / port traffic for 5 min (configurable), suspend (volume snapshot + pod release) and resume on next activity. Compute billing pauses while suspended; small snapshot-storage line accrues. Deadline clock pauses while suspended; 7-day cap bounds lifetime.
- macOS suspend/wake is a gated research goal. Constraint: tart exposes only full pause-copy-resume; macOS Virtualization.framework has no incremental memory snapshots today (Firecracker dirty-page tracking is Linux-only). Mechanism: a benchmark harness on a Mac mini + object storage measures pause-copy-resume cycle time across VM memory sizes (4/8/16 GB), compression (none/zstd), and transfer parallelism. Output: cycle-time-vs-size report + feasibility recommendation. Realistic outcome: macOS suspend/wake may be deferred indefinitely (constraint is upstream). Until then, macOS sandboxes have no idle offload.
- Feature-flag gated (:sandboxes) so we can stage rollout per account.
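As a worked illustration of the capacity-planning signal in the goals above, the autoscaler formula and the shortfall the alerter would report can be sketched as below. The function names and the `slots_per_host` parameter are assumptions made for the example; the spec only states the formula and the contents of the Slack message.

```python
import math


def desired_slots(claimed: int, queued: int, p95_last_hour: float, warm_floor: int) -> int:
    """desired = max(claimed + queued, p95_last_hour) + warm-pool floor."""
    return math.ceil(max(claimed + queued, p95_last_hour)) + warm_floor


def shortfall(desired: int, host_slots: int) -> int:
    """Slots missing when desired exceeds what the hosts provide."""
    return max(0, desired - host_slots)


def proposed_hosts(shortfall_slots: int, slots_per_host: int) -> int:
    """Hosts to order to cover the shortfall (slots_per_host is an assumption)."""
    return math.ceil(shortfall_slots / slots_per_host)
```

For example, with 10 claimed, 4 queued, a p95 of 12, and a warm floor of 2, desired is 16; against 12 host slots the shortfall is 4, which at an assumed 2 slots per host suggests ordering 2 hosts.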
Non-goals
- External backends. We run on our own fleet.
- A full IDE or desktop UI inside the sandbox. We expose SSH, exec, files, and port preview.
- Replacing GitHub Actions runners. Runners stay single-shot JIT-claim.
- Cross-region snapshots. Single-region.
- A new billing primitive. Reuse the runner_sessions shape.
- BYOC or on-prem sandboxes.
- Designing the outbound-driven worker model. The substrate is structured to admit it later; the lifecycle, scheduling, and integration surface for that case are out of scope here.
Reference model (Runners)
The substrate already exists in production for both platforms:
Tuist.Runners.Catalog: shape + Xcode-version source-of-truth from Helm. Reused unchanged.
Tuist.Runners.Profiles: per-account label-to-shape mapping; sandbox profiles get a new kind column.
Tuist.Runners.Claims: the per-account count + advisory-lock primitive is reused; the claim row is not (sandboxes have no shared job to collapse against).
Tuist.Runners.RunnerSessions: append-only (account_id, started_at, ended_at) intervals; same shape for sandbox billing.
Tuist.Kubernetes.Client: cluster control plane. Sandbox pods schedule on a dedicated subset of nodes (sandbox-specific nodeSelector + taint in v1); image base shared, capacity not.
- Sandbox pods share the runner image base; init differs: open SSH, mount scratch volume, report liveness, hold until terminate (vs. runners’ single-shot actions-runner).
Delta from runners to sandboxes (same delta on both platforms):
- Runners are single-shot. Sandboxes are interactive, alive until deadline.
- Runner concurrency is per-profile per pool. Sandbox concurrency is per-account.
- Runners are billed per workflow_job. Sandboxes are billed per session.
Layers
The capability splits into a provisioning substrate (pod lifecycle on a Catalog shape, repo staging, image + init, scratch volume, billing session, account-scoped concurrency) and inbound surfaces (REST API, SSH broker, preview URLs, MCP). v1 ships all surfaces; Hive drives the substrate exclusively through them. The split leaves room for an outbound-driven consumer without unwinding lifecycle or billing.
Design
Scopes and isolation
A sandbox is owned by an Account. Visibility, billing, quotas, concurrency, and audit live at the account level. Two other fields are deliberately separate from ownership: project_id is a reporting / grouping tag (no isolation or access implications); grants is the access-control field — the caller declares (project_id, scopes) pairs at create time and the server mints short-lived scoped tokens for the sandbox to use (see Security). For Hive: the owner is the tuist account, project_id reports the downstream project, grants requests scoped cache + registry access on the relevant Tuist-owned project.
The dispatcher reads the profile, derives platform (linux or macos), and schedules the pod on a node with capacity in that pool. Both pools are part of the same cluster.
Schema (sketch)
Two new Postgres tables: sandboxes (account_id, project_id, profile_id, platform, status enum [pending, ready, terminating, terminated, failed], ready_at, terminated_at, deadline_at, pod handle, ssh_endpoint, preview_host) and sandbox_sessions (append-only billing intervals mirroring runner_sessions). Sandbox profiles live in runner_profiles with a new kind column. Full DDL, indexes, and the analytics view live in the PR.
API surface
The REST surface below is one interface to the capability; MCP and CLI mirror it. All endpoints under /api/v1/sandboxes. Account agent token auth (same pipeline as v1); authorization via Tuist.Authorization. OpenAPI spec extended next to v1; Swift client regenerated with mise run generate-api-cli-code. Surface is platform-identical; the response includes platform.
POST /api/v1/sandboxes
  body: { profile: "tuist-linux-medium" | "tuist-macos-16-large",
          deadline_minutes?: 60,
          project_id?: uuid,   (reporting tag)
          grants?: [{ project_id: uuid, scopes: [...] }] }
  201: { id, status: "pending", platform, ssh_endpoint: null, deadline_at, grants }
GET /api/v1/sandboxes/:id
GET /api/v1/sandboxes
DELETE /api/v1/sandboxes/:id
POST /api/v1/sandboxes/:id/exec
  body: { command, args, env?, stdin?, timeout_seconds? }
  200: { exit_code, stdout, stderr }
WS /api/v1/sandboxes/:id/exec/stream (PTY)
GET/PUT/DELETE /api/v1/sandboxes/:id/files/*path
POST /api/v1/sandboxes/:id/ports
  body: { port }
  201: { url: "https://<sub>.preview.tuist.dev" }
POST /api/v1/sandboxes/:id/ssh_tokens
  201: { username, host, port, private_key_pem, expires_at }
(Snapshot is an internal mechanism for idle offload in v1; no customer endpoint. See Lifecycle.)
POST /ssh_tokens mints a short-lived (default 60-minute) credential bound to one sandbox. Customers point their SSH client, IDE Remote-SSH (VS Code, Cursor), or the tuist sandbox ssh CLI at the returned host:port, which terminates on the same brokers.tuist.dev TCP proxy used internally for exec. The token authenticates the SSH session to one specific sandbox pod; pod network stays internal.
MCP tools (Tuist MCP): create_sandbox, exec_in_sandbox, read_sandbox_file, write_sandbox_file, destroy_sandbox (1:1 with REST); wait_sandbox_ready (polling sugar over GET /sandboxes/:id, not its own endpoint). Same auth and quota as REST.
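Since wait_sandbox_ready is described as polling sugar over GET /sandboxes/:id, one possible shape for it is sketched below. The `get_sandbox` callable, the timeout and interval defaults, and the choice to raise on terminal statuses are assumptions for illustration; the spec does not define them.

```python
import time
from typing import Callable


def wait_sandbox_ready(
    get_sandbox: Callable[[str], dict],
    sandbox_id: str,
    timeout_seconds: float = 300.0,
    poll_interval_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the sandbox is ready; raise on a terminal status or timeout."""
    deadline = clock() + timeout_seconds
    while True:
        # get_sandbox stands in for GET /api/v1/sandboxes/:id.
        sandbox = get_sandbox(sandbox_id)
        status = sandbox["status"]
        if status == "ready":
            return sandbox
        if status in ("failed", "terminating", "terminated"):
            raise RuntimeError(f"sandbox {sandbox_id} ended in status {status}")
        if clock() >= deadline:
            raise TimeoutError(f"sandbox {sandbox_id} not ready after {timeout_seconds}s")
        sleep(poll_interval_seconds)
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting.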
CLI commands generated from the OpenAPI spec: tuist sandbox create / exec / files / destroy. tuist sandbox ssh is hand-written: mints a token via POST /ssh_tokens then execs ssh.
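The hand-written ssh command could work roughly like the sketch below: take the POST /ssh_tokens response, persist the key with restrictive permissions, and build the ssh invocation. The function name, the key file name, and the use of a temporary directory are assumptions; only the response fields come from the API surface above.

```python
import os
import tempfile
from typing import Optional


def build_ssh_argv(token: dict, key_dir: Optional[str] = None) -> list:
    """Write the private key with 0600 permissions and return the ssh argv."""
    key_dir = key_dir or tempfile.mkdtemp()  # mkdtemp creates a 0700 directory
    key_path = os.path.join(key_dir, "sandbox_key.pem")
    # O_EXCL avoids clobbering an existing key; ssh rejects private keys
    # that are readable by other users, hence mode 0o600.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as key_file:
        key_file.write(token["private_key_pem"])
    return [
        "ssh",
        "-i", key_path,
        "-o", "IdentitiesOnly=yes",
        "-p", str(token["port"]),
        f'{token["username"]}@{token["host"]}',
    ]
```

A caller would then hand the list to `os.execvp("ssh", argv)` and remove the key file once `expires_at` has passed.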
Authentication and authorization
- Account agent tokens authorize sandbox calls (project tokens are deprecated; not accepted here).
Tuist.Authorization enforces feature flag, quota, plan tier.
- SSH to the sandbox pod is brokered via a server-side TCP proxy (brokers.tuist.dev) keyed by sandbox id + per-sandbox token. Pod network stays internal; no NodePorts, no per-sandbox public IPs. The broker is platform-agnostic.
Security
- Isolation posture: same as runners. Linux pods use the default Kubernetes runtime (runc); macOS pods are full VMs. Runner threat model already covers untrusted user-supplied code. If sandbox-specific threats emerge, Linux can upgrade to gVisor or Kata; macOS is already as strong as we can get.
- In-sandbox authentication to Tuist services: handled exclusively through grants on POST /sandboxes. The server mints one short-lived account agent token per grant, scope-restricted to (project_id, scopes), lifetime bounded by the sandbox deadline, mounted at a documented path the Tuist CLI reads. v1: every grant must reference a project owned by the same account as the sandbox.
- SSH broker tokens: per-sandbox, short-lived, high-entropy, never logged. A leak compromises one sandbox for the token’s lifetime.
- Preview URLs: HTTP services exposed via POST /ports go through a Tuist-authenticated wrapper, not raw subdomain access. Egress from preview-served pages is contained so the sandbox cannot SSRF into internal Tuist networks.
- Audit: each grant logs as “sandbox <id> received scopes [...] on project <id> via caller <id>,” surfaced to project owners.
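The v1 same-account rule for grants and the audit line quoted above could be checked and produced along these lines. The `Grant` type, the `project_owner` lookup, the rejection of empty scope lists, and any scope names used when calling it are assumptions for the sketch, not the actual Tuist.Authorization behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    project_id: str
    scopes: tuple


def validate_grants(sandbox_account_id: str, grants: list, project_owner: dict) -> list:
    """Return human-readable errors; an empty list means all grants are allowed.

    project_owner maps project_id -> owning account_id (a stand-in for a DB lookup).
    """
    errors = []
    for grant in grants:
        owner = project_owner.get(grant.project_id)
        if owner is None:
            errors.append(f"unknown project {grant.project_id}")
        elif owner != sandbox_account_id:
            # v1 rule: grants may only reference same-account projects.
            errors.append(f"project {grant.project_id} belongs to another account")
        elif not grant.scopes:
            # Assumption: a grant with no scopes is rejected rather than ignored.
            errors.append(f"grant on {grant.project_id} requests no scopes")
    return errors


def audit_line(sandbox_id: str, grant: Grant, caller_id: str) -> str:
    """Format the audit entry in the shape the spec quotes."""
    scopes = ", ".join(grant.scopes)
    return (
        f"sandbox {sandbox_id} received scopes [{scopes}] "
        f"on project {grant.project_id} via caller {caller_id}"
    )
```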
Feature gating
Tuist.FeatureFlags.sandboxes_enabled?(account) gates the surface.
- Per-account sandbox_concurrent_limit (default 1; raised for the tuist account so Hive can run many parallel agents).
- Per-account sandbox_monthly_minutes (soft cap; for dashboards).
- Optional per-account allowlist by platform.
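A minimal sketch of how these gates might combine at create time follows. The `AccountLimits` type, the rejection reason strings, and the order of the checks are assumptions. sandbox_monthly_minutes is left out on purpose because the spec describes it as a soft, dashboard-only cap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccountLimits:
    sandboxes_enabled: bool
    sandbox_concurrent_limit: int = 1              # spec default
    allowed_platforms: Optional[frozenset] = None  # None means no allowlist


def check_create_gates(limits: AccountLimits, active_sandboxes: int, platform: str) -> Optional[str]:
    """Return None when the create call may proceed, else a rejection reason."""
    if not limits.sandboxes_enabled:
        return "feature_disabled"
    if limits.allowed_platforms is not None and platform not in limits.allowed_platforms:
        return "platform_not_allowed"
    if active_sandboxes >= limits.sandbox_concurrent_limit:
        return "concurrent_limit_reached"
    return None
```

Any non-None reason would map to the 4xx rejection on POST /sandboxes.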
Telemetry and billing
Reuse the per-account billing and concurrency primitives from Tuist.Runners. Sandbox sessions are append-only intervals tagged with platform; per-platform cost dashboards drop out of the same query. (Open/close anchors and the per-kind session-lifetime clamp are spelled out in Lifecycle.) Telemetry under [:tuist, :sandbox, ...] carries platform; Prometheus exposes provisioning-time, exec duration, and active-sandbox gauges per account and platform.
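To illustrate why per-platform cost falls out of one pass over the ledger, the sketch below sums append-only intervals by platform. The dict-shaped rows and the choice to bill open sessions up to `now` are assumptions; rounding and pricing are not modeled.

```python
from collections import defaultdict
from datetime import datetime, timezone


def billable_minutes_by_platform(sessions: list, now: datetime) -> dict:
    """Sum closed and still-open intervals per platform, in minutes."""
    totals = defaultdict(float)
    for session in sessions:
        # An open session (ended_at is None) is counted up to `now`.
        ended_at = session["ended_at"] or now
        seconds = (ended_at - session["started_at"]).total_seconds()
        totals[session["platform"]] += max(0.0, seconds) / 60.0
    return dict(totals)
```

In production the same aggregation would presumably be a GROUP BY over sandbox_sessions rather than application code.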
Lifecycle and state machines
Sandbox
| From -> To | Trigger | Driver | Failure mode | Billing / state rule |
| --- | --- | --- | --- | --- |
| (none) -> pending | POST /sandboxes | Sandboxes.create | quota/flag rejection -> 4xx | no session row yet |
| pending -> ready | warm-pool claim or cold-boot complete; SSH up | Provisioner | timeout -> failed | session opens at ready_at; provisioning kind logged (warm_hit / cold_boot) |
| pending -> failed | provisioning timeout / pod create / image pull | Provisioner | terminal | no session opens |
| ready -> terminating | DELETE | Sandboxes.destroy | none | session.ended_at = now(); reason explicit_delete |
| ready -> terminating | deadline_at < now() | Reaper (Oban cron) | none | session.ended_at = now(); reason deadline |
| ready -> terminating | observed pod terminal phase | Sandbox Pod Watcher (k8s informer + /api/internal/sandboxes/pods/stopped, mirrors runner controller) | abnormal-exit handler | session.ended_at = pod.finishedAt; reason: oom_killed/node_loss/guest_crash/etc. |
| terminating -> terminated | pod delete ack | Terminator (Oban) | transient k8s errors retry; persistent failures alert | session already closed in the From transition |
| ready -> suspending (Linux v1) | idle: no exec / SSH / port for T (default 5 min) | Idle Detector (Oban) | failure -> ready, alert | session continues |
| suspending -> suspended | snapshot stored; pod released | Suspender (Oban) | failure -> ready | session.ended_at=now(); reason idle_offload; storage line opens |
| suspended -> restoring | activity (exec / SSH / port) | Sandboxes.wake | snapshot expired -> failed | storage line continues |
| restoring -> ready | snapshot restored; SSH up | Provisioner | timeout -> failed | storage line closes; new session row opens; deadline clock resumes |
| suspended -> terminating | deadline / DELETE / 7-day cap | Reaper / destroy | none | storage line closes |
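The transition table can also be read as a lookup structure. The sketch below encodes the rows, including the failure fallbacks, as an allowed-transition map, plus a helper for when a compute session is open. The names are illustrative, and treating "snapshot expired" as a direct suspended -> failed edge is an interpretation of the failure-mode column.

```python
# Keys are current states (None = not yet created); values are reachable states.
ALLOWED_TRANSITIONS = {
    None: {"pending"},
    "pending": {"ready", "failed"},
    "ready": {"terminating", "suspending"},
    "suspending": {"suspended", "ready"},          # failure falls back to ready
    "suspended": {"restoring", "terminating", "failed"},
    "restoring": {"ready", "failed"},
    "terminating": {"terminated"},
    "terminated": set(),
    "failed": set(),
}


def can_transition(current, target) -> bool:
    """True when the table lists a row (or failure mode) from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def session_is_open(status) -> bool:
    """A compute session row is open from ready_at until suspend or terminate."""
    return status in {"ready", "suspending"}
```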
Suspend/wake fidelity
| Platform | Mechanism | Captured | On wake | v1 status |
| --- | --- | --- | --- | --- |
| Linux | volume snapshot (CSI; CoW) | persistent volume contents | files restored; processes gone; agent restarts fresh | shipped |
| macOS | would be full-VM via tart pause-copy-resume | memory + processes + filesystem | exact paused state; processes resume | gated (Goal 12); multi-GB memory transfer to/from object storage is the unproven cost. |
Linux suspension is non-disruptive (CoW). macOS suspension is not in v1; if it ships, the VM pauses during the dump (ready -> suspending, not in-place).
Suspension snapshot storage
Suspension snapshots live in object storage (S3-compatible). One per suspended sandbox, bounded by the sandbox’s effective deadline and the 7-day absolute cap. Storage is a separate billing line that accrues only while suspended.
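The interaction between the paused deadline clock and the 7-day cap can be worked through with a small calculation. The sketch assumes the cap is measured from sandbox creation and that suspended time is tracked as (start, end) intervals; neither detail is stated in the spec.

```python
from datetime import datetime, timedelta, timezone

ABSOLUTE_CAP = timedelta(days=7)


def effective_deadline(created_at, deadline_at, suspended_intervals, now):
    """Shift the deadline by total suspended time, bounded by the 7-day cap.

    suspended_intervals is a list of (start, end) pairs; end is None while
    the sandbox is still suspended, in which case time up to `now` counts.
    """
    paused = timedelta(0)
    for start, end in suspended_intervals:
        paused += (end or now) - start
    # Assumption: the absolute cap is anchored at creation time.
    return min(deadline_at + paused, created_at + ABSOLUTE_CAP)
```

A sandbox with a 60-minute deadline that spent 30 minutes suspended would therefore expire 90 minutes after creation, while one left suspended for weeks is still reaped at the 7-day mark, which also bounds how long its snapshot is stored.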
Warm pool
Per-shape warm_replicas. Controller keeps N pre-bootstrapped pods per shape in the dedicated pool. POST /sandboxes atomically claims a warm pod (Postgres UPDATE ... RETURNING); per-sandbox bootstrap (repo clone, grants mount) runs in the warm pod; ready_at lands in seconds. Miss -> cold boot. Replenishment is background. Image rolls drain existing warm pods (serve until claimed, then retire). Restore (suspended -> ready) does NOT use the warm pool: it needs the suspended sandbox’s state, not a blank base image.
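The atomic claim can be illustrated in two parts: a Postgres statement in the UPDATE ... RETURNING style the text mentions, and an in-memory analog that shows the hit/miss behavior. The `warm_pods` table, its columns, the `WarmPool` class, and the use of FOR UPDATE SKIP LOCKED are assumptions for the sketch. The SQL is shown for reference and is not executed here.

```python
import threading
from collections import defaultdict, deque
from typing import Optional

# Hypothetical table and column names; psycopg-style named placeholders.
CLAIM_SQL = """
UPDATE warm_pods
SET claimed_by = %(sandbox_id)s
WHERE id = (
    SELECT id FROM warm_pods
    WHERE shape = %(shape)s AND claimed_by IS NULL
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


class WarmPool:
    """In-memory analog: a lock plays the role of the row-level lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = defaultdict(deque)  # shape -> idle warm pod ids

    def add(self, shape: str, pod_id: str) -> None:
        with self._lock:
            self._idle[shape].append(pod_id)

    def claim(self, shape: str) -> Optional[str]:
        """Return a warm pod id, or None on a miss."""
        with self._lock:
            pods = self._idle[shape]
            return pods.popleft() if pods else None


def provision(pool: WarmPool, shape: str) -> tuple:
    """Claim a warm pod if one exists; otherwise fall back to cold boot."""
    pod_id = pool.claim(shape)
    if pod_id is not None:
        return pod_id, "warm_hit"
    return f"cold-{shape}", "cold_boot"
```

The returned kind matches the warm_hit / cold_boot label logged on the pending -> ready transition.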
Open questions
- SSH broker design: single multi-tenant TCP broker, per-sandbox token. Open: rotation cadence, broker split per pool/pod.
- Hive-to-Tuist auth: does Hive impersonate the requesting user in audit? Cleaner for per-user audit, needs token-exchange. Start with no impersonation; logs attribute to bot.
- Quota model: concurrent cap plus dashboard-only monthly cap, or server-enforced monthly compute-minutes ceiling?
- Pricing surface: separate Billing line item or rolled into runners line? Per-platform tier or flat?
- Egress policy: unrestricted egress (matches external vendors and what agents need) or stable egress via the stable-egress-controller with per-account opt-in?
- Cross-account grants: v1 restricts every grant to same-account. Future version: project owners pre-authorize specific accounts (e.g. tuist) to mint scoped tokens; opt-in UI, scope review, revocation in a follow-up spec.
Done when
Under the tuist account, Hive can drive a sandbox end-to-end on either platform: create with profile, exec, file I/O, expose port, destroy. sandbox_sessions records the interval. The Sandbox Pod Watcher closes the session on any pod terminal phase; the reaper terminates deadline-exceeded sandboxes within one cron tick.
A Hive coding-loop workflow takes a spec, spins a sandbox under tuist on the matching platform, drives a build loop, opens a PR, closes the sandbox; no human in the loop.
Third-party agents drive the same flow via the Tuist MCP tools.
tuist sandbox ssh <id> + VS Code Remote SSH work via credentials from POST /ssh_tokens.
When desired Pods exceed host capacity for K intervals, the alerter posts to #ops and the operator orders capacity. macOS sandboxes pass the soak (zero kernel panics, recovery under T seconds, K consecutive runs).
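The soak pass criteria could be evaluated mechanically, for instance as below. The `SoakRun` fields, the threshold parameters, and reading "K consecutive runs" as "the most recent K runs all pass" are assumptions made for the sketch; M, T, K and the memory ceiling are left as parameters because the spec does not fix them.

```python
from dataclasses import dataclass


@dataclass
class SoakRun:
    kernel_panics: int
    peak_memory_gb: float
    recovery_seconds: float


def run_passes(run: SoakRun, memory_ceiling_gb: float, max_recovery_seconds: float) -> bool:
    """Zero panics, peak memory under the ceiling, recovery under T seconds."""
    return (
        run.kernel_panics == 0
        and run.peak_memory_gb < memory_ceiling_gb
        and run.recovery_seconds < max_recovery_seconds
    )


def soak_gate_passes(runs: list, k: int, memory_ceiling_gb: float, max_recovery_seconds: float) -> bool:
    """True when the most recent k runs all pass."""
    if k <= 0 or len(runs) < k:
        return False
    return all(run_passes(r, memory_ceiling_gb, max_recovery_seconds) for r in runs[-k:])
```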
A Linux sandbox idle for 5 minutes auto-suspends (snapshot stored, pod released, compute and deadline paused) and wakes on next activity. The macOS suspend/wake prototype produces a cycle-time report + feasibility recommendation; the realistic outcome may be deferred indefinitely.