Account-scoped sandbox API

#2 · Public · Created directly

Proposed
Proposal

Why

Tuist already runs its own compute fleet on Kubernetes. Today it serves a single workload: CI runners. The fleet has two pools, both schedulable through the standard Kubernetes API: a Linux pool of standard container nodes, and a macOS pool of Mac mini nodes where Pods boot as full macOS VMs with Xcode preinstalled.

Both pools already produce shapes through Tuist.Runners.Catalog. The same substrate (Catalog, image bases, scheduler primitives, Postgres+ClickHouse lifecycle) can host a second workload: long-lived, account-scoped sandbox environments. In v1 sandboxes run on a dedicated subset of nodes within each pool, isolated from CI capacity.

This spec proposes exposing that substrate as a sandbox API. Customers create sandboxes through REST and the CLI; AI agents drive them through MCP. The first consumer is Hive (see below), running under the tuist account.

The conceptual delta from runners to sandboxes is small. Runners are single-shot JIT-claimed GitHub workers. Sandboxes are interactive, alive until deadline or explicit destroy. The per-account profile model (Tuist.Runners.Profiles), the claim primitive (Tuist.Runners.Claims), and the append-only session ledger (Tuist.Runners.RunnerSessions) translate one-to-one.

The macOS+Xcode pool is also a differentiator for outbound-driven workers (e.g. Anthropic Managed Agents, where every external provider is Linux-only); see Layers.

First use case: agentic coding workflows in Hive

The sandbox API is the execution substrate for Hive’s agentic coding workflows. Hive orchestrates product specs and hosts connected agents (Claude Code, in-process workers later). It cannot yet run the code those agents produce against a real toolchain; the loop closes when the agent can invoke the build, observe output, iterate, and check in. Android and Elixir work runs on Linux; iOS/macOS work, including customer Apple projects, runs on macOS+Xcode.

Hive itself is the customer. It calls the sandbox endpoints under the tuist account (a Tuist-internal service, so its usage is billed and quota-capped against Tuist itself, not any end-user account). Concretely:

Hive picks up a task, selects a profile by platform, and POSTs to /api/v1/sandboxes with a Tuist-owned account agent token. The pod boots on the matching pool with the repo cloned; the agent runs a tight POST /exec loop (install, build, parse, edit via PUT /files/*, retry). On convergence the agent opens a PR, Hive DELETEs the sandbox, and sandbox_sessions records the billable interval under tuist.

Reference: programmatic APIs we learn from

Surveyed E2B, Daytona, Cloudflare, Vercel. Convergent verbs: create, exec, file I/O, expose port, snapshot, destroy, shell. Distinct capabilities tracked as roadmap items: code-interpreter abstractions, persistent volumes, fs watching, outbound credential controls.

Goals

  1. A REST surface (/api/v1/sandboxes/*) to create, exec into, file-sync, port-expose, and destroy ephemeral sandboxes on our own fleet.
  2. Linux and macOS as first-class platforms, scheduling on a dedicated sandbox pool in v1 (a cordoned subset of nodes within the existing fleet, isolated from CI capacity). Share with CI later when sandboxes are proven stable and external demand justifies the risk.
  3. Capacity-planning signal: piggy-back on the existing runner autoscaler. It already computes desired = max(claimed+queued, p95_last_hour) + warm-pool floor. Mechanism: an alerter watches for sustained desired > host slots or NoAvailableHost events on ScalewayAppleSiliconMachine CRs and posts to #ops Slack with shortfall + p95 + proposed host quantity. Operator orders manually (Scaleway console + bump runnersFleet.hostCount); no autonomous procurement. (No admission control in v1.)
  4. Warm-image provisioning on macOS: a small operator-maintained pool of pre-warmed VM images per Catalog shape (OS + Xcode + base deps resident) so POST /sandboxes resolves in seconds. Cold boot is the fallback when the warm pool is empty. Invisible to customers; gating for macOS viability.
  5. macOS long-lived guest stability is gating. Today production CI traffic is the only validator. Mechanism: a synthetic harness on a dedicated Mac mini runs M concurrent Tart VMs for the target deadline window (initial 24h) on a representative workload (idle, xcodebuild bursts, file I/O, preview-port traffic); monitors kernel-panic logs (/Library/Logs/DiagnosticReports/Kernel*), vm_stat memory pressure, guest recovery time. Pass = zero panics, peak memory under ceiling, recovery under T seconds, K consecutive runs. Report posts to the PR. Don’t bypass with a feature-flag rollout.
  6. Account-scoped profiles reusing Tuist.Runners.Catalog shapes and the runner_profiles table; profile names imply the platform.
  7. Append-only session ledger mirroring runner_sessions so billing math is one query away across both platforms.
  8. Hive integration under the tuist account as the inaugural client, proving the API end to end against build loops on both Linux and macOS.
  9. MCP tools mirroring the REST surface so third-party agents can drive sandboxes the same way Hive does.
  10. CLI commands (tuist sandbox create / exec / files / ssh / destroy) generated from OpenAPI; ssh is hand-written and supports IDE Remote-SSH (VS Code, Cursor).
  11. Idle offload (Linux in v1): when a ready sandbox sees no POST /exec / SSH / port traffic for 5 min (configurable), suspend (volume snapshot + pod release) and resume on next activity. Compute billing pauses while suspended; small snapshot-storage line accrues. Deadline clock pauses while suspended; 7-day cap bounds lifetime.
  12. macOS suspend/wake is a gated research goal. Constraint: tart exposes only full pause-copy-resume; macOS Virtualization.framework has no incremental memory snapshots today (Firecracker dirty-page tracking is Linux-only). Mechanism: a benchmark harness on a Mac mini + object storage measures pause-copy-resume cycle time across VM memory sizes (4/8/16 GB), compression (none/zstd), and transfer parallelism. Output: cycle-time-vs-size report + feasibility recommendation. Realistic outcome: macOS suspend/wake may be deferred indefinitely (constraint is upstream). Until then, macOS sandboxes have no idle offload.
  13. Feature-flag gated (:sandboxes) so we can stage rollout per account.
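The capacity-planning signal in Goal 3 can be sketched as follows. This is a minimal illustration, assuming the formula groups as max(claimed + queued, p95_last_hour) plus the warm-pool floor and that "sustained" means the last k samples all exceed host slots; the names Sample, desired_slots, sustained_shortfall, and k are hypothetical and do not come from the existing autoscaler.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One autoscaler observation (field names are illustrative)."""
    claimed: int
    queued: int
    p95_last_hour: int
    warm_pool_floor: int


def desired_slots(s: Sample) -> int:
    # desired = max(claimed + queued, p95_last_hour) + warm-pool floor
    return max(s.claimed + s.queued, s.p95_last_hour) + s.warm_pool_floor


def sustained_shortfall(samples: Sequence[Sample], host_slots: int, k: int) -> int:
    """Return the peak shortfall if the last k samples all exceed host_slots, else 0."""
    if k <= 0 or len(samples) < k:
        return 0
    gaps = [desired_slots(s) - host_slots for s in samples[-k:]]
    # Alert only when every recent sample is over capacity, so a single spike stays quiet.
    return max(gaps) if all(g > 0 for g in gaps) else 0
```

A non-zero return value is what the alerter would post to #ops alongside the p95 and a proposed host quantity; ordering hardware stays a manual operator step.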

Non-goals

  • External backends. We run on our own fleet.
  • A full IDE or desktop UI inside the sandbox. We expose SSH, exec, files, and port preview.
  • Replacing GitHub Actions runners. Runners stay single-shot JIT-claim.
  • Cross-region snapshots. Single-region.
  • A new billing primitive. Reuse the runner_sessions shape.
  • BYOC or on-prem sandboxes.
  • Designing the outbound-driven worker model. The substrate is structured to admit it later; the lifecycle, scheduling, and integration surface for that case are out of scope here.

Reference model (Runners)

The substrate already exists in production for both platforms:

  • Tuist.Runners.Catalog: shape + Xcode-version source-of-truth from Helm. Reused unchanged.
  • Tuist.Runners.Profiles: per-account label-to-shape mapping; sandbox profiles get a new kind column.
  • Tuist.Runners.Claims: the per-account count + advisory-lock primitive is reused; the claim row is not (sandboxes have no shared job to collapse against).
  • Tuist.Runners.RunnerSessions: append-only (account_id, started_at, ended_at) intervals; same shape for sandbox billing.
  • Tuist.Kubernetes.Client: cluster control plane. Sandbox pods schedule on a dedicated subset of nodes (sandbox-specific nodeSelector + taint in v1); image base shared, capacity not.
  • Sandbox pods share the runner image base; init differs: open SSH, mount scratch volume, report liveness, hold until terminate (vs. runners’ single-shot actions-runner).

Delta from runners to sandboxes (same delta on both platforms):

  • Runners are single-shot. Sandboxes are interactive, alive until deadline.
  • Runner concurrency is per-profile per pool. Sandbox concurrency is per-account.
  • Runners are billed per workflow_job. Sandboxes are billed per session.

Layers

The capability splits into a provisioning substrate (pod lifecycle on a Catalog shape, repo staging, image + init, scratch volume, billing session, account-scoped concurrency) and inbound surfaces (REST API, SSH broker, preview URLs, MCP). v1 ships all surfaces; Hive drives the substrate exclusively through them. The split leaves room for an outbound-driven consumer without unwinding lifecycle or billing.

Design

Scopes and isolation

A sandbox is owned by an Account. Visibility, billing, quotas, concurrency, and audit live at the account level. Two other fields are deliberately separate from ownership. project_id is a reporting / grouping tag (no isolation or access implications). grants is the access-control field: the caller declares (project_id, scopes) pairs at create time and the server mints short-lived scoped tokens for the sandbox to use (see Security). For Hive: the owner is the tuist account, project_id reports the downstream project, and grants requests scoped cache + registry access on the relevant Tuist-owned project.

The dispatcher reads the profile, derives platform (linux or macos), and schedules the pod on a node with capacity in that pool. Both pools are part of the same cluster.
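As a rough illustration of the dispatch step, the sketch below derives the platform from a profile name and builds scheduling constraints for the dedicated pool. It assumes profile names carry the platform as a dash-separated segment, as in the examples in this spec; the real dispatcher may read the platform from the Catalog shape instead. The label and taint keys (tuist.dev/pool, tuist.dev/workload, tuist.dev/sandbox) are hypothetical placeholders.

```python
from typing import Dict

PLATFORMS = ("linux", "macos")


def platform_for_profile(profile: str) -> str:
    """Derive the platform from a profile name such as "tuist-macos-16-large"."""
    segments = profile.lower().split("-")
    matches = [p for p in PLATFORMS if p in segments]
    if len(matches) != 1:
        raise ValueError(f"cannot derive platform from profile {profile!r}")
    return matches[0]


def scheduling_constraints(profile: str) -> Dict[str, object]:
    """Pod spec fragment pinning a sandbox to the dedicated nodes of its pool."""
    platform = platform_for_profile(profile)
    return {
        # Hypothetical labels: select sandbox-only nodes in the matching pool.
        "nodeSelector": {"tuist.dev/pool": platform, "tuist.dev/workload": "sandbox"},
        # Hypothetical taint key: only sandbox pods tolerate the sandbox taint.
        "tolerations": [
            {"key": "tuist.dev/sandbox", "operator": "Exists", "effect": "NoSchedule"}
        ],
    }
```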

Schema (sketch)

Two new Postgres tables: sandboxes (account_id, project_id, profile_id, platform, status enum [pending, ready, terminating, terminated, failed], ready_at, terminated_at, deadline_at, pod handle, ssh_endpoint, preview_host) and sandbox_sessions (append-only billing intervals mirroring runner_sessions). Sandbox profiles live in runner_profiles with a new kind column. Full DDL, indexes, and the analytics view live in the PR.

API surface

The REST surface below is one interface to the capability; MCP and CLI mirror it. All endpoints under /api/v1/sandboxes. Account agent token auth (same pipeline as v1); authorization via Tuist.Authorization. OpenAPI spec extended next to v1; Swift client regenerated with mise run generate-api-cli-code. Surface is platform-identical; the response includes platform.

POST /api/v1/sandboxes
  body: { profile: "tuist-linux-medium" | "tuist-macos-16-large",
          deadline_minutes?: 60,
          project_id?: uuid,   (reporting tag)
          grants?: [{ project_id: uuid, scopes: [...] }] }
  201:  { id, status: "pending", platform, ssh_endpoint: null, deadline_at, grants }
GET /api/v1/sandboxes/:id
GET /api/v1/sandboxes
DELETE /api/v1/sandboxes/:id
POST /api/v1/sandboxes/:id/exec
  body: { command, args, env?, stdin?, timeout_seconds? }
  200:  { exit_code, stdout, stderr }
WS /api/v1/sandboxes/:id/exec/stream   (PTY)
GET/PUT/DELETE /api/v1/sandboxes/:id/files/*path
POST /api/v1/sandboxes/:id/ports
  body: { port }
  201:  { url: "https://<sub>.preview.tuist.dev" }
POST /api/v1/sandboxes/:id/ssh_tokens
  201:  { username, host, port, private_key_pem, expires_at }

(Snapshot is an internal mechanism for idle offload in v1; no customer endpoint. See Lifecycle.)

POST /ssh_tokens mints a short-lived (default 60-minute) credential bound to one sandbox. Customers point their SSH client, IDE Remote-SSH (VS Code, Cursor), or the tuist sandbox ssh CLI at the returned host:port, which terminates on the same brokers.tuist.dev TCP proxy used internally for exec. The token authenticates the SSH session to one specific sandbox pod; pod network stays internal.

MCP tools (Tuist MCP): create_sandbox, exec_in_sandbox, read_sandbox_file, write_sandbox_file, destroy_sandbox (1:1 with REST); wait_sandbox_ready (polling sugar over GET /sandboxes/:id, not its own endpoint). Same auth and quota as REST.
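Since wait_sandbox_ready is client-side polling over GET /sandboxes/:id, a minimal sketch might look like this. The get_sandbox callable stands in for whatever HTTP client performs the GET; the timeout and interval defaults, and the choice to treat failed, terminating, and terminated as stop conditions, are assumptions rather than part of the spec.

```python
import time
from typing import Any, Callable, Dict

# Statuses from which a sandbox will never become ready (assumed stop conditions).
STOP_STATUSES = {"failed", "terminating", "terminated"}


def wait_sandbox_ready(
    get_sandbox: Callable[[str], Dict[str, Any]],
    sandbox_id: str,
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the sandbox reports status "ready", then return its representation."""
    deadline = time.monotonic() + timeout_s
    while True:
        sandbox = get_sandbox(sandbox_id)  # wraps GET /api/v1/sandboxes/:id
        status = sandbox["status"]
        if status == "ready":
            return sandbox
        if status in STOP_STATUSES:
            raise RuntimeError(f"sandbox {sandbox_id} ended in status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"sandbox {sandbox_id} not ready after {timeout_s}s")
        sleep(interval_s)
```

Injecting sleep keeps the helper testable without real delays; an MCP tool would simply pass the default.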

CLI commands generated from the OpenAPI spec: tuist sandbox create / exec / files / destroy. tuist sandbox ssh is hand-written: mints a token via POST /ssh_tokens then execs ssh.
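The hand-written ssh command boils down to "mint a token, write the key, exec ssh". The sketch below shows the second and third steps under stated assumptions: the token dict has the fields returned by POST /ssh_tokens, the key is written to a temporary file (the real CLI may prefer an agent or another mechanism), and the helper names are hypothetical. The ssh flags used (-i, -p, -o IdentitiesOnly=yes) are standard OpenSSH options.

```python
import os
import tempfile
from typing import Any, Dict, List


def write_private_key(pem: str) -> str:
    """Write the minted key to a private temp file and return its path."""
    fd, path = tempfile.mkstemp(prefix="tuist-sandbox-", suffix=".pem")
    with os.fdopen(fd, "w") as handle:
        handle.write(pem)
    os.chmod(path, 0o600)  # ssh refuses keys readable by other users
    return path


def ssh_argv(token: Dict[str, Any], key_path: str) -> List[str]:
    """Build the ssh invocation for a token returned by POST /ssh_tokens."""
    return [
        "ssh",
        "-i", key_path,
        "-o", "IdentitiesOnly=yes",  # offer only the minted key to the broker
        "-p", str(token["port"]),
        f'{token["username"]}@{token["host"]}',
    ]


def exec_ssh(token: Dict[str, Any]) -> None:
    """Replace the current process with ssh, as the CLI command would."""
    argv = ssh_argv(token, write_private_key(token["private_key_pem"]))
    os.execvp(argv[0], argv)
```

IDE Remote-SSH would use the same host, port, username, and key through its own SSH configuration instead of exec_ssh.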

Authentication and authorization

  • Account agent tokens authorize sandbox calls (project tokens are deprecated; not accepted here).
  • Tuist.Authorization enforces feature flag, quota, plan tier.
  • SSH to the sandbox pod is brokered via a server-side TCP proxy (brokers.tuist.dev) keyed by sandbox id + per-sandbox token. Pod network stays internal; no NodePorts, no per-sandbox public IPs. The broker is platform-agnostic.

Security

  • Isolation posture: same as runners. Linux pods use the default Kubernetes runtime (runc); macOS pods are full VMs. Runner threat model already covers untrusted user-supplied code. If sandbox-specific threats emerge, Linux can upgrade to gVisor or Kata; macOS is already as strong as we can get.
  • In-sandbox authentication to Tuist services: handled exclusively through grants on POST /sandboxes. The server mints one short-lived account agent token per grant, scope-restricted to (project_id, scopes), lifetime bounded by the sandbox deadline, mounted at a documented path the Tuist CLI reads. v1: every grant must reference a project owned by the same account as the sandbox.
  • SSH broker tokens: per-sandbox, short-lived, high-entropy, never logged. A leak compromises one sandbox for the token’s lifetime.
  • Preview URLs: HTTP services exposed via POST /ports go through a Tuist-authenticated wrapper, not raw subdomain access. Egress from preview-served pages is contained so the sandbox cannot SSRF into internal Tuist networks.
  • Audit: each grant logs as “sandbox <id> received scopes [...] on project <id> via caller <id>,” surfaced to project owners.
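The grant rules above (one token per grant, same-account only in v1, lifetime bounded by the sandbox deadline) can be sketched as below. The types, the project_owner lookup, and the 60-minute max_ttl default are hypothetical; the spec only says the tokens are short-lived. Actual token material is omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Grant:
    project_id: str
    scopes: Tuple[str, ...]


@dataclass(frozen=True)
class ScopedToken:
    project_id: str
    scopes: Tuple[str, ...]
    expires_at: datetime


def mint_grant_tokens(
    account_id: str,
    grants: Sequence[Grant],
    project_owner: Dict[str, str],  # project_id -> owning account_id (illustrative lookup)
    now: datetime,
    deadline_at: datetime,
    max_ttl: timedelta = timedelta(minutes=60),  # assumed value, not from the spec
) -> List[ScopedToken]:
    """Mint one scope-restricted token per grant, rejecting cross-account grants."""
    tokens = []
    for grant in grants:
        if project_owner.get(grant.project_id) != account_id:
            raise PermissionError(f"project {grant.project_id} is not owned by {account_id}")
        # The token never outlives the sandbox it was minted for.
        expires_at = min(now + max_ttl, deadline_at)
        tokens.append(ScopedToken(grant.project_id, grant.scopes, expires_at))
    return tokens
```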

Feature gating

  • Tuist.FeatureFlags.sandboxes_enabled?(account) gates the surface.
  • Per-account sandbox_concurrent_limit (default 1; raised for the tuist account so Hive can run many parallel agents).
  • Per-account sandbox_monthly_minutes (soft cap; for dashboards).
  • Optional per-account allowlist by platform.
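A create-time gate combining these checks might look like the following sketch. The AccountLimits type, the check order, and the reason strings are assumptions; the monthly-minutes cap is left out because it is described as a soft, dashboard-only cap.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class AccountLimits:
    sandboxes_enabled: bool = False
    sandbox_concurrent_limit: int = 1  # default from the spec
    allowed_platforms: Optional[FrozenSet[str]] = None  # None means no allowlist


def rejection_reason(limits: AccountLimits, active_sandboxes: int, platform: str) -> Optional[str]:
    """Return None when the create call is admitted, else a reason for the 4xx."""
    if not limits.sandboxes_enabled:
        return "feature_disabled"
    if limits.allowed_platforms is not None and platform not in limits.allowed_platforms:
        return "platform_not_allowed"
    if active_sandboxes >= limits.sandbox_concurrent_limit:
        return "concurrent_limit_reached"
    return None
```

In practice the count and the insert would have to happen under the per-account advisory lock reused from runners, so that two concurrent creates cannot both pass the limit check.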

Telemetry and billing

Reuse the per-account billing and concurrency primitives from Tuist.Runners. Sandbox sessions are append-only intervals tagged with platform; per-platform cost dashboards drop out of the same query. (Open/close anchors and the per-kind session-lifetime clamp are spelled out in Lifecycle.) Telemetry under [:tuist, :sandbox, ...] carries platform; Prometheus exposes provisioning-time, exec duration, and active-sandbox gauges per account and platform.
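To make the billing math concrete, here is a minimal sketch of summing billable time from append-only intervals within a reporting window, with a per-kind lifetime clamp for sessions that were never closed. It is an illustration in Python, not the Tuist.Runners implementation; the function name and the clamp passed by the caller are assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Interval = Tuple[datetime, Optional[datetime]]  # (started_at, ended_at or None if open)


def billable_milliseconds(
    sessions: Iterable[Interval],
    window_start: datetime,
    window_end: datetime,
    now: datetime,
    max_session_lifetime: timedelta,  # per-kind clamp, chosen by the caller
) -> int:
    """Sum the overlap between each session interval and the reporting window."""
    total = timedelta(0)
    for started_at, ended_at in sessions:
        end = ended_at if ended_at is not None else now
        # Safety net: an interval that was never closed cannot bill past the clamp.
        end = min(end, started_at + max_session_lifetime)
        overlap = min(end, window_end) - max(started_at, window_start)
        if overlap > timedelta(0):
            total += overlap
    return total // timedelta(milliseconds=1)
```

Filtering the input by platform before calling the function yields the per-platform numbers; the clamp only matters for intervals the watcher and reaper both failed to close.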

Lifecycle and state machines

Sandbox

From -> To | Trigger | Driver | Failure mode | Billing / state rule
--- | --- | --- | --- | ---
(none) -> pending | POST /sandboxes | Sandboxes.create | quota/flag rejection -> 4xx | no session row yet
pending -> ready | warm-pool claim or cold-boot complete; SSH up | Provisioner | timeout -> failed | session opens at ready_at; provisioning kind logged (warm_hit / cold_boot)
pending -> failed | provisioning timeout / pod create / image pull | Provisioner | terminal | no session opens
ready -> terminating | DELETE | Sandboxes.destroy | none | session.ended_at = now(); reason explicit_delete
ready -> terminating | deadline_at < now() | Reaper (Oban cron) | none | session.ended_at = now(); reason deadline
ready -> terminating | observed pod terminal phase | Sandbox Pod Watcher (k8s informer + /api/internal/sandboxes/pods/stopped, mirrors runner controller) | abnormal-exit handler | session.ended_at = pod.finishedAt; reason: oom_killed/node_loss/guest_crash/etc.
terminating -> terminated | pod delete ack | Terminator (Oban) | transient k8s errors retry; persistent failures alert | session already closed in the From transition
ready -> suspending (Linux v1) | idle: no exec / SSH / port for T (default 5 min) | Idle Detector (Oban) | failure -> ready, alert | session continues
suspending -> suspended | snapshot stored; pod released | Suspender (Oban) | failure -> ready | session.ended_at = now(); reason idle_offload; storage line opens
suspended -> restoring | activity (exec / SSH / port) | Sandboxes.wake | snapshot expired -> failed | storage line continues
restoring -> ready | snapshot restored; SSH up | Provisioner | timeout -> failed | storage line closes; new session row opens; deadline clock resumes
suspended -> terminating | deadline / DELETE / 7-day cap | Reaper / destroy | none | storage line closes
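The transition table can be encoded as data so that drivers reject edges the table does not list. The sketch below is one possible encoding: failure edges are folded in as read from the "Failure mode" column (for example suspending -> ready, and suspended -> failed on an expired snapshot, which is an interpretation), and the helper names are hypothetical.

```python
from typing import Dict, Optional, Set

TRANSITIONS: Dict[str, Set[str]] = {
    "pending": {"ready", "failed"},
    "ready": {"terminating", "suspending"},
    "suspending": {"suspended", "ready"},  # snapshot failure falls back to ready
    "suspended": {"restoring", "terminating", "failed"},  # failed: snapshot expired
    "restoring": {"ready", "failed"},
    "terminating": {"terminated"},
    "terminated": set(),
    "failed": set(),
}

# Edges on which a billing session row opens or closes, per the last table column.
SESSION_OPENS = {("pending", "ready"), ("restoring", "ready")}
SESSION_CLOSES = {("ready", "terminating"), ("suspending", "suspended")}


def can_transition(current: str, target: str) -> bool:
    return target in TRANSITIONS.get(current, set())


def session_effect(current: str, target: str) -> Optional[str]:
    """Return "open", "close", or None for a valid edge; raise on an invalid one."""
    if not can_transition(current, target):
        raise ValueError(f"invalid transition {current} -> {target}")
    if (current, target) in SESSION_OPENS:
        return "open"
    if (current, target) in SESSION_CLOSES:
        return "close"
    return None
```

Because terminating -> terminated has no session effect, the billing interval is already closed by the time the pod delete is acknowledged, which matches the table.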

Suspend/wake fidelity

Platform | Mechanism | Captured | On wake | v1 status
--- | --- | --- | --- | ---
Linux | volume snapshot (CSI; CoW) | persistent volume contents | files restored; processes gone; agent restarts fresh | shipped
macOS | would be full-VM via tart pause-copy-resume | memory + processes + filesystem | exact paused state; processes resume | gated (Goal 12); multi-GB memory transfer to/from object storage is the unproven cost

On Linux, taking the snapshot is non-disruptive (CoW); processes are lost only when the pod is released. macOS suspension is not in v1; if it ships, the VM pauses for the duration of the dump (ready -> suspending, not in place).

Suspension snapshot storage

Suspension snapshots live in object storage (S3-compatible). One per suspended sandbox, bounded by the sandbox’s effective deadline and the 7-day absolute cap. Storage is a separate billing line that accrues only while suspended.

Warm pool

Per-shape warm_replicas. Controller keeps N pre-bootstrapped pods per shape in the dedicated pool. POST /sandboxes atomically claims a warm pod (Postgres UPDATE ... RETURNING); per-sandbox bootstrap (repo clone, grants mount) runs in the warm pod; ready_at lands in seconds. Miss -> cold boot. Replenishment is background. Image rolls drain existing warm pods (serve until claimed, then retire). Restore (suspended -> ready) does NOT use the warm pool: it needs the suspended sandbox’s state, not a blank base image.
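The claim-or-cold-boot decision can be illustrated with an in-memory stand-in. In the design above, atomicity comes from a Postgres UPDATE ... RETURNING; the lock-protected queue below only models that property so the control flow is visible. WarmPool, provision, and the cold_boot callable are hypothetical names.

```python
import threading
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple


class WarmPool:
    """In-memory model of per-shape warm pods; a lock stands in for the atomic UPDATE."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pods: Dict[str, Deque[str]] = {}

    def add(self, shape: str, pod_id: str) -> None:
        with self._lock:
            self._pods.setdefault(shape, deque()).append(pod_id)

    def claim(self, shape: str) -> Optional[str]:
        # Each warm pod is handed out at most once, even under concurrent creates.
        with self._lock:
            pods = self._pods.get(shape)
            return pods.popleft() if pods else None


def provision(pool: WarmPool, shape: str, cold_boot: Callable[[str], str]) -> Tuple[str, str]:
    """Return (pod_id, provisioning_kind), where kind is warm_hit or cold_boot."""
    pod_id = pool.claim(shape)
    if pod_id is not None:
        return pod_id, "warm_hit"
    return cold_boot(shape), "cold_boot"
```

The returned kind corresponds to the provisioning kind logged on the pending -> ready edge; replenishment and image-roll draining are left out of the sketch.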

Open questions

  • SSH broker design: single multi-tenant TCP broker, per-sandbox token. Open: rotation cadence, broker split per pool/pod.
  • Hive-to-Tuist auth: does Hive impersonate the requesting user in audit? Cleaner for per-user audit, needs token-exchange. Start with no impersonation; logs attribute to bot.
  • Quota model: concurrent cap plus dashboard-only monthly cap, or server-enforced monthly compute-minutes ceiling?
  • Pricing surface: separate Billing line item or rolled into runners line? Per-platform tier or flat?
  • Egress policy: unrestricted egress (matches external vendors and what agents need) or stable egress via the stable-egress-controller with per-account opt-in?
  • Cross-account grants: v1 restricts every grant to same-account. Future version: project owners pre-authorize specific accounts (e.g. tuist) to mint scoped tokens; opt-in UI, scope review, revocation in a follow-up spec.

Done when

Under the tuist account, Hive can drive a sandbox end-to-end on either platform: create with profile, exec, file I/O, expose port, destroy. sandbox_sessions records the interval. The Sandbox Pod Watcher closes the session on any pod terminal phase; the reaper terminates deadline-exceeded sandboxes within one cron tick.

A Hive coding-loop workflow takes a spec, spins a sandbox under tuist on the matching platform, drives a build loop, opens a PR, closes the sandbox; no human in the loop.

Third-party agents drive the same flow via the Tuist MCP tools.

tuist sandbox ssh <id> + VS Code Remote SSH work via credentials from POST /ssh_tokens.

When desired Pods exceed host capacity for K intervals, the alerter posts to #ops and the operator orders capacity. macOS sandboxes pass the soak (zero kernel panics, recovery under T seconds, K consecutive runs).
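The soak gate for macOS can be expressed as a small predicate over harness results. This sketch reads "K consecutive runs" as "the most recent K runs all pass"; the SoakRun fields and threshold parameters are hypothetical stand-ins for the ceiling, T, and K that the spec leaves unset.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SoakRun:
    kernel_panics: int
    peak_memory_bytes: int
    recovery_seconds: float


def run_passes(run: SoakRun, memory_ceiling_bytes: int, max_recovery_seconds: float) -> bool:
    # Pass = zero panics, peak memory under ceiling, recovery under T seconds.
    return (
        run.kernel_panics == 0
        and run.peak_memory_bytes < memory_ceiling_bytes
        and run.recovery_seconds < max_recovery_seconds
    )


def soak_gate_open(
    runs: Sequence[SoakRun],
    memory_ceiling_bytes: int,
    max_recovery_seconds: float,
    k: int,
) -> bool:
    """True only when the latest k runs exist and every one of them passes."""
    if k <= 0 or len(runs) < k:
        return False
    return all(run_passes(r, memory_ceiling_bytes, max_recovery_seconds) for r in runs[-k:])
```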

A Linux sandbox idle for 5 minutes auto-suspends (snapshot stored, pod released, compute and deadline paused) and wakes on next activity. The macOS suspend/wake prototype produces a cycle-time report + feasibility recommendation; the realistic outcome may be deferred indefinitely.
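The idle-offload arithmetic can be sketched as two small functions: when to suspend, and how the deadline moves after a wake. The sketch assumes the 7-day cap is measured from sandbox creation and that the paused time is simply added back to the deadline; both are interpretations of the text above, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def should_suspend(
    platform: str,
    status: str,
    last_activity_at: datetime,
    now: datetime,
    idle_timeout: timedelta = timedelta(minutes=5),
) -> bool:
    """Idle offload applies only to ready Linux sandboxes in v1."""
    return platform == "linux" and status == "ready" and now - last_activity_at >= idle_timeout


def deadline_after_wake(
    deadline_at: datetime,
    suspended_at: datetime,
    woke_at: datetime,
    created_at: datetime,
    lifetime_cap: timedelta = timedelta(days=7),
) -> datetime:
    """Shift the deadline by the time spent suspended, bounded by the absolute cap."""
    paused = woke_at - suspended_at
    return min(deadline_at + paused, created_at + lifetime_cap)
```

last_activity_at would be the latest of the most recent POST /exec, SSH session, or preview-port request seen for the sandbox.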

Draft history
Revision | Status | Edited by
--- | --- | ---
19 | Proposed | pedro@tuist.dev
18 | Proposed | pedro@tuist.dev
17 | Proposed | pedro@tuist.dev
16 | Proposed | pedro@tuist.dev
15 | Proposed | pedro@tuist.dev
14 | Proposed | pedro@tuist.dev
13 | Proposed | pedro@tuist.dev
12 | Proposed | pedro@tuist.dev
11 | Proposed | pedro@tuist.dev
10 | Proposed | pedro@tuist.dev
9 | Proposed | pedro@tuist.dev
8 | Proposed | pedro@tuist.dev
7 | Draft | pedro@tuist.dev
6 | Draft | pedro@tuist.dev
5 | Draft | pedro@tuist.dev
4 | Draft | pedro@tuist.dev
3 | Draft | pedro@tuist.dev
2 | Draft | pedro@tuist.dev
1 | Draft | pedro@tuist.dev
Comments
marek@tuist.dev Jun 16, 2026

(Review 1/2)

Reviewed end to end. Strong spec and the core thesis is right: the runner fleet is a substrate, and sandboxes are a second workload with a small conceptual delta. Most of my feedback is structural (altitude + one forward-compatibility constraint), with a handful of correctness concerns in billing, capacity, and isolation that hold regardless of how the spec is framed.

1. Raise the altitude. A lot of this is implementation detail that will rot the moment the code diverges: the file-by-file Layout block, the exact Ecto DDL and indexes, the ClickHouse engine choice, the module-path list, and the lifecycle pseudocode. I’d move all of that to the PR and keep the spec to: Why, the runner→sandbox delta, Goals/Non-goals, the API surface as a contract, the open questions, and Done-when. That alone roughly halves the length and puts the open questions at the center, which is where a draft should sit.

2. Keep the substrate separable from the inbound API (build for a future outbound consumer, don’t scope it now). The whole spec assumes a caller-driven model: someone calls inbound and drives a POST /exec loop over our SSH broker. That’s the right v1. But I’d keep the provisioning substrate (pod lifecycle, repo staging, image, billing session) cleanly decoupled from the inbound REST/exec/SSH/preview surface, so we don’t bake in the assumption that execution is always driven inbound. The concrete future consumer I have in mind is Anthropic’s self-hosted sandboxes for Claude Managed Agents: there Anthropic owns the agent loop and you run a worker that connects outbound, claims a session from their queue, downloads skills, and drives execution from inside the sandbox, with no inbound exec, broker, or preview URL involved. We don’t need to design that now, but if the lifecycle and billing assume the broker/preview path is always present, we’ll be unwinding it to support that later. (It’s also where our macOS + Xcode pool is a real differentiator, since every provider Anthropic lists for this is Linux. Worth a one-line nod in Why, not a use case in scope.)

3. Promote VM snapshots / warm images to a Goal. Today they’re uncommitted API stubs (in the surface, absent from Goals/Done-when, parked as an open question). They should be a goal, scoped per platform, justified by provisioning latency: macOS cold boot is minutes, which neither an interactive session nor any agent loop tolerates per sandbox. Restoring from a pre-warmed image (Xcode + deps resident) is what makes the macOS path viable, and it doubles as the capacity-reclaim lever in (7). (Note this is a full-VM snapshot on macOS vs a volume/fs snapshot on Linux: different capabilities under one word, pick which we promise per platform.)

4. Billing correctness (survives any altitude). Two real decisions, not implementation detail:

  • Open anchor. runner_sessions opens at claim-win specifically to keep warm-pool idle out of billing. The sandbox flow opens at create (status pending, before the pod is ready), so we’d bill provisioning and boot, which on macOS is minutes. The schema already has ready_at; decide bill-from-ready vs bill-from-create deliberately, especially for the customer accounts behind the flag.
  • Close trigger. Runner sessions are closed by the controller observing a terminal Pod phase (finishedAt), covering every abnormal exit. The sandbox design closes from the Terminator after a DELETE, with the reaper covering deadline. A pod that dies outside a DELETE (OOM, node loss, guest crash, all of which we see on this fleet) leaves an open session billing until the 6h clamp between cron ticks. Sandboxes need an equivalent reconcile/observer.
marek@tuist.dev Jun 16, 2026

(Review 2/2, continued)

5. The “reuse verbatim” claims are overstated. Tuist.Runners.Billing.interval_intersection/3 doesn’t exist; the real surface is compute_milliseconds/4 (which already takes a :platform scope opt, so the per-platform dashboard claim is genuinely free). Its @max_session_lifetime_seconds = 6h clamp is a runner safety net hard-coded as a module attribute; reused as-is it silently caps sandbox billing and needs to be per-kind once deadlines approach it. And Tuist.Runners.Claims is keyed on workflow_job_id with ON CONFLICT to collapse pollers racing one shared job; sandboxes have no shared job, so the reusable part is the per-account count + advisory lock, not the claim row. At higher altitude, just say “reuse the per-account billing + concurrency primitives” without naming arities.

6. Isolation / multi-tenancy is under-specified. Sandboxes run untrusted code by design, and the provisioner mounts cache + registry creds into the pod, a larger exfiltration surface than runners (which already defend the SA-token-reuse path). State the isolation posture (Linux: runc/gVisor/Kata? macOS: full VM, strong), scope mounted tokens least-privilege + short-TTL, and note that project_id is described as “just a reporting tag” but actually gates credential mounting: a tuist-account sandbox holding a customer project’s creds is a cross-account credential flow that needs an explicit grant model. Also: the single multi-tenant SSH broker is framed as an availability SPOF, but a leaked/guessable token is a confidentiality blast radius across sandboxes; and public preview URLs expose untrusted-code services to the internet (auth + SSRF back into the internal network).

7. Capacity is a risk, not just an open question. A sandbox pins a pool slot for its whole deadline; a runner boots-runs-exits in minutes. And the fleet is pre-bought bare metal with a fixed slot ceiling, there’s no autoscaler to absorb the load, so a long-lived hold is a direct and lasting subtraction from CI capacity, with the only relief being to buy and provision more hardware. The macOS pool is especially tight (~2 VMs per Mac mini, so the ceiling is small and hard), and a few parallel macOS sandboxes contend head-to-head with CI for a fixed number of slots; the Linux pool is already reservation-bound with a guest-OOM history. I’d treat macOS long-lived-guest-stability as a gating item for that path, and admission control / fair-sharing between runner and sandbox demand (plus a capacity-planning signal for when to add hardware) as a goal rather than an open question.

8. Smaller. tuist sandbox ssh is in Goals and Done-when but missing from the Design CLI list, and it’s hand-written (mint token + shell out), not OpenAPI-generated, so call that out. MCP wait_sandbox_ready has no REST mirror (it’s client-side polling sugar, fine, just not a literal mirror). The ClickHouse ReplacingMergeTree for an append-only event stream looks wrong (it collapses rows on the sort key), though this disappears if the engine choice is trimmed per (1).

What’s good and should stay: the two-pool-one-substrate framing, the runner→sandbox delta table, the partial reaper index, the vendor survey and its “convergent verbs” synthesis, and the flag + per-account-cap rollout path.

pedro@tuist.dev Jun 16, 2026

@marek thanks, solid review. Worked through all eight points in revs 8-15.

(1) altitude cut: Layout, DDL, Lifecycle pseudo, CH engine all gone. (2) new Layers section splits provisioning substrate from inbound surfaces. Outbound-driven workers (Anthropic Managed Agents) called out in Why + Non-goals. (3) warm-image provisioning and customer snapshot/restore promoted to Goals, with per-platform mechanism documented. (4) billing opens at ready_at, closes on observed pod terminal phase (covers OOM/node loss/guest crash), session-lifetime clamp goes per-kind. (5) “reuse verbatim” softened; Claims-row vs primitive distinction made explicit. (6) new Security subsection. grants is the access-control field, project_id is metadata, account agent tokens replace project tokens. v1 keeps grants same-account; cross-account becomes the open question. (7) admission control (runners win by default) and macOS guest stability promoted from open questions to Goals. (8) tuist sandbox ssh flagged as hand-written, wait_sandbox_ready clarified as polling sugar, ReplacingMergeTree gone with the altitude cut.

Would appreciate another pass.

pedro@tuist.dev Jun 16, 2026

Apologies, the previous comment rendered without line breaks. Same content, properly formatted:

Worked through all eight points in revs 8-15.

  1. Altitude cut: Layout, DDL, Lifecycle pseudo, CH engine all gone.
  2. New Layers section splits provisioning substrate from inbound surfaces. Outbound-driven workers (Anthropic Managed Agents) called out in Why + Non-goals.
  3. Warm-image provisioning and customer snapshot/restore promoted to Goals, with per-platform mechanism documented.
  4. Billing opens at ready_at, closes on observed pod terminal phase (covers OOM/node loss/guest crash); session-lifetime clamp goes per-kind.
  5. “Reuse verbatim” softened; Claims-row vs primitive distinction made explicit.
  6. New Security subsection. grants is the access-control field, project_id is metadata, account agent tokens replace project tokens. v1 keeps grants same-account; cross-account becomes the open question.
  7. Admission control (runners win by default) and macOS guest stability promoted from open questions to Goals.
  8. tuist sandbox ssh flagged as hand-written, wait_sandbox_ready clarified as polling sugar, ReplacingMergeTree gone with the altitude cut.

Would appreciate another pass.

marek@tuist.dev Jun 16, 2026

Stepping back from the point-by-point: two more fundamental things, the first a disagreement.

1. Don’t share the CI fleet in v1. I’d reverse Goal 2 for now. Hive is internal-only, so sharing means risking customer-facing CI stability for a consumer that delivers no customer value yet. And admission control doesn’t buy what we need here: it arbitrates scheduling (runners win a scarce slot at create time) but does nothing about runtime interference. A sandbox that already won a slot and is co-scheduled with a runner still contends for CPU/IO, and on macOS a sandbox VM sharing a Mac mini with a runner VM puts the unproven guest-stability risk (Goal 5) on the same host as customer CI. For an internal-only, stability-unproven workload, the right v1 is dedicated sandbox capacity (a cordoned subset / separate pool) with the customer CI fleet isolated. Share later, once sandboxes are proven stable AND there’s external demand that justifies the risk. As a bonus this dissolves the warm-pool-vs-admission-control tension and the procurement-pace bet.

2. The spec states goals where it needs to show mechanism. Right altitude on the incidental stuff (good that Layout/DDL are gone) but the genuinely hard parts are still asserted, not solved. “Closes on observed pod terminal phase”, “warm-image restore in seconds”, “resume-where-I-left-off”, “clamp per-kind”: that is the design, and it’s a sentence each. This isn’t a reversal of the altitude cut, it’s the other half of it. Cut the mechanical detail; spend the reclaimed length on the uncertain mechanisms. The spec should be longest where the risk is. I’d want the core to be:

(a) one lifecycle diagram covering the session AND snapshot state machines (pending, ready, terminating, terminated, failed; plus the snapshot sub-lifecycle: requested -> snapshotting -> stored -> restoring -> ready), and

(b) a transition table: for each edge, the trigger, the component that drives it, the failure mode, and the billing/state-consistency rule.

Filling that table is where the open residuals actually get answered, because each is a transition the prose currently skips:

  • abnormal death -> terminated: what watches sandbox pod terminal phase? Runners get it from the controller’s POST /api/internal/runners/pods/stopped. Sandboxes need a named equivalent, or the OOM/node-loss/crash billing-close doesn’t actually exist.
  • ready -> snapshot: live or quiesced? captured state = memory+process or filesystem only? The macOS full-VM and Linux fs mechanisms deliver different fidelity, so one “resume where I left off” promise can’t hold across both as written.
  • stored -> restore: snapshot storage cost, retention, and GC are all unspecified, and billing only covers session intervals.
  • pending -> ready: warm-image restore vs cold-boot fallback (and, only if we don’t dedicate capacity, whether warm reservations are preemptible by runner demand).
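
The state set from (a) and the transition table from (b) could be sketched as plain data, for example in Python. This is only a shape under stated assumptions: the five state names come from the comment above, but every edge, trigger name, driver name, and billing rule below is a hypothetical placeholder, not what the spec defines.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    READY = "ready"
    TERMINATING = "terminating"
    TERMINATED = "terminated"
    FAILED = "failed"


# (from, to) -> (trigger, driving component, billing rule).
# All edges and labels are illustrative assumptions, not the spec's table.
TRANSITIONS = {
    (State.PENDING, State.READY): ("pod_running", "scheduler", "open billing interval"),
    (State.PENDING, State.FAILED): ("boot_timeout", "reaper", "nothing billed"),
    (State.READY, State.TERMINATING): ("delete_or_deadline", "api_or_reaper", "interval stays open"),
    (State.READY, State.TERMINATED): ("pod_stopped", "pod_watcher", "close billing interval"),
    (State.TERMINATING, State.TERMINATED): ("pod_stopped", "pod_watcher", "close billing interval"),
}


def transition(current: State, target: State) -> tuple:
    """Return the (trigger, driver, billing rule) row, or reject an unlisted edge."""
    try:
        return TRANSITIONS[(current, target)]
    except KeyError:
        raise ValueError(
            f"illegal transition {current.value} -> {target.value}"
        ) from None
```

Writing the table as data makes the gaps visible: any edge without a row (for example abnormal death from ready) is an edge with no named driver and no billing rule.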

I’d rather the spec spend its length here than on the goal list. Happy to pair on the state machine.

pedro@tuist.dev Jun 17, 2026

@marek thanks, both fair. Rev 16:

  1. Dedicated capacity in v1. Goal 2 reversed: cordoned sandbox pool, isolated from CI capacity. Goal 3 shrinks to just the capacity-planning signal (no shared contention to arbitrate). Reference model, Why, and Done-when updated to match.
  2. Mechanism over goals. Took a shot at the state machine. New “Lifecycle and state machines” section: sandbox transition table (seven edges, abnormal-exit driven by a named Sandbox Pod Watcher + POST /api/internal/sandboxes/pods/stopped, mirroring the runner controller), snapshot sub-lifecycle table, per-platform fidelity table (full_vm vs filesystem, the API surfaces kind so the asymmetry is visible), plus snapshot storage and warm-pool paragraphs.

Would still take you up on the pair session to refine the residuals, especially: snapshot fidelity (live vs quiesced per call or fixed per platform?), and warm-pool reservation semantics if we ever share the fleet.

marek@tuist.dev Jun 18, 2026

Rev 16 is the right move: named mechanisms (tart pause-copy-resume, CSI volume snapshot, Tigris storage, the Sandbox Pod Watcher) instead of asserted goals. Two things it still doesn’t answer, and they turn out to be the same problem.

Snapshot/restore is named but not timed

The fidelity and sub-lifecycle tables describe what happens, but “resume in a reasonable time” is exactly what’s missing:

  1. No latency budget, no Done-when criterion. The warm pool gets “resolves in seconds”; snapshot-create and restore get no target, and Done-when has no timing acceptance for either. The hard number is unstated.
  2. macOS full_vm is data-movement-bound and the spec hides the cost. “memory + process state” means dumping the VM’s multi-GB RAM image to disk and pushing it to Tigris on snapshot, then pulling multi-GB back on restore. That transfer is the latency, and it’s minutes-scale, not seconds. tart pause-copy-resume is presented as a checkbox when it’s the single hardest latency problem here. It deserves the same “gating, prove it with a budget” treatment guest-stability (Goal 5) and warm-image (Goal 4) already get.
  3. Restore can’t use the warm pool, but the spec implies create-speed. Restore runs Sandboxes.create(snapshot_ref) -> restoring -> ready. Warm pods are blank OS+Xcode; a restore needs this snapshot’s state rehydrated, which a warm pod can’t provide. So restore is a separate, slower, transfer-bound path with no budget. Linux CSI snapshot-to-PV is tractable; macOS multi-GB memory load is not, yet. Say so explicitly.
  4. Internal contradiction. “Snapshot capture happens in-place… the sandbox stays ready throughout” is false for macOS: pause-copy-resume freezes the VM for the duration of a multi-GB dump. Only Linux CoW is non-disruptive.
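
To make the "minutes-scale" claim in point 2 concrete, here is a back-of-envelope sketch in Python. The image size and throughput are assumed inputs for illustration only, not measurements of the fleet or of Tigris, and the estimate ignores the time to dump memory to disk, compression, and request overhead.

```python
def transfer_seconds(image_gib: float, throughput_gbit_s: float) -> float:
    """Lower bound on the time to move a memory image one way over the network.

    image_gib: size of the VM memory image in GiB (assumed value).
    throughput_gbit_s: sustained throughput to object storage in Gbit/s (assumed value).
    """
    bits = image_gib * 1024**3 * 8
    return bits / (throughput_gbit_s * 1e9)


# Hypothetical 16 GiB image at a sustained 1 Gbit/s, one direction only.
print(round(transfer_seconds(16, 1.0), 1))  # → 137.4
```

Under those assumptions a single direction already takes more than two minutes, and a suspend followed by a wake pays that cost twice. A latency budget in the spec would replace the assumed inputs with measured ones.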

To answer the question, the spec needs four things: a per-platform latency budget for snapshot-create and restore, a Done-when criterion, an explicit “restore is not on the warm-pool fast path” note, and the macOS memory-image transfer called out as the gating risk with a proof point.

Idle offload isn’t addressed at all

There’s no idle concept in the state machine. The only automatic exits are deadline_at (reaper) and pod death. A sandbox that’s ready but doing nothing keeps its slot and keeps billing until the deadline or explicit DELETE. Agent loops are idle a lot (waiting on the model, between iterations, waiting on a human), so a customer hogs a dedicated-pool slot and pays for dead air up to an hours-long deadline.

The irony is that the machinery to solve this is already in the spec and just isn’t wired up. Idle offload is suspend-to-snapshot: detect idle -> snapshot (offload) -> free the slot and pause compute billing -> restore on next activity (the Cloudflare wake-by-name pattern). What’s needed:

  • an idle / suspended state, with a definition of idle (no exec / no SSH session / no port traffic for T),
  • the offload transition (suspend-to-snapshot) and the wake transition (restore-on-activity),
  • a billing rule for suspended time (compute pauses, only snapshot storage accrues),
  • the deadline interaction (does suspend stop the deadline clock?).
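
The idle definition in the first bullet could be sketched as a predicate, for example in Python. The field names are hypothetical, T is passed in as `threshold_s`, and treating an open SSH session as activity regardless of traffic is an assumption the spec would need to confirm.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """Last-seen timestamps (seconds) for one sandbox. Field names are hypothetical."""
    last_exec_at: float
    last_ssh_at: float
    last_port_traffic_at: float
    open_ssh_sessions: int = 0


def is_idle(activity: Activity, now: float, threshold_s: float) -> bool:
    """True when there has been no exec, SSH, or port traffic for threshold_s seconds."""
    # Assumption: an attached SSH session keeps the sandbox active even if it is silent.
    if activity.open_ssh_sessions > 0:
        return False
    last_seen = max(
        activity.last_exec_at,
        activity.last_ssh_at,
        activity.last_port_traffic_at,
    )
    return now - last_seen >= threshold_s
```

Whichever signals end up in the definition, the suspend decision reduces to one comparison against the most recent activity timestamp, so the cost lies in collecting those timestamps rather than in evaluating them.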

They’re the same question

You can’t offer transparent idle offload without fast restore. Auto-suspending a macOS sandbox and then taking minutes to reload a multi-GB memory image on wake is a bad experience, and that wake path can’t use the warm pool. So macOS fast-restore is the linchpin for both “reasonable snapshot time” and “don’t hog idle sessions,” and the shared root cause, moving VM memory state to/from object storage quickly, is the thing the spec hasn’t costed. I’d make that the centerpiece of the next revision (or the pair session): pick a latency target, prototype the macOS memory snapshot/restore against Tigris, and design the idle->suspend->wake loop on top of whatever that proves out.

pedro@tuist.dev Jun 18, 2026

@marek sharp catch on both, and you’re right they’re the same problem. Rev 17:

  1. Customer-controlled snapshot endpoint deferred from v1. The mechanism stays in the substrate but is internal-only. Goal 11 (customer snapshot) is gone; the POST /snapshots(/:snap/restore) endpoints are out of the API surface.
  2. Idle offload is now a v1 Goal (Linux only). New Goal 11: idle threshold 5 min (configurable), suspend = volume snapshot + pod release, wake on activity, compute billing pauses while suspended, deadline clock pauses while suspended, 7-day absolute wall-clock cap. New state-machine rows for ready -> suspending -> suspended -> restoring -> ready plus suspended -> terminating.
  3. macOS suspend/wake is a gated research goal. New Goal 12: prove that VM memory image transfer to/from Tigris is fast enough to be useful, “fast enough” decided from the prototype output (not committed up front per our discussion). Until then, macOS sandboxes have no idle offload and no snapshot.
  4. Fidelity table reframed and the in-place lie fixed. “Snapshot sub-lifecycle” -> “Suspend/wake fidelity”; Linux ships, macOS gated. Explicit note that if macOS suspension ships later, it transitions ready -> suspending with the VM paused during the dump, not in-place.
  5. Warm pool note: restore (suspended -> ready) explicitly does NOT use the warm pool; a restore needs the suspended sandbox’s state rehydrated, not a blank base image.
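
The billing and deadline rules in item 2 can be checked with a small worked sketch in Python. It assumes billing starts at creation (a simplification; the spec may start it at ready), that suspended intervals are recorded as (start, end) pairs, and that the 7-day cap is measured from creation. All names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field

SEVEN_DAYS_S = 7 * 24 * 3600  # absolute wall-clock cap from item 2


@dataclass
class Sandbox:
    created_at: float
    deadline_budget_s: float  # active time allowed before the deadline
    suspended: list[tuple[float, float]] = field(default_factory=list)


def suspended_seconds(sb: Sandbox, now: float) -> float:
    """Total time spent suspended up to `now`, clipping intervals that run past it."""
    return sum(min(end, now) - start for start, end in sb.suspended if start < now)


def billable_compute_seconds(sb: Sandbox, now: float) -> float:
    """Compute billing pauses while suspended, so suspended time is subtracted."""
    return (now - sb.created_at) - suspended_seconds(sb, now)


def effective_deadline(sb: Sandbox, now: float) -> float:
    """The deadline clock pauses while suspended, but never past the 7-day cap."""
    shifted = sb.created_at + sb.deadline_budget_s + suspended_seconds(sb, now)
    return min(shifted, sb.created_at + SEVEN_DAYS_S)


sb = Sandbox(created_at=0, deadline_budget_s=3600, suspended=[(600, 1800)])
print(billable_compute_seconds(sb, now=2400))  # → 1200
print(effective_deadline(sb, now=2400))        # → 4800
```

In the example a one-hour sandbox that spent 20 minutes suspended is billed for 20 minutes of compute at the 40-minute mark, and its deadline moves from 3600 s to 4800 s. Snapshot storage accrual during the suspended interval is not modeled here.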

Still want to pair on the macOS prototype side: even without a committed number, the prototype design (sample VM size, transfer mechanism to Tigris, snapshot format, whether incremental memory snapshots are tractable) is where the real risk lives.
