feat(kura): derive CAS segment-ring capacity from disk size

GitHub issue · Closed

Metadata

Source

tuist/tuist #11222

Updated

Jun 24, 2026

Domains

Kura

Details

Addresses the capacity portion of #11221 (the GetActionResult blob-existence validation remains as follow-up).

What changed

The CAS segment ring’s generation counts are now resolved at store startup instead of being compile-time constants:

KURA_CAS_CAPACITY_BYTES (new, optional) sets the artifact-body budget explicitly. Exposed in the Helm chart as config.casCapacityBytes.
When unset, the budget derives from the data-dir filesystem size: 50% of the filesystem’s total bytes.
Configured or derived, the budget is capped at 80% of the filesystem, so the resident segments plus the extra segment a rotation appends before evicting the oldest one can never run the disk full. The existing free-space check at rotation time (SEGMENT_FREE_SPACE_MARGIN) stays as the runtime guard.
The legacy 1/2/2 ring is the floor: hosts where the filesystem size cannot be determined (statvfs failure, non-unix) and tiny disks keep exactly today’s behavior. Generation counts keep the legacy 1:2:2 old/current/new proportions at any size.
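
The resolution rules above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR. The function name resolve_total_segments, the LEGACY_TOTAL_SEGMENTS constant, rounding down to whole segments, and ignoring an explicit override when the filesystem size is unknown are all assumptions. Only the 512 MiB segment size, the 50% default, the 80% cap, and the 5-segment legacy floor come from the description.

```rust
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB; // convenience unit for callers and examples

/// Segment size named in the description.
const MAX_SEGMENT_BYTES: u64 = 512 * MIB;
/// Legacy ring: 1 old + 2 current + 2 new (constant name is hypothetical).
const LEGACY_TOTAL_SEGMENTS: u64 = 5;

/// Hypothetical resolver: returns the total number of segments in the ring.
///
/// `configured_bytes` stands in for KURA_CAS_CAPACITY_BYTES, and
/// `fs_total_bytes` for the data-dir filesystem's total size (None when it
/// cannot be determined).
fn resolve_total_segments(configured_bytes: Option<u64>, fs_total_bytes: Option<u64>) -> u64 {
    let Some(fs_total) = fs_total_bytes else {
        // Assumption: with no disk info, fall back to the legacy ring even
        // if an override is set.
        return LEGACY_TOTAL_SEGMENTS;
    };

    let default_budget = fs_total / 2; // 50% unattended default
    let ceiling = fs_total / 5 * 4; // 80% hard cap, applied to both paths

    let budget = configured_bytes.unwrap_or(default_budget).min(ceiling);

    // Assumption: round down to whole segments, then apply the legacy floor.
    (budget / MAX_SEGMENT_BYTES).max(LEGACY_TOTAL_SEGMENTS)
}
```

Under these assumptions a 10 GiB filesystem with no override resolves to 10 segments (5 GiB), and an oversized override on the same filesystem is clamped to 16 segments (8 GiB).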

Why

MAX_SEGMENT_BYTES = 512 MiB with hard-coded 1/2/2 desired generations gave every node ~2.5 GiB of artifact-body capacity regardless of volume size; this was observed evicting hours-old blobs on a node with 496 GiB free. A single Bazel build of the Kura crate graph pushes ~1.3 GiB through the cache, so REAPI workloads rotated the ring constantly; combined with action-cache entries that outlive the blobs they reference (#11221), this broke builds with unrecoverable “Lost inputs no longer available remotely” errors.

With this change a node sized like the managed deployments (10 GiB volumes) holds ~5 GiB of artifacts by default instead of 2.5 GiB, and operators can size the budget deliberately without forking constants.

Why the default is 50% (and not the 80% cap)

The two percentages answer different questions: 80% is the hard guard (“even an explicit operator config must never be able to program an ENOSPC”), while 50% is the unattended default (“what is safe to take on every node in the fleet without knowing the workload”). The default cannot sit at the guard for a few reasons:

The budget is computed from the filesystem’s total bytes (statvfs f_blocks), not available space — it does not subtract what the volume’s other tenants already use or will grow into.
The segment ring is not the only tenant of the volume, and the largest co-tenant is unbounded. RocksDB shares the data dir: the key_value CF (inline artifacts — REAPI ActionResults, keyvalue payloads) has no eviction and grows monotonically until a namespace delete; the outbox can back up to 100k messages during a peer outage; the WAL and compaction transients add more (the configured compaction limits tolerate up to 64–256 GiB of pending debt). Filesystem blob artifacts (blob_path, e.g. multipart module uploads) live outside the ring’s accounting entirely, and the tmp dir adds up to 8 GiB of staging when it shares the volume.
The failure modes are asymmetric. A smaller ring only costs hit rate. A full disk fails cache writes (disk_full) and can wedge RocksDB on ENOSPC — an outage needing manual intervention. A default has to be wrong in the cheap direction.
Reversibility. Raising the budget later is a config change (KURA_CAS_CAPACITY_BYTES, still capped at 80%); recovering a disk-full node is not. An operator who has measured their metadata footprint can deliberately push toward the cap — that is what the knob is for.
It mirrors the existing resource pattern. Memory uses the same shape: 70% soft default, 85% hard limit. 50/80 is the disk analogue — a default operating point with margin, and a never-exceed line.
Rollout blast radius. This ships fleet-wide with no operator action and is already a ~40× capacity jump for production nodes; the marginal hit-rate gain of defaulting to 80% is small next to debuting at the ENOSPC boundary everywhere at once.

Why the 1:2:2 generation split is preserved (and what the `old` share controls)

The ring’s three generations look like a pipeline, but the code only ever distinguishes old vs not-old: refresh-on-read (maybe_refresh_manifest) fires exclusively for artifacts whose segment is in the old generation, only new.last() receives writes, and new vs current is never checked anywhere else (the split surfaces only in the kura_segment_generation_count metric). So the generation counts encode three real quantities:

total segments → ring capacity,
new + current → the quiet zone: how long an artifact lives with zero refresh tax and zero eviction risk,
old → the rescue window: the stretch of the ring’s tail where a read gives an artifact a second chance by copying it forward into the active segment before it falls off the end.

Keeping old at ~20% of the ring (the resolver computes old = total / 5, which is exactly the legacy 1-of-5 proportion) is the part that matters:

Too small, and the second chance disappears. If old stayed pinned at 1 segment while this PR grows rings to hundreds of segments, the rescue window would shrink to a fraction of a percent of the cache — effectively FIFO with no second chance. An artifact would have a single rotation interval (minutes, under heavy ingest) to be re-read before deletion, so periodic workloads (nightly builds, weekly release branches) would never be rescued, and the refresh mechanism that exists to keep hot artifacts alive would be vestigial. This is also forward-looking: any lease-style mitigation for #11221 (touching blobs on FindMissingBlobs/AC hits) can only rescue blobs that are still inside the old window, so its width bounds how effective those fixes can ever be.
Too large, and reads get taxed. Every rescue is an inline full copy of the artifact (paid by the triggering request) serialized through a node-wide refresh lock, plus a second fsync’d metadata batch. A wider old fraction raises the probability that any given read lands in the taxed zone, and the rescue appends themselves fill the active segment faster — accelerating rotation in a feedback loop. It also comes straight out of the quiet zone, shortening the tax-free residency of fresh writes.
Proportions keep behavior scale-invariant. Deriving the split as ratios rather than absolute counts means a node behaves the same way at any disk size — the 5-segment legacy floor and a 300-segment ring are the same policy (“the last fifth of the ring is the grace period”), which makes hit-rate and refresh-load characteristics predictable as fleets move between volume sizes, and means this PR changes capacity without silently changing the eviction policy.

20% is not claimed to be optimal — it is the proportion production has been running implicitly all along, preserved so this PR changes exactly one variable (capacity) and the old/quiet trade-off can be tuned later with data if refresh metrics suggest it.
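
As an illustration of the proportional split, the sketch below derives the three generation counts from a total segment count. Only old = total / 5 is stated in the description. The GenerationCounts type, the function name, and the way the remaining segments are divided between current and new (evenly, with any odd segment going to current) are assumptions made for this example.

```rust
/// Hypothetical container for the three generation counts.
#[derive(Debug, PartialEq, Eq)]
struct GenerationCounts {
    old: u64,
    current: u64,
    new: u64,
}

/// Splits a ring of `total` segments in roughly 1:2:2 old/current/new
/// proportions.
fn split_generations(total: u64) -> GenerationCounts {
    // The legacy 5-segment ring is the floor, so `old` is never zero.
    let total = total.max(5);

    // From the description: the rescue window is one fifth of the ring.
    let old = total / 5;

    // Assumption: the quiet zone (everything that is not old) is divided
    // evenly, with the odd segment, if any, assigned to `current`.
    let quiet = total - old;
    let new = quiet / 2;
    let current = quiet - new;

    GenerationCounts { old, current, new }
}
```

With this split the legacy 5-segment ring comes out as 1/2/2 and a 300-segment ring as 60/120/120, so the rescue window stays at a fifth of the ring at both sizes.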

Rollout safety

Node-local policy only: no on-disk format, wire format, or replication protocol change. Mixed-version meshes are unaffected — old nodes keep the 2.5 GiB ring, new nodes derive theirs, and the replication/bootstrap paths never inspect generation counts of peers.
The ring can only grow under this change for existing deployments (floor = legacy counts), so no eviction storm on rollout; disk usage grows toward the derived budget, which the 80%-of-filesystem cap bounds.
Ships with the matching Helm values plumbing (config.casCapacityBytes, omitted by default).

Validation

New unit tests for the resolver: disk-derived default, explicit override, 80% ceiling clamping an oversized override, legacy floor for tiny disks / missing disk info / tiny configured values, and proportional splits.
New config tests: KURA_CAS_CAPACITY_BYTES unset → None, valid parse, rejects 0 and non-numeric values.
cargo test — 243 passed; cargo clippy --all-targets -- -D warnings — clean; cargo fmt --check — clean.
helm template renders the env var when config.casCapacityBytes is set and omits it entirely when unset.
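
The config behavior those tests describe (unset yields None, a valid number parses, 0 and non-numeric values are rejected) could look roughly like the sketch below. The function name, the String error type, and the error messages are hypothetical. How the real parser treats whitespace or an empty string is not stated, so this sketch simply lets the integer parse reject them.

```rust
/// Hypothetical parser for the raw KURA_CAS_CAPACITY_BYTES value.
///
/// `raw` is None when the variable is unset.
fn parse_cas_capacity_bytes(raw: Option<&str>) -> Result<Option<u64>, String> {
    // Unset: no explicit budget, so the caller derives one from disk size.
    let Some(value) = raw else {
        return Ok(None);
    };

    match value.parse::<u64>() {
        // A zero budget is rejected rather than treated as "unset".
        Ok(0) => Err("KURA_CAS_CAPACITY_BYTES must be greater than 0".to_string()),
        Ok(bytes) => Ok(Some(bytes)),
        Err(err) => Err(format!(
            "KURA_CAS_CAPACITY_BYTES is not a valid byte count: {err}"
        )),
    }
}
```

A caller could feed it the process environment with parse_cas_capacity_bytes(std::env::var("KURA_CAS_CAPACITY_BYTES").ok().as_deref()).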

Comments

esnunes Jun 12, 2026

Approving. The disk-size-derived budget is a clear win over the hard-coded 2.5 GiB ring and unblocks #11221.

One thing worth recording for follow-up rather than blocking on here: the “can never run the disk full” reasoning assumes every segment stays bounded by MAX_SEGMENT_BYTES, but active_segment in kura/src/store.rs only uses that as a rotation trigger. A single artifact up to MAX_MODULE_TOTAL_BYTES (2 GiB, the HTTP/REAPI cap) gets appended whole into a fresh segment, so the resulting segment file can be ~4x the assumed size. The 20% headroom absorbs this on realistic multi-GiB volumes, but on small disks where the legacy 5-segment floor wins (~4 GiB and below), one oversize module can push resident bytes past the disk's capacity.

Related: the pre-rotation free-space check at kura/src/store.rs:1079 is MAX_SEGMENT_BYTES * 2 = 1 GiB regardless of incoming_size, so an oversize write on a tight disk slips past the check and fails mid-copy with a plain I/O error instead of DISK_FULL_MARKER, leaving orphaned bytes at the head of the new segment until rotation.

Neither is a regression from this PR (both pre-date it), and the right fix probably routes oversize artifacts to standalone blobs since ArtifactManifest.blob_path and every read path already handle that branch. Worth a follow-up issue, not a change here.

Yep, the incoming_size case is handled by https://github.com/tuist/tuist/pull/11246