
Kura REAPI: GetActionResult returns hits whose output blobs were evicted, permanently breaking Bazel builds with “Lost inputs” errors

GitHub issue · Open

Metadata
Source: tuist/tuist #11221
Updated: Jun 11, 2026
Domains: Kura
Details

Summary

Kura’s REAPI action cache and its CAS blob storage have decoupled lifecycles. Action results are stored as keyvalue protos in RocksDB with no expiry, while blob bodies live in a fixed-size segment ring that deletes the oldest segment’s blobs on rotation. GetActionResult returns cached results without verifying that the referenced output blobs still exist, so once the ring rotates, clients receive action-cache hits pointing at deleted blobs.

For Bazel (which builds “without the bytes” by default, --remote_download_outputs=toplevel), this is unrecoverable without manual intervention:

ERROR: ... Running Cargo build script serde_core [for tool] failed:
Lost inputs no longer available remotely:
external/rules_rust++i2+rrc__serde_core-1.0.228/_bs- (ae989bcc.../466832),
external/rules_rust+/cargo/private/cargo_build_script_runner/runner (25da80b4.../653024)
...
Found transient remote cache error, retrying the build...

Bazel’s automatic rewind/retry loop cannot converge, because every retry re-accepts other poisoned action-cache hits. Bazel’s local action cache also re-validates against the remote AC after --experimental_remote_cache_ttl (3h default), so even previously green local state degrades back into the failure. After a bazel clean --expunge, every build fails immediately. The only recovery is a full rebuild with --noremote_accept_cached.

Root cause

Two interacting facts:

  1. Fixed-capacity segment ring (src/constants.rs, src/store.rs): MAX_SEGMENT_BYTES = 512 MiB with desired generations DESIRED_OLD_SEGMENTS = 1, DESIRED_CURRENT_SEGMENTS = 2, DESIRED_NEW_SEGMENTS = 2 — roughly 2.5 GiB of artifact-body capacity per node, hard-coded, independent of available disk. Segment rotation (active_segmentstate.push_newevict_segments) unlinks the oldest segment and deletes its artifacts. A single bazel build of the Kura crate graph writes ~1.3 GiB, so a few build passes rotate the ring within hours. Observed on a node with 496 GiB free disk.
  2. No existence validation on action-cache reads (src/reapi/mod.rs, get_action_result): the ActionResult proto is fetched from the keyvalue store and returned as-is. Segment eviction correctly removes the CAS index entries (so FindMissingBlobs is honest and uploads self-heal), but nothing removes or invalidates the ActionResult entries that reference the deleted digests.
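
The capacity figure in point 1 follows directly from the constants. A minimal sketch of the arithmetic, assuming the constants are plain integers (their real types in src/constants.rs are not shown in this issue) and using two hypothetical helper functions:

```rust
// Values as quoted in this issue; the u64 types are an assumption.
const MAX_SEGMENT_BYTES: u64 = 512 * 1024 * 1024; // 512 MiB
const DESIRED_OLD_SEGMENTS: u64 = 1;
const DESIRED_CURRENT_SEGMENTS: u64 = 2;
const DESIRED_NEW_SEGMENTS: u64 = 2;

/// Hypothetical helper: total artifact-body capacity of the ring in bytes.
fn ring_capacity_bytes() -> u64 {
    let segments = DESIRED_OLD_SEGMENTS + DESIRED_CURRENT_SEGMENTS + DESIRED_NEW_SEGMENTS;
    segments * MAX_SEGMENT_BYTES
}

/// Hypothetical helper: how many builds of a given write volume fit in the
/// ring before the oldest segment has to be unlinked.
fn whole_builds_that_fit(bytes_written_per_build: u64) -> u64 {
    ring_capacity_bytes() / bytes_written_per_build
}
```

Five segments of 512 MiB give 2.5 GiB. With roughly 1.3 GiB written per build, only one whole build fits, so the second pass already starts evicting blobs written by the first.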

Suggested fix

Primary — validate referenced blobs in GetActionResult: before returning a cached ActionResult, check that every referenced digest still exists in the CAS (output_files[].digest, output_directories[].tree_digest plus the Tree’s sub-blobs, stdout_digest, stderr_digest). If any are missing, return NOT_FOUND (and ideally delete the stale AC entry). This is what other REAPI caches (bazel-remote, Buildbarn) do, and it is what the Remote Execution API expects of action caches backed by evicting CAS storage: eviction then degrades to a cache miss → the client re-executes and re-uploads, instead of failing. The existence checks are metadata lookups that can go through the existing existence cache, so the per-hit cost is bounded; correctness here is worth the extra lookups.
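
The primary fix could look roughly like the sketch below. It is not Kura’s code: Digest, ActionResult, CasIndex and get_action_result_validated are simplified, hypothetical stand-ins that carry only the fields named above, and an in-memory map plays the role of the keyvalue store.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for a REAPI digest.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Digest {
    hash: String,
    size_bytes: i64,
}

/// Simplified stand-in for an ActionResult: only the referenced digests.
#[derive(Clone, Debug, Default)]
struct ActionResult {
    output_files: Vec<Digest>,
    output_directory_trees: Vec<Digest>,
    stdout_digest: Option<Digest>,
    stderr_digest: Option<Digest>,
}

/// Hypothetical CAS index: which blobs exist, and which sub-blobs each Tree references.
#[derive(Default)]
struct CasIndex {
    present: HashSet<Digest>,
    tree_children: HashMap<Digest, Vec<Digest>>,
}

impl CasIndex {
    /// True only if every digest the result references is still in the CAS.
    fn has_all(&self, result: &ActionResult) -> bool {
        let direct = result
            .output_files
            .iter()
            .chain(result.output_directory_trees.iter())
            .chain(result.stdout_digest.iter())
            .chain(result.stderr_digest.iter());
        for digest in direct {
            if !self.present.contains(digest) {
                return false;
            }
        }
        // A Tree is only usable if its sub-blobs survived as well.
        for tree in &result.output_directory_trees {
            match self.tree_children.get(tree) {
                Some(children) => {
                    if children.iter().any(|c| !self.present.contains(c)) {
                        return false;
                    }
                }
                None => return false,
            }
        }
        true
    }
}

/// Returns None (the NOT_FOUND case) and drops the stale entry when any
/// referenced blob is gone, so the client sees a plain cache miss.
fn get_action_result_validated(
    action_cache: &mut HashMap<String, ActionResult>,
    cas: &CasIndex,
    action_key: &str,
) -> Option<ActionResult> {
    let complete = cas.has_all(action_cache.get(action_key)?);
    if complete {
        action_cache.get(action_key).cloned()
    } else {
        action_cache.remove(action_key);
        None
    }
}
```

The sketch treats an unknown Tree as missing, which errs toward a miss. A real implementation would also need to decide how to handle inlined outputs and the empty blob, which this sketch ignores.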

Secondary — make CAS capacity configurable: expose the segment-ring generation counts (or a byte budget) as a KURA_* env var, or derive them from the data volume size like the FD/memory budgets already are. 2.5 GiB is smaller than the working set of a single Rust/Bazel build, so REAPI workloads churn the ring constantly, maximizing exposure to the poisoning above (and generally getting near-zero cache hit rates on cold builds).
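
One possible shape for the secondary fix, as a sketch only. The env var name KURA_CAS_RING_BYTES, the 50% share of the data volume, and both function names are assumptions made for illustration; the issue only asks for some KURA_* variable or a disk-derived budget.

```rust
const MAX_SEGMENT_BYTES: u64 = 512 * 1024 * 1024; // 512 MiB, as quoted in this issue
const MIN_SEGMENTS: u64 = 5; // today's 1 + 2 + 2 generations as a floor

/// Hypothetical policy: an explicit byte budget wins; otherwise use half of
/// the data volume. The result is the number of segments the ring may hold.
fn ring_segment_count(env_budget: Option<&str>, volume_bytes: u64) -> u64 {
    let budget = env_budget
        .and_then(|raw| raw.trim().parse::<u64>().ok())
        .unwrap_or(volume_bytes / 2);
    (budget / MAX_SEGMENT_BYTES).max(MIN_SEGMENTS)
}

/// Reads the (hypothetical) KURA_CAS_RING_BYTES variable and applies the policy.
fn ring_segment_count_from_env(volume_bytes: u64) -> u64 {
    let raw = std::env::var("KURA_CAS_RING_BYTES").ok();
    ring_segment_count(raw.as_deref(), volume_bytes)
}
```

Under these assumptions, the 496 GiB node from the root-cause section would get 496 segments (248 GiB) instead of 5, and an unparsable or tiny override falls back to the current floor rather than shrinking the ring.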

Touch-on-FindMissingBlobs/AC-hit (lease emulation): treat existence checks and action-cache hits as liveness signals. When a queried or referenced blob sits in an old-generation segment, refresh it forward into the active segment. The maybe_refresh_manifest machinery already does exactly this, but today it only triggers on actual blob reads. Under Bazel build-without-the-bytes, intermediate output blobs get AC hits and FindMissingBlobs queries but are never read, so refresh-on-read never extends their lifetime even though they are logically hot. Touching on these signals closes that blind spot and attacks the eviction itself rather than just detecting it. It also covers the dedup case: a client that skips an upload because FindMissingBlobs reported the blob present simultaneously rescues that blob from eviction. This composes with Bazel’s lease-extension behavior (--experimental_remote_cache_ttl + lease extension), where the client periodically calls FindMissingBlobs for blobs it still depends on, expecting the server to keep them alive. Cost: occasional sequential re-appends into the active segment, already gated by the existing memory-pressure check.
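
A toy model of the touch mechanism, to make the lifetime argument concrete. The Ring type and all of its fields and methods are hypothetical; it does not use maybe_refresh_manifest and it omits the memory-pressure gate.

```rust
use std::collections::HashMap;

/// Toy segment ring: segment ids only grow, and each blob maps to the
/// segment that currently holds its body.
struct Ring {
    active: u64,   // id of the segment currently being appended to
    capacity: u64, // number of segments retained before the oldest is unlinked
    young: u64,    // how many segments, counting back from active, count as young
    location: HashMap<String, u64>,
}

impl Ring {
    /// Upload: the blob body lands in the active segment.
    fn put(&mut self, hash: &str) {
        self.location.insert(hash.to_string(), self.active);
    }

    /// Liveness signal. Returns false if the blob is missing; otherwise
    /// re-appends an old-generation blob into the active segment.
    fn touch(&mut self, hash: &str) -> bool {
        let (active, young) = (self.active, self.young);
        match self.location.get_mut(hash) {
            Some(segment) => {
                if active - *segment >= young {
                    *segment = active;
                }
                true
            }
            None => false,
        }
    }

    /// FindMissingBlobs that also treats each query as a touch.
    fn find_missing_blobs(&mut self, hashes: &[&str]) -> Vec<String> {
        let mut missing = Vec::new();
        for hash in hashes {
            if !self.touch(hash) {
                missing.push(hash.to_string());
            }
        }
        missing
    }

    /// Rotation: open a new active segment and drop blobs in expired segments.
    fn rotate(&mut self) {
        self.active += 1;
        let (active, capacity) = (self.active, self.capacity);
        self.location.retain(|_, segment| *segment + capacity > active);
    }
}
```

In this model a blob that is queried once per ring turnover never evicts, while an unqueried blob written at the same time is dropped on schedule. The same touch call could be made for every digest referenced by an action-cache hit.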

O(1) AC age gate (cheap alternative to per-ref validation): instead of checking every referenced digest on GetActionResult, compare one timestamp: return NOT_FOUND when the AC entry’s version_ms is older than the oldest ring segment’s created_at_ms. The invariant that makes this sound: if every blob a new AC entry references is in the ring’s young generations at AC-write time (the client uploaded them just before, or touch-on-FindMissingBlobs refreshed the deduplicated ones), then an AC entry can only dangle once it predates the oldest surviving segment. Hot-path cost is essentially zero — a single comparison against segment state, cacheable in memory — with no per-ref lookups and no Tree sub-blob walking problem. Trade-offs: it is conservative (rejects some still-valid old entries, which degrade to a clean miss → re-execute) and it requires the touch mechanism above to hold the invariant; without it, deduplicated blobs older than the AC entry break the soundness argument.
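
The age gate reduces to one comparison against cached ring state. A minimal sketch, assuming version_ms and created_at_ms are millisecond timestamps on the same clock; the RingClock type and its method names are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// In-memory copy of the oldest surviving segment's created_at_ms.
struct RingClock {
    oldest_created_at_ms: AtomicU64,
}

impl RingClock {
    /// Called after a rotation unlinks the oldest segment.
    fn on_rotation(&self, new_oldest_created_at_ms: u64) {
        self.oldest_created_at_ms
            .store(new_oldest_created_at_ms, Ordering::Release);
    }

    /// True if an AC entry with this version_ms may be served. Entries that
    /// predate the oldest surviving segment are answered with NOT_FOUND.
    fn admits(&self, version_ms: u64) -> bool {
        version_ms >= self.oldest_created_at_ms.load(Ordering::Acquire)
    }
}
```

The check is only as sound as the invariant described above: it says nothing about whether any individual blob exists, so it relies on the touch mechanism to keep deduplicated blobs at least as young as the AC entries that reference them.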

The first fix is the correctness fix — with it, capacity only affects hit rates, never build outcomes. Without it, any blob loss (eviction, manual cleanup, future retention policies) silently poisons every action-cache entry that references the lost blobs. The age gate is an alternative serve-time check when per-ref validation is judged too costly for the hottest RPC, and the touch mechanism complements either: it reduces how often hot blobs evict in the first place instead of only converting the failure into a miss.

Reproduction

  1. Point Bazel at a Kura node (--remote_cache=grpc://...) and run a build whose outputs exceed ~2.5 GiB cumulative (or run several full builds so the segment ring rotates).
  2. bazel clean --expunge
  3. bazel build <target> → fails with Lost inputs no longer available remotely for early-build tool outputs (e.g. rules_rust process_wrapper, cargo build-script runners); Bazel’s retry loop repeats the failure indefinitely.

Observed on Kura 0.8.0 (dev image built from main) with Bazel 9.1.0 building the Kura crate graph itself via rules_rust. Production deployments run the same constants, so any sufficiently large Bazel adopter will hit this.

🤖 Generated with Claude Code

Comments
tuist-atlas[bot] Jun 13, 2026

A fix for the segment rotation and CAS accounting issues described here is now available in kura@0.10.1 via PR #11246. The release hardens segment disk accounting at rotation and startup to prevent the eviction problems that caused GetActionResult to return stale action-cache hits. Update your Kura deployment to ghcr.io/tuist/kura:0.10.1 to pick up this fix.

tuist-atlas[bot] Jun 13, 2026

Looking at the release notes for Kura 0.10.0 (kura@0.10.0), the feature “derive CAS segment-ring capacity from disk size” addresses the capacity configuration aspect of this issue.

This change implements the secondary suggestion from the issue description: instead of using the hard-coded 2.5 GiB segment ring capacity, Kura now derives the CAS segment-ring generation counts from the available disk size. This should reduce premature segment rotation on nodes with ample storage.

Update to kura@0.10.0 (Docker image: ghcr.io/tuist/kura:0.10.0) to get this improvement.

tuist-atlas[bot] Jun 19, 2026

Looking at the release notes for kura@0.10.4, I see this includes a performance improvement that caches the parsed segment ring state in memory (PRs #11247 and #11249). This change is related to the segment ring lifecycle management discussed in this issue.

The release is now available as ghcr.io/tuist/kura:0.10.4. You can update to this version to pick up the segment ring state caching improvement.