kura: REAPI ByteStream uploads fail with fd_pool_exhausted under bursty concurrent uploads (Bazel remote cache)

GitHub issue · Closed

Metadata
Source: tuist/tuist #11132
Updated: Jun 24, 2026
Domains: Kura

Details

Summary

When Kura is used as a Bazel/Buck2 remote cache, ByteStream/Write uploads intermittently fail under bursty concurrency with:

INTERNAL: failed to persist CAS blob: fd_pool_exhausted:
timed out waiting 5s for file descriptor permit during open_file

A failed upload means the client can’t write that action’s result, so the action is never cached and re-executes on every build.

Where it bites

Cargo build scripts can emit large directory outputs: librocksdb-sys produces ~339 .o files, which Bazel uploads concurrently via ByteStream. Each write needs FD-pool permits (temp-file create + segment-append open). When the pool (auto-derived from RLIMIT_NOFILE) runs out, Kura does not wait or queue; it fails the write after a 5 s permit timeout.
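
The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not Kura's implementation: `FdPool`, `write_blob`, and `run_simulation` are hypothetical names, the pool is modeled as a plain `asyncio.Semaphore`, each write takes a single permit, and the timings are scaled down from the 5 s in the error message.

```python
import asyncio


class FdPoolExhausted(Exception):
    """Raised when no permit becomes available within the timeout."""


class FdPool:
    """Hypothetical stand-in for an FD permit pool with a hard acquire timeout."""

    def __init__(self, size: int, timeout_s: float) -> None:
        self._sem = asyncio.Semaphore(size)
        self._timeout_s = timeout_s

    async def acquire(self, op: str) -> None:
        try:
            await asyncio.wait_for(self._sem.acquire(), self._timeout_s)
        except asyncio.TimeoutError:
            raise FdPoolExhausted(
                f"timed out waiting {self._timeout_s}s for file descriptor "
                f"permit during {op}"
            ) from None

    def release(self) -> None:
        self._sem.release()


async def write_blob(pool: FdPool, hold_s: float) -> bool:
    """Return True if the write got a permit, False if the pool timed out."""
    try:
        await pool.acquire("open_file")
    except FdPoolExhausted:
        return False
    try:
        await asyncio.sleep(hold_s)  # stand-in for the actual disk I/O
        return True
    finally:
        pool.release()


async def simulate(uploads: int, pool_size: int, timeout_s: float, hold_s: float) -> int:
    pool = FdPool(pool_size, timeout_s)
    results = await asyncio.gather(
        *(write_blob(pool, hold_s) for _ in range(uploads))
    )
    return results.count(False)  # number of failed writes


def run_simulation(uploads: int, pool_size: int, timeout_s: float, hold_s: float) -> int:
    return asyncio.run(simulate(uploads, pool_size, timeout_s, hold_s))


if __name__ == "__main__":
    # A burst larger than the pool, with writes that outlive the permit timeout.
    print(run_simulation(uploads=8, pool_size=2, timeout_s=0.05, hold_s=0.2))
```

In this toy model a burst fails whenever the writes holding permits take longer than the permit timeout, which matches the shape of the reported error even though the real pool and I/O path are more involved.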

File-output actions (single blob) never hit this, so it’s invisible until a directory-output workload (rocksdb) shows up.

Evidence (rocksdb build-script round-trip against a Kura cache)

| Metric | default pool | KURA_FILE_DESCRIPTOR_POOL_SIZE=4096 + --ulimit nofile=16384 |
| --- | --- | --- |
| ByteStream/Write failures (build #1) | ~50 / 799 (fd_pool_exhausted) | 0 / 799 |
| GetActionResult on rebuild (build #2) | rocksdb = “action result not found” → recompile | 145/145 hits, ~5 s, no recompile |

So this is purely an FD-pool capacity/behavior issue: raising the pool size and the nofile ulimit resolves it.
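
To sanity-check a deployment against the numbers above, a small preflight script can compare the configured pool with the process's descriptor limit. This is a rough sketch under stated assumptions: `worst_case_permits` and `fd_headroom` are hypothetical helpers, and the two-permits-per-write figure is only inferred from the "temp-file create + segment-append open" description, not confirmed against Kura's code. The `resource` module is Unix-only.

```python
import os
import resource

# Assumption for illustration: one permit for the temp-file create and one
# for the segment-append open.
PERMITS_PER_WRITE = 2


def worst_case_permits(concurrent_writes: int,
                       permits_per_write: int = PERMITS_PER_WRITE) -> int:
    """Permits needed if every write in a burst is in flight at once."""
    return concurrent_writes * permits_per_write


def fd_headroom(pool_size: int, soft_limit: int) -> int:
    """Descriptors left for everything else once the pool is fully used.

    A negative value means the pool is larger than the process may open.
    """
    return soft_limit - pool_size


if __name__ == "__main__":
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"RLIMIT_NOFILE soft limit: {soft_limit}")
    print(f"worst-case permits for 339 concurrent writes: {worst_case_permits(339)}")
    raw = os.environ.get("KURA_FILE_DESCRIPTOR_POOL_SIZE")
    if raw is not None:
        print(f"headroom with pool size {raw}: {fd_headroom(int(raw), soft_limit)}")
```

With the mitigated configuration from the table (pool 4096, nofile 16384), the headroom works out to 12288 descriptors. What the default pool size is on a given host is not stated in this issue, so the script does not guess it.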

Proposed fix (options)

  1. Write backpressure instead of hard-fail — when the FD pool is saturated, wait/queue (or return a retryable status) rather than failing the write after 5 s. This is the robust fix for any concurrent-upload client.
  2. Size the pool for client upload concurrency — raise the default / auto-derivation headroom, and/or document tuning for build-cache deployments.
  3. Reduce per-write FD usage on the ByteStream → segment path.
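
Option 1 can be illustrated with a queueing variant of the permit pool. This is a minimal sketch of the general backpressure idea rather than a proposed patch: `run_backpressure` is a hypothetical name, the pool is again modeled as an `asyncio.Semaphore`, and each write takes one permit. Writers wait for a permit without a hard timeout, so a burst is slowed down instead of partially failed.

```python
import asyncio
from typing import Tuple


async def simulate(uploads: int, pool_size: int, hold_s: float) -> Tuple[int, int]:
    sem = asyncio.Semaphore(pool_size)
    in_flight = 0
    peak = 0
    completed = 0

    async def write_blob() -> None:
        nonlocal in_flight, peak, completed
        # Wait in the semaphore's queue instead of failing after a timeout.
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(hold_s)  # stand-in for the actual disk I/O
            in_flight -= 1
            completed += 1

    await asyncio.gather(*(write_blob() for _ in range(uploads)))
    return completed, peak


def run_backpressure(uploads: int, pool_size: int, hold_s: float) -> Tuple[int, int]:
    """Return (completed writes, peak number of writes holding a permit)."""
    return asyncio.run(simulate(uploads, pool_size, hold_s))


if __name__ == "__main__":
    completed, peak = run_backpressure(uploads=8, pool_size=2, hold_s=0.01)
    print(f"completed={completed} peak_in_flight={peak}")
```

The trade-off is latency: every write completes, but the burst takes roughly `uploads / pool_size` times as long as a single write. A real server would still want an upper bound, for example the client's RPC deadline, so that waiters cannot queue indefinitely.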

Notes

  • Distinct from (and in addition to) the ByteStream flush fix in #11129, which is necessary but not sufficient for these workloads.
  • Currently mitigated only at the deployment level (local-ci sets the pool + ulimit on its cache node). Production Kura-as-remote-cache for Bazel/Buck2 needs a real answer here.
Comments
tuist-atlas[bot] Jun 9, 2026

A fix for this issue is now available in Kura 0.7.4. The FD pool is now derived from process capacity, which should resolve the fd_pool_exhausted errors during bursty concurrent uploads. Update to ghcr.io/tuist/kura:0.7.4.