kura: REAPI ByteStream uploads fail with fd_pool_exhausted under bursty concurrent uploads (Bazel remote cache)

Metadata

Source: tuist/tuist #11132

Updated: Jun 24, 2026

Domains: Kura

Details

Summary

When Kura is used as a Bazel/Buck2 remote cache, ByteStream/Write uploads intermittently fail under bursty concurrency with:

INTERNAL: failed to persist CAS blob: fd_pool_exhausted:
          timed out waiting 5s for file descriptor permit during open_file

A failed upload means the client can’t write that action’s result, so the action is never cached and re-executes on every build.

Where it bites

Cargo build scripts emit large directory outputs: librocksdb-sys produces ~339 .o files, which Bazel uploads concurrently via ByteStream. Each write needs FD-pool permits (temp-file create + segment-append open). When the burst exceeds the pool, which is auto-derived from RLIMIT_NOFILE, a write waits at most 5 s for a permit and then fails outright with INTERNAL instead of staying queued or returning a retryable status.

File-output actions (single blob) never hit this, so it’s invisible until a directory-output workload (rocksdb) shows up.
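To make the failure mode concrete, here is a rough back-of-the-envelope model, not Kura's actual scheduler. It assumes every write in the burst arrives at once, holds its permits for a fixed time, and gives up after the permit timeout. The function name, the 2-permits-per-write figure, and the hold time are illustrative assumptions; only the 339-file burst and the 5 s timeout come from this issue.

```python
import math


def estimate_timeouts(n_writes, pool_size, permits_per_write, hold_s, timeout_s):
    """Estimate how many writes in one burst miss the permit deadline.

    Simplified wave model (an assumption): all writes arrive at t=0, each
    holds its permits for hold_s seconds, and freed permits go straight to
    the next waiters.
    """
    slots = pool_size // permits_per_write  # writes that can run at once
    if slots == 0:
        return n_writes  # no write can ever obtain its permits
    waves = math.ceil(n_writes / slots)
    # Wave w starts at w * hold_s; it is served only if it starts in time.
    served_waves = sum(1 for w in range(waves) if w * hold_s <= timeout_s)
    served = min(n_writes, served_waves * slots)
    return n_writes - served
```

With hypothetical inputs such as a 32-permit pool and 0.3 s per write, the model reports timeouts for a 339-write burst, while a 4096-permit pool serves the whole burst in a single wave.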

Evidence (rocksdb build-script round-trip against a Kura cache)

| | default pool | `KURA_FILE_DESCRIPTOR_POOL_SIZE=4096` + `--ulimit nofile=16384` |
| --- | --- | --- |
| `ByteStream/Write` failures (build #1) | ~50 / 799 (`fd_pool_exhausted`) | 0 / 799 |
| `GetActionResult` on rebuild (build #2) | rocksdb = “action result not found” → recompile | 145/145 hits, ~5 s, no recompile |

So it’s purely an FD-pool capacity/behavior issue: raising the pool size together with the nofile ulimit eliminates the failures.
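For anyone applying the same mitigation, a small pre-flight check along these lines can confirm that the explicit pool size actually fits under the process limit. This is a sketch under assumptions: `fits_fd_budget` and the 256-descriptor reserve are hypothetical, and how Kura itself validates or clamps `KURA_FILE_DESCRIPTOR_POOL_SIZE` is not stated in this issue.

```python
import os
import resource


def fits_fd_budget(pool_size, soft_limit, reserved=256):
    """True when the pool plus a reserve for sockets and logs fits the limit.

    reserved=256 is an illustrative headroom value, not a Kura default.
    """
    return pool_size + reserved <= soft_limit


if __name__ == "__main__":
    # Soft RLIMIT_NOFILE of the current process (Unix only).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    pool = int(os.environ.get("KURA_FILE_DESCRIPTOR_POOL_SIZE", "0"))
    if soft != resource.RLIM_INFINITY and pool and not fits_fd_budget(pool, soft):
        print(f"pool {pool} does not fit under RLIMIT_NOFILE soft limit {soft}")
```

Run inside the cache container, this would flag a configuration where the pool is raised to 4096 but the container still has a low nofile limit.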

Proposed fix (options)

1. Write backpressure instead of hard-fail: when the FD pool is saturated, wait/queue (or return a retryable status) rather than failing the write after 5 s. This is the robust fix for any concurrent-upload client.
2. Size the pool for client upload concurrency: raise the default / auto-derivation headroom, and/or document tuning for build-cache deployments.
3. Reduce per-write FD usage on the ByteStream → segment path.
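The backpressure option could look roughly like the sketch below, which models the FD pool as an asyncio semaphore. It is not Kura's code: `FdPool`, `write_blob`, and `run_burst` are hypothetical names, status values are plain strings rather than real gRPC statuses, and `UNAVAILABLE` is just one candidate for a retryable status. Passing `permit_timeout_s=None` models pure queueing; a finite timeout models "give up, but tell the client to retry".

```python
import asyncio


class FdPool:
    """Hypothetical stand-in for the FD permit pool."""

    def __init__(self, size):
        self._sem = asyncio.Semaphore(size)

    async def acquire(self, timeout):
        # timeout=None waits indefinitely (pure backpressure).
        try:
            await asyncio.wait_for(self._sem.acquire(), timeout)
            return True
        except asyncio.TimeoutError:
            return False

    def release(self):
        self._sem.release()


async def write_blob(pool, hold_s, permit_timeout_s):
    if not await pool.acquire(permit_timeout_s):
        # Retryable status instead of a hard INTERNAL failure.
        return "UNAVAILABLE"
    try:
        await asyncio.sleep(hold_s)  # stands in for the file I/O
        return "OK"
    finally:
        pool.release()


async def burst(n_writes, pool_size, hold_s, permit_timeout_s):
    pool = FdPool(pool_size)
    writes = (write_blob(pool, hold_s, permit_timeout_s) for _ in range(n_writes))
    return await asyncio.gather(*writes)


def run_burst(n_writes, pool_size, hold_s, permit_timeout_s):
    return asyncio.run(burst(n_writes, pool_size, hold_s, permit_timeout_s))
```

In queueing mode every write in the burst eventually completes, at the cost of higher latency; with a finite timeout the overflow is rejected with a status the client can retry on, so the action result still gets written once the burst drains.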

Notes

Distinct from (and in addition to) the ByteStream flush fix in #11129, which is necessary but not sufficient for these workloads.
Currently mitigated only at the deployment level (local-ci sets the pool + ulimit on its cache node). Production Kura-as-remote-cache for Bazel/Buck2 needs a real answer here.

Comments

tuist-atlas[bot] Jun 9, 2026

A fix for this issue is now available in Kura 0.7.4. The FD pool is now derived from process capacity, which should resolve the fd_pool_exhausted errors during bursty concurrent uploads. Update to ghcr.io/tuist/kura:0.7.4.