fix(kura): flush REAPI ByteStream uploads before persisting
GitHub issue · Closed
What
The REAPI ByteStream/Write handler streamed upload chunks into a temp file with write_all, then persisted the blob by re-opening that path on a separate descriptor (stat + copy into a segment) — without flushing the temp file first. This flushes and closes the temp file before persisting, and adds a regression test.
Why (root cause)
tokio::fs::File buffers writes and flushes lazily. With no explicit flush, the persist path’s stat + copy on a separate descriptor raced the background flush and intermittently failed with "failed to persist CAS blob: appended N bytes into segment …, expected M", which ByteStream/Write surfaced as INTERNAL.
The failure rate scales with the number of concurrent blob uploads, so it reliably broke remote caching of cargo build scripts. A build script’s OUT_DIR directory output uploads dozens of blobs at once (e.g. librocksdb-sys ≈ 339 .o files), and a subset of those ByteStream/Writes failed. Bazel then skipped UpdateActionResult for the affected actions, so they re-executed on every build: librocksdb-sys recompiled on every clean build (~8 min), cascading to everything that links it. Plain file-output actions (rustc rlibs) upload a single blob and rarely raced, so they cached fine and masked the bug.
Found while dogfooding Kura as the Bazel remote cache for the Kura Bazel build: a fresh build got hundreds of Kura cache hits yet still recompiled rocksdb.
Why this fix
Kura’s HTTP upload path (read_request_to_temp in utils.rs) already flushes before persisting; the ByteStream handler was the lone exception. Flushing + dropping the handle makes persist re-read a fully-written, closed file — minimal and matches the existing pattern. The action-cache and per-blob CAS round-trips were already correct; only the upload’s durability ordering was wrong.
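The ordering problem can be reproduced in miniature with the standard library alone. The sketch below is an analogy, not Kura's code: std::io::BufWriter stands in for tokio::fs::File, fs::metadata stands in for the persist path's stat on a separate descriptor, and upload_then_stat is a hypothetical helper name. Unlike the real race, the std version is deterministic, which makes the effect easy to see.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Streams `chunks` into `path` through a buffered writer, then reports the
/// size that a separate `stat` of the same path observes.
///
/// Hypothetical helper for illustration only; `BufWriter` is a stand-in for
/// the buffered async file handle used by the real handler.
fn upload_then_stat(path: &Path, chunks: &[&[u8]], flush_first: bool) -> io::Result<u64> {
    // Explicit capacity so small chunks are certain to stay in the buffer.
    let mut writer = BufWriter::with_capacity(64 * 1024, File::create(path)?);
    for chunk in chunks {
        writer.write_all(chunk)?;
    }

    if flush_first {
        // The fix: push buffered bytes to the file, then close the handle,
        // before anything else looks at the path.
        writer.flush()?;
        drop(writer);
    }

    // Without the flush above, the writer is still open here and its buffered
    // bytes have not reached the file, so the observed size is too small.
    let observed = fs::metadata(path)?.len();
    Ok(observed)
}
```

Called with the two chunks b"hello " and b"world", the function returns 0 when flush_first is false and 11 when it is true. The short count in the unflushed case corresponds to the N versus M mismatch in the persist error.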
Impact
- Remote caching of cargo build scripts (and any action uploading many blobs concurrently / with directory outputs) now works through Kura.
- No behavior change for already-working file-output actions.
Validation
Built a Kura image with the fix and re-ran a build-script round-trip against it, comparing Bazel --remote_grpc_log:
| | before | after |
|---|---|---|
| ByteStream/Write failures | 24/253 | 0/253 |
| actions cached (UpdateActionResult) | 59/81 | 81/81 |
| GetActionResult not-found on rebuild | 22 | 0 |
| build scripts re-executed on rebuild | 22 | 0 |
This reaches parity with Bazel’s local --disk_cache (81/81). A new regression test (bytestream_writes_persist_completely_under_concurrency) drives the real gRPC handler with concurrent multi-chunk uploads and asserts byte-exact round-trips; it fails reliably (5/5) without the fix and passes with it. cargo clippy --all-targets -- -D warnings and cargo fmt --check are clean.
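For readers who want the shape of such a check without the gRPC stack, here is a minimal std-only sketch. It is not the regression test from this change: it uses threads and plain files instead of the real handler, and the names upload_and_read_back and concurrent_round_trips, the 4096-byte chunk size, and the blob-N.tmp file names are all assumptions made for illustration.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::{Path, PathBuf};
use std::thread;

/// Hypothetical stand-in for one upload: stream chunks into a temp file,
/// flush and close it, then re-read it by path as a persist step would.
fn upload_and_read_back(path: &Path, chunks: &[Vec<u8>]) -> io::Result<Vec<u8>> {
    let mut writer = BufWriter::new(File::create(path)?);
    for chunk in chunks {
        writer.write_all(chunk)?;
    }
    writer.flush()?; // make every byte visible through the path
    drop(writer); // close the handle before re-opening
    fs::read(path)
}

/// Runs `uploads` concurrent multi-chunk uploads inside `dir` and returns how
/// many of them round-tripped byte-exactly.
fn concurrent_round_trips(dir: &Path, uploads: usize, chunks_per_upload: usize) -> usize {
    let handles: Vec<_> = (0..uploads)
        .map(|i| {
            let path: PathBuf = dir.join(format!("blob-{i}.tmp"));
            thread::spawn(move || {
                // Fill bytes vary per upload and per chunk so truncation or
                // mixed-up content shows up as a mismatch.
                let chunks: Vec<Vec<u8>> = (0..chunks_per_upload)
                    .map(|c| vec![(i + c) as u8; 4096])
                    .collect();
                let expected: Vec<u8> = chunks.concat();
                let actual = upload_and_read_back(&path, &chunks).expect("upload failed");
                let _ = fs::remove_file(&path);
                actual == expected
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join().expect("upload thread panicked"))
        .filter(|ok| *ok)
        .count()
}
```

Given an existing scratch directory, concurrent_round_trips(&dir, 16, 8) should report 16 successful round-trips. Comparing whole byte vectors rather than lengths is the design point: a length check alone would miss a blob that has the right size but the wrong content.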
🤖 Generated with Claude Code