Hive
Kura: large-artifact peer replication can’t finish within the fixed 30s peer timeout
GitHub issue · Open
Summary
Kura advertises a maximum replicated artifact body of 2 GiB (MAX_REPLICATION_BODY_BYTES = 4 × MAX_SEGMENT_BYTES), but peer-to-peer replication uses an HTTP client with a fixed 30 s total request timeout that is not configurable (kura/src/peer_tls.rs:76-77):
Client::builder()
    .connect_timeout(Duration::from_secs(5))
    .timeout(Duration::from_secs(30))
When an artifact can’t be transferred and ingested+assembled by the receiver within 30 s, the request is aborted, the partial receive is discarded, and the outbox retries from byte 0 — forever. The artifact never lands on peers. This is the same failure shape as the gateway HTTP/2 window regression (#11266), but on the node→node replication path (:7443), which the gateway window fix does not touch.
Required throughput to honor the 2 GiB limit
For replication to actually support the artifact sizes it accepts, the entire request — body transfer + the receiver’s synchronous temp-write and segment-assemble + the 204 response — must complete within 30 s.
Transfer-only floor (ignores receiver assembly time):
| Artifact | Rate to move the body in 30 s |
|---|---|
| 512 MiB (one segment) | ~17.9 MB/s (~143 Mbps) |
| 1 GiB | ~35.8 MB/s (~286 Mbps) |
| 2 GiB (advertised max) | ~71.6 MB/s (~573 Mbps) |
But the receiver streams the body to a temp file and then copies it into a segment before replying 204, so a 2 GiB artifact incurs ~6 GiB of receiver disk I/O (write temp + read temp + write segment) inside the same 30 s window. At ~520 MiB/s that is ~12 s of pure disk, leaving ~18 s for transfer → the link/sender must actually sustain ~118 MB/s (~945 Mbps) end-to-end. On slower or contended disks the assembly alone can eat the whole 30 s.
Bottom line: to replicate the maximum 2 GiB body cross-region within the current 30 s timeout, every inter-region peer link must sustain ≳ 573 Mbps (transfer floor), and realistically ~0.9–1 Gbps once receiver assembly is counted — per replication stream.
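The arithmetic behind these figures can be reproduced with a short sketch. It assumes, as the estimate above does, that receiver disk time and transfer time add up serially inside the timeout. The function and parameter names are illustrative and are not taken from Kura.

```rust
const MIB: f64 = 1024.0 * 1024.0;
const GIB: f64 = 1024.0 * MIB;

/// Sustained transfer rate (decimal MB/s) needed to finish one request in time.
///
/// `disk_io_multiplier` is how many times the body crosses the receiver's disk
/// inside the request (0 = transfer-only floor, 3 = temp write + temp read +
/// segment write). Returns `None` when disk work alone exceeds the timeout.
/// Assumption: disk time and transfer time do not overlap.
fn required_mb_per_s(
    body_bytes: f64,
    timeout_s: f64,
    disk_io_multiplier: f64,
    disk_bytes_per_s: f64,
) -> Option<f64> {
    let disk_s = body_bytes * disk_io_multiplier / disk_bytes_per_s;
    let transfer_s = timeout_s - disk_s;
    if transfer_s <= 0.0 {
        return None;
    }
    Some(body_bytes / transfer_s / 1e6)
}

/// Converts decimal MB/s to Mbps.
fn to_mbps(mb_per_s: f64) -> f64 {
    mb_per_s * 8.0
}
```

With a 2 GiB body and a 30 s timeout, a multiplier of 0 gives the transfer-only floor from the table, and a multiplier of 3 at 520 MiB/s gives the end-to-end figure. A disk that sustains only 200 MiB/s makes the function return `None`, which corresponds to assembly alone consuming the whole window.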
This is hard cross-region because a single connection’s throughput is bounded by the flow-control window ÷ RTT (bandwidth-delay product). At ~150 ms inter-region RTT, one stream needs roughly an 11–18 MB in-flight window just to reach 573–945 Mbps, far above typical defaults. (Same window-vs-RTT effect #11266 fixed for the gateway gRPC path — it must be verified for the peer client too.)
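The window figures follow directly from the bandwidth-delay product. A minimal sketch of the calculation, with illustrative function names that are not part of Kura:

```rust
/// In-flight bytes one stream needs to sustain `target_mbps` at `rtt_ms`
/// (bandwidth-delay product).
fn required_window_bytes(target_mbps: f64, rtt_ms: f64) -> f64 {
    (target_mbps * 1e6 / 8.0) * (rtt_ms / 1e3)
}

/// Upper bound on one stream's throughput (Mbps) for a given window and RTT.
fn max_mbps(window_bytes: f64, rtt_ms: f64) -> f64 {
    window_bytes * 8.0 / (rtt_ms / 1e3) / 1e6
}
```

At 150 ms, `required_window_bytes(573.0, 150.0)` is about 10.7 MB and `required_window_bytes(945.0, 150.0)` is about 17.7 MB. For comparison, a 65,535-byte window (the HTTP/2 default initial flow-control window) caps a single stream at roughly 3.5 Mbps at the same RTT. Whether the peer client actually runs with that default is not established in this issue and is part of what needs verifying.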
Why it’s worse in practice
Several mechanisms push real throughput well below those targets:
- Un-pipelined sender reads: SegmentReader (kura/src/segment/reader.rs) reads its own segment in 512 KB chunks with a single in-flight spawn_blocking read and no read-ahead. Under CPU contention this alone dropped to ~4.6 MB/s in testing.
- Adaptive bandwidth limiter: effective_bytes_per_second (kura/src/bandwidth.rs) divides the 512 MiB/s default by a public-latency-pressure divisor up to 64×, i.e. as low as 8 MiB/s. At 8 MiB/s only ~240 MiB can replicate in 30 s, which is less than a single 512 MiB segment.
- No resume: every timeout retries from byte 0, so a stream that gets 90% through wastes all of it (observed ~4.7 GB of wasted egress across retries for one 520 MiB artifact).
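The limiter arithmetic can be checked with a simplified model. This is a sketch of the divide-by-pressure behavior described above, not the actual code in kura/src/bandwidth.rs; `shaped_rate` and `bytes_within` are hypothetical names.

```rust
const MIB: u64 = 1024 * 1024;

/// Simplified model of the adaptive limiter: the default rate divided by a
/// pressure divisor clamped to 1..=64 (the maximum divisor stated above).
fn shaped_rate(default_bytes_per_s: u64, divisor: u64) -> u64 {
    default_bytes_per_s / divisor.clamp(1, 64)
}

/// Bytes that fit in one request at a constant rate before the timeout fires.
fn bytes_within(rate_bytes_per_s: u64, timeout_s: u64) -> u64 {
    rate_bytes_per_s * timeout_s
}
```

At the maximum divisor, `shaped_rate(512 * MIB, 64)` is 8 MiB/s and `bytes_within(8 * MIB, 30)` is 240 MiB, so a 512 MiB segment cannot complete in one attempt no matter how often it is retried.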
Evidence
Reproduced on a 3-node docker-compose mesh uploading a 520 MiB module (kura/spec/e2e/large_artifact_spec.sh, added in #11296):
- A direct curl PUT (flat file, no sender machinery) to a peer’s /_internal/replicate/artifact returns 204 in ~4 s, so the receiver is fast.
- Kura’s own replication produced the body at ~4.6 MB/s, hit the 30 s timeout at ~125 MB, discarded the partial, and retried ~every 30 s indefinitely (eu/ap stayed 404). No OOM, no 413: purely the timeout.
Potential solutions
A fixed total request budget (30 s) is fighting both the artifact size and the throughput limiters, so bumping the number only moves the cliff. The better fixes attack throughput and the timeout’s semantics. The three highest-leverage changes are ⭐1, ⭐2 and ⭐4.
⭐ 1 — Make the peer timeout idle/progress-based, not a fixed total.
Replace the client-level .timeout(30s) (kura/src/peer_tls.rs:76-77) with a timeout that resets while bytes flow, aborting only a genuinely stalled transfer. Any artifact size then completes as long as it makes progress, with no unbounded hang. This changes the timeout’s semantics, not its magnitude — it removes the size cliff entirely rather than relocating it.
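One way to picture the changed semantics is a deadline that moves forward whenever bytes are transferred. The sketch below is a standalone illustration using only the standard library. `IdleTimeout` is a hypothetical type, and how it would be wired into the peer client's upload path is left open here.

```rust
use std::time::{Duration, Instant};

/// Tracks the last moment a transfer made progress and reports a stall only
/// after `idle` has elapsed with no bytes moved. Hypothetical helper.
struct IdleTimeout {
    idle: Duration,
    last_progress: Instant,
}

impl IdleTimeout {
    fn new(idle: Duration, now: Instant) -> Self {
        Self { idle, last_progress: now }
    }

    /// Call after each successful read or write; zero-byte events do not count.
    fn on_bytes(&mut self, n: usize, now: Instant) {
        if n > 0 {
            self.last_progress = now;
        }
    }

    /// True once no progress has been recorded for at least `idle`.
    fn is_stalled(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_progress) >= self.idle
    }
}
```

With a 30 s idle window, a transfer that moved bytes at t = 25 s is still healthy at t = 50 s and is declared stalled only at t = 55 s. Total duration is unbounded as long as progress continues, so a minimum-progress rate may be worth pairing with it to guard against a peer that trickles one byte at a time.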
⭐ 2 — Don’t let rate-shaping starve replication into a livelock.
The adaptive limiter (kura/src/bandwidth.rs) divides the 512 MiB/s default by the public-latency-pressure divisor (up to 64× → ~8 MiB/s ⇒ only ~240 MiB in 30 s, less than one segment). Exempt replication from the divisor, floor the effective replication rate at “enough to move a segment within the budget”, or only shape replication under genuine sustained public contention. This is the specific knob that fails the e2e on CI: serving the large GET inflates the public-latency EWMA → divisor climbs → replication is throttled below the rate the 30 s timeout requires.
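The "floor" option could look roughly like the following. This is a sketch under the assumption that the limiter reduces to default ÷ divisor; `replication_rate` and its parameters are hypothetical and do not mirror the real signature in kura/src/bandwidth.rs.

```rust
const MIB: u64 = 1024 * 1024;

/// Shaped replication rate with a floor high enough to move one segment
/// within `budget_s`, never exceeding the configured default.
fn replication_rate(
    default_bytes_per_s: u64,
    divisor: u64,
    segment_bytes: u64,
    budget_s: u64,
) -> u64 {
    let shaped = default_bytes_per_s / divisor.max(1);
    let budget = budget_s.max(1);
    // Ceiling division so that floor * budget_s >= segment_bytes.
    let floor = (segment_bytes + budget - 1) / budget;
    shaped.max(floor).min(default_bytes_per_s)
}
```

With the 512 MiB/s default, a 64× divisor, a 512 MiB segment, and a 30 s budget, the result is about 17.9 MB/s instead of 8 MiB/s. That only covers the transfer-only floor for one segment; it leaves no headroom for receiver assembly, so a real floor would need a safety margin.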
⭐ 4 — Speed up the sender’s body production.
SegmentReader (kura/src/segment/reader.rs) streams 512 KB chunks with a single in-flight spawn_blocking read and no read-ahead, so throughput collapses under CPU contention (~4.6 MB/s observed). Use multiple in-flight reads / larger chunks, or mmap the segment region and stream from memory (the accelerated-serving path already mmaps segments). Higher MB/s directly shrinks the in-request transfer time.
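A minimal sketch of the read-ahead idea, using a standard-library thread and a bounded channel rather than the spawn_blocking machinery SegmentReader uses. `stream_with_read_ahead` is a hypothetical function; it only illustrates keeping up to `depth` chunks buffered ahead of the consumer.

```rust
use std::io::{self, Read};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Reads `src` on a background thread, keeping up to `depth` chunks queued
/// ahead of `sink`. Returns the total number of bytes delivered.
fn stream_with_read_ahead<R, F>(
    mut src: R,
    chunk_size: usize,
    depth: usize,
    mut sink: F,
) -> io::Result<u64>
where
    R: Read + Send + 'static,
    F: FnMut(&[u8]) -> io::Result<()>,
{
    let chunk_size = chunk_size.max(1);
    let (tx, rx) = sync_channel::<io::Result<Vec<u8>>>(depth);

    let producer = thread::spawn(move || loop {
        let mut buf = vec![0u8; chunk_size];
        match src.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => {
                buf.truncate(n);
                // send() blocks while the queue is full (back-pressure) and
                // fails once the consumer has gone away.
                if tx.send(Ok(buf)).is_err() {
                    break;
                }
            }
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => {
                let _ = tx.send(Err(e));
                break;
            }
        }
    });

    let mut total = 0u64;
    let mut outcome: io::Result<()> = Ok(());
    for item in rx.iter() {
        match item.and_then(|chunk| sink(&chunk).map(|()| chunk.len() as u64)) {
            Ok(n) => total += n,
            Err(e) => {
                outcome = Err(e);
                break;
            }
        }
    }
    drop(rx); // unblocks the producer if we stopped early
    producer
        .join()
        .map_err(|_| io::Error::new(io::ErrorKind::Other, "reader thread panicked"))?;
    outcome.map(|()| total)
}
```

The bounded channel caps memory at roughly `depth × chunk_size` while letting disk reads overlap with network writes. Whether this, larger chunks, or the mmap route is the better fit for Kura would need measuring against the ~4.6 MB/s baseline.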
Additional / longer-term:
- 3 — Take segment-assembly out of the request. The receiver writes body→temp and copies temp→segment before replying 204 (~3× the artifact in disk I/O; ~6 GiB for a 2 GiB artifact). fsync the temp, reply 204, then assemble asynchronously (serving from the staged file until assembled) → in-request work drops to ~transfer + one durable write.
- 5 — Resumable / range-based replication. Retries continue instead of restarting from byte 0 — the most robust option for huge artifacts and bad cross-region links, and it eliminates the wasted re-send. Largest change.
- Verify the peer client’s transport flow-control window vs. inter-region RTT (a single stream’s throughput is window ÷ RTT), mirroring #11266.
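For item 5, the sender-side bookkeeping might look like the sketch below. It assumes a hypothetical protocol in which the receiver reports how many bytes it has durably staged and the sender resumes with a standard HTTP Content-Range header. Neither that handshake nor the `ResumePlan` type exists in Kura today.

```rust
/// Byte range still to be sent after a partial transfer. Hypothetical type.
#[derive(Debug, PartialEq)]
struct ResumePlan {
    start: u64,
    end_inclusive: u64,
    total: u64,
}

impl ResumePlan {
    /// Value for an HTTP `Content-Range` header, e.g. "bytes 100-199/200".
    fn content_range(&self) -> String {
        format!("bytes {}-{}/{}", self.start, self.end_inclusive, self.total)
    }
}

/// Plans the next attempt given how many bytes the receiver acknowledged.
/// Returns `None` when nothing is left to send (or the ack is out of range).
fn plan_resume(total: u64, acked: u64) -> Option<ResumePlan> {
    if total == 0 || acked >= total {
        return None;
    }
    Some(ResumePlan { start: acked, end_inclusive: total - 1, total })
}
```

Applied to the reproduction above, a 520 MiB artifact that timed out with 125,000,000 bytes acknowledged would resume with `bytes 125000000-545259519/545259520` instead of restarting from byte 0.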
Interim mitigation (applied)
#11296 disables the replication bandwidth limiter for the large-artifact e2e suite only (KURA_REPLICATION_BANDWIDTH_LIMIT_BYTES_PER_SECOND=0, via a compose override), so the test exercises replication correctness without the rate-shaping that triggers this bug. It does not change production behavior — this issue stays open for the real fix (⭐1 + ⭐2 + ⭐4).
Related
- #11266 — gateway HTTP/2 upload windows (same window-vs-RTT class, public gRPC path)
- #11296 — large-artifact upload/replication e2e (exposes this; limiter disabled there as the interim unblock above)