fix(kura): recover the wedged replication mesh and make fresh-node bootstrap fast (read timeout + parallel fetch)

GitHub issue · Closed

Open on GitHub

Metadata

Source

tuist/tuist #11464

Updated

Jun 24, 2026

Domains

Kura

Details

What

The tuist Kura replication mesh wedged: every tuist node sat 0/1 not-ready for hours, bootstrap stuck at “N artifact(s) failed this pass, 0 applied”, peers logging failed to stream bootstrap body: error decoding response body with no error on the serve side. That took down the macOS CI cache (the runner cache) and the cross-region tuist mesh, while customer single-region instances stayed healthy because they were idle and never re-bootstrapping.

This PR carries the two source fixes, a defensive hardening guard, and the standalone image-build workflow that let the branch be rolled to production for validation before merge.

Bug 1: a total request timeout cut streaming bootstrap bodies mid-transfer

The inter-peer mTLS client (peer_tls.rs) used a hard 30s total request timeout. Bootstrap streams an artifact’s entire body over this client. Under cold-start load, when all mesh nodes re-bootstrap at once and the path is bandwidth-limited and congested, a large artifact’s transfer exceeds 30s and is aborted mid-body. The requesting peer sees an incomplete, undecodable response (error decoding response body); the serve side just sees the connection drop, so it logs nothing. Because bootstrap marks a peer done only on a fully clean pass, those few large artifacts that cannot transfer in 30s block every node from ever becoming ready.

Fix: switch the peer client from .timeout(30s) (total) to .read_timeout(30s) (idle, resets on each chunk), keeping the 5s connect timeout. A slow but progressing transfer now completes; a genuinely stalled connection still fails fast. The bootstrap loop’s own 30-minute cap remains the backstop. The peer request and stream errors now log the source chain ({error:?}) so the nested transport cause is visible instead of only reqwest’s top-level Display.

Bug 2: serial bootstrap fetch could not replicate a full dataset within the timeout

Once the mesh was recoverable, a second problem surfaced: a fresh or rebuilt regional node has to bootstrap the entire dataset from a peer, and bootstrap_manifests_from_peer fetched artifact bodies strictly serially (one GET, await the whole body, apply, next). Over the WAN that leaves the link idle between every request, so a large cache cannot finish a bootstrap pass inside the 30-minute KURA_BOOTSTRAP_TIMEOUT_MS. The pass is cancelled (bootstrap timed out after 1800000 ms, a mid-transfer cancel rather than “0 applied”) and retried from where it left off, so the node converges only after many 30-minute passes: hours for a large cache, which is why a destroyed-and-rebuilt regional node sat Pending indefinitely.

The ceiling was confirmed to be the serial fetch, not bandwidth or apply: the replication bandwidth limiter was at its 512 MiB/s default (verified in prod via kura_replication_bandwidth_configured_limit_bytes_per_second, far above the link), and apply writes to a local SSD segment behind segment_write_lock.

Fix: fetch up to BOOTSTRAP_ARTIFACT_FETCH_CONCURRENCY (16) artifact bodies per page concurrently with buffer_unordered. This reuses machinery the path already had: bootstrap_staging_budget reserves bytes per fetch and back-pressures when concurrent bodies would exceed the tmp budget, and the segment-append lock still serializes the on-disk write, so only the WAN transfers overlap. The cheap local already-have pre-check runs first so a real store error still aborts the pass rather than being masked as a per-artifact fetch failure, and the partial-failure / clean-pass-serves semantics are unchanged: a failed body still increments the failed count, a non-clean pass is still retried, and already-applied artifacts are skipped on retry.

Hardening (investigated, not the cause here)

A guard for the documented “segments are append-only / never truncated” invariant: open_manifest_reader_with_range now verifies the backing file actually covers the artifact before serving (a truncated file returns Err, becomes a 404, and the bootstrap client skips it as IgnoredStale rather than streaming a silent short body), and SegmentReader errors loudly on EOF-before-remaining instead of ending silently. This was a candidate during the incident; it turned out not to be the cause (the guard logged zero times in prod, segments were intact), but it is a correct hardening: a short body must never be served as complete.

What it took to get here

This took several wrong turns before the cause was nailed, worth recording so the next incident skips them:

0.10.10 / mmap residency gate (#11431): deployed plus a full mesh restart, no effect.
Segment truncation: built and deployed the guard above, the guard logged zero times across all pods, so segments were not truncated.
kura-tuist-peers headless Service deadlock: disproven, the Service has publishNotReadyAddresses: true with all pods in its endpoints and each peer’s mTLS listener up.

The actual operational unblock during the incident was removing the failing peers (the regional servers were destroyed), which let the runner cache recover on the old image. The fixes in this PR are what prevent the deadlock from recurring on a full-mesh cold-start and what make a fresh node’s bootstrap fast.

Impact

The runner cache and cross-region mesh recover from a full-mesh cold-start instead of deadlocking.
A fresh or rebuilt regional node replicates the full cache in minutes instead of hours, or never when the dataset exceeded the 30-minute cap.

Validation

cargo check plus cargo test --lib clean (264 tests, including the bootstrap partial-failure, failpoint, and concurrent-staging-budget cases), clippy and fmt clean.
Built the ad-hoc image ghcr.io/tuist/kura:sha-866c922579a2 via the standalone Kura Runtime Image workflow (build only, no staging deploy).
Rolled kura-only to production (server pinned, kura image swapped) and watched it converge:
- 09:16:27 the runner cache and a regional pod restart onto the fix image (briefly not-ready)
- 09:17:16 the runner cache is Ready again, under a minute of not-ready, no deadlock
- 09:18:03 all 4 tuist kura pods Ready on the fix image, deploy success
A freshly rebuilt regional instance had independently completed its bootstrap on the prior image over roughly 1.5 hours, confirming the bootstrap does converge given enough time and that the only remaining gap was speed, which is what Bug 2’s fix closes. It retained its volume across the rolling image update, so it re-readied on a delta bootstrap.

Follow-ups

Production is pinned to the ad-hoc sha-866c922579a2. After this merges, the push-to-main pipeline cuts a kura@ release and the next regular deploy moves prod off the ad-hoc tag onto it.
#11461 (server-side readiness fallback to the public cache) ships next as the graceful-degradation net, now that the source fix is in prod.
The bundled Kura Runtime Image workflow matches #11294’s except for a one-line mise version bump to the repo standard 2026.5.15 (#11294 was still on the stale 2026.4.18, and should pick up the same bump). They reconcile with a trivial one-line conflict at most.

🤖 Generated with Claude Code

Comments

No GitHub comments yet.