Hive
fix(kura): let a degraded mesh recover instead of deadlocking or wedging out-of-space
GitHub issue · Closed
Draft — captures the fixes for the failure modes surfaced while recovering a degraded mesh live. Not required to fix anything live; once released, a degraded/restarted mesh recovers on its own instead of getting stuck.
Two compounding failure modes turned a recoverable mesh into a stuck one when every replica was restarted at once (e.g. cycling the whole StatefulSet onto a new image). This PR fixes both.
1. Bootstrap deadlock — abort on first failed artifact
bootstrap_manifests_from_peer did bootstrap_artifact_from_peer(...).await...? — a single artifact error aborted the whole peer bootstrap, stranding every artifact after it and never marking the peer bootstrapped, so the node never reached serving.
Harmless with healthy peers; deadlocks a mesh bootstrapping from itself: when every replica restarts together, each serves still-incomplete data while bootstrapping, so every bootstrap breaks at its first gap, none makes progress, and the mesh sits at 0/N ready with a flat stream of error decoding response body.
Fix: record per-artifact failures, keep going so each pass applies everything it can, and return an error only at the end so the peer is retried (already-applied artifacts are skipped). Successive passes make forward progress and converge as data propagates from the data-bearing replicas. A peer is marked bootstrapped — node allowed to serve — only on a fully clean pass, so readiness still implies complete data.
2. Out-of-space wedge — leaked staging fills the disk
Each failed transfer leaked its staging file: stream_response_to_temp removed the temp file on the size-overrun path but returned via ? on stream/write errors, leaving the partial file. A retrying bootstrap (the deadlock above produced these en masse) accumulated one file per attempt until the data disk filled and RocksDB failed to open with No space left on device — crashlooping the pod with no way out.
Fixes:
stream_response_to_tempremoves the partial file on any error, not just overrun.ensure_directoriesclears transient staging (tmp_dir’suploads/parts/bootstrap) on startup, beforeStore::open— everything there is dead after a restart, so a pod whose disk already filled with stale staging reclaims space and reopens RocksDB instead of staying wedged. Data dirs are untouched.
Observed in production
After recovering a degraded mesh by cycling all replicas onto 0.10.7, every pod came up Running — no crashloops, no version skew — but stayed 0/9 ready for 10+ min with a non-decreasing error decoding response body rate. Separately, the replicas that had been failing for days had filled their data PVCs with leaked staging and crashlooped on No space left. Both are the failure modes above.
Validation
bootstrap_continues_past_a_failing_artifact_and_reports_partial— a sibling artifact 500s mid-flight; asserts the bootstrap surfaces failure (for retry) and the healthy artifact is still applied rather than stranded.ensure_directories_creates_expected_layoutextended — seeds stale staging + a real data file; asserts the staging is reclaimed and the data preserved.cargo test --lib replication— 18 passed;cargo clippy --lib --tests+cargo fmtclean.
Deliberately not in this PR
- Serving-side 404 for not-yet-complete artifacts. A peer could return
404(client already treats it asIgnoredStaleand skips cleanly) for an artifact whose body it doesn’t yet hold, turning the noisy partial-stream failure into a clean skip. Lower value now that the client side both converges (fix 1) and no longer leaks (fix 2), and it needs care around the serving handler’s completeness semantics — left as a follow-up. - node-exporter
priorityClassName: system-node-critical(the DiskPressure-eviction wedge that blocked deploys) — infra/helm, separate area; the values key needs render-validation against the grafana/k8s-monitoring v4 chart with deps built.
No GitHub comments yet.