Hive Hive
Sign in

fix(kura): harden segment disk accounting at rotation and startup

GitHub issue · Closed

Metadata
Source
tuist/tuist #11246
Updated
Jun 24, 2026
Domains
Kura
Details

Two disk-safety hardening fixes for the CAS segment ring, surfaced while stress-testing the margin assumptions behind #11222 (the disk-derived capacity work). Both matter more as that PR ships: larger rings mean far more rotations, so more chances to crash inside the rotation window and more traffic through the free-space guard.

What changed

  1. The rotation free-space check now scales with the incoming artifact. Previously it demanded a fixed MAX_SEGMENT_BYTES * SEGMENT_FREE_SPACE_MARGIN (1 GiB) no matter what was about to be written. The required bytes are now max(MAX_SEGMENT_BYTES, incoming_size) * SEGMENT_FREE_SPACE_MARGIN. Behavior is bit-for-bit identical for artifacts ≤ 512 MiB.
  2. Startup sweep for orphaned segment files. Store::sweep_orphaned_segments runs once at startup (after Store::open, before any traffic): it lists data_dir/segments/, and every .seg file not referenced by the persisted ring state is routed through the existing evict_segment path. A failed sweep logs a warning and does not block startup.

Root causes

Oversized artifacts under-provisioned the guard. Rotation triggers when current_size + incoming > MAX_SEGMENT_BYTES, and then the whole incoming artifact is appended into the fresh segment — segment files can legitimately exceed 512 MiB (e.g. a 2 GiB replication body, MAX_REPLICATION_BODY_BYTES). The rotation check is the only free-space checkpoint (appends between rotations are unchecked), so a 2 GiB append could pass a 1 GiB guard and hit raw ENOSPC mid-append — made worse by the fact that during append_to_segment the artifact’s bytes transiently exist twice when the tmp dir shares the filesystem (the staged source is deleted only after the copy completes). The fix keeps the same 2× proportionality the margin has always encoded: one width for the new segment’s guaranteed growth, one width of slack for unchecked co-writers.

Crash-window orphans had no reclamation path. Rotation is deliberately ordered for crash consistency: persist ring state without the evicted segment → append → commit metadata → unlink the evicted file. The safe direction of that ordering means a crash (or an error return) between the state save and the unlink strands the segment file — and the manifests and index entries of every artifact inside it — with nothing referencing them and no code path that will ever clean them up. Each such event leaks up to a segment’s worth of disk permanently. The repo had no startup sweep of the segments directory at all.

Why this shape

  • The sweep reuses evict_segment rather than just unlinking files: that path already (atomically, with the tested batch logic) deletes the stranded manifests, the namespace and segment index entries, and the cached FD handle — so the crash-window case cleans up metadata-consistently, never leaving manifests pointing at deleted bytes.
  • It is safe by construction: it runs under the DataDirLock writer lock (acquired before Store::open in app.rs), before any rotation can run, so it cannot race the legitimate “state references a segment whose file does not exist yet” window of a fresh rotation. The inverse direction (state entry without a file) is untouched, as is everything outside segments/ (blob files, RocksDB).
  • The required-bytes computation is extracted as a pure function so the policy is unit-testable.

Rollout safety

Node-local only: no on-disk format, wire format, or replication protocol changes. Segment files are still reclaimed exclusively by unlink (the mmap SIGBUS invariant in kura/AGENTS.md is upheld). Mixed-version meshes are unaffected. For existing deployments the sweep is a no-op unless a node has accumulated orphans, in which case it frees disk on the next restart.

Validation

  • New unit tests:
    • segment_rotation_requires_margin_for_oversized_artifacts — fixed floor for small/equal sizes, scaled requirement for oversized ones.
    • sweep_orphaned_segments_returns_zero_without_segments_dir — first-boot no-op.
    • sweep_orphaned_segments_removes_stray_files_and_keeps_live_segments — stray file deleted, live segment and its artifact untouched and still readable.
    • sweep_orphaned_segments_reclaims_crash_window_segment_and_metadata — simulates the exact crash window (ring state saved without the segment, file still on disk) and asserts both the file and the stranded manifest are reclaimed.
  • cargo test --lib — 238 passed, 0 failed; cargo clippy --all-targets -- -D warnings — clean; cargo fmt — no changes.

🤖 Generated with Claude Code

Comments
P
pepicrft Jun 12, 2026

One follow-up thought after sitting with this: it’s worth being clear that this PR makes the existing leaky design correct, but doesn’t restore the bounded-segment invariant the capacity model in #11222 leans on. MAX_SEGMENT_BYTES is still only a rotation trigger, not a per-write cap, so segments can legitimately be up to MAX_MODULE_TOTAL_BYTES / MAX_REPLICATION_BODY_BYTES. The ring is byte-unbounded and only count-bounded, and the new free-space formula is essentially a tax on that.

The sweep half I’d keep regardless of direction. Crash-window recovery is a separate concern (the rotation ordering here is already the safer of the two options, and a startup sweep is the right way to handle the inevitable leak). What’s worth revisiting is the free-space scaling and, behind it, the segment-size assumption.

Three ways to actually bound segments, in rough order of surgery:

  1. Raise MAX_SEGMENT_BYTES to match MAX_MODULE_TOTAL_BYTES (2 GiB). Rotation trigger == per-write cap == segment size. Simplest, no new code paths. Cost: coarser rotation, more tail-wasted bytes per segment, and the legacy 5-segment floor becomes 10 GiB minimum disk (today it’s 2.5 GiB). Probably fine for the deployment shape but worth checking against the smallest tier.
  2. Reject size > MAX_SEGMENT_BYTES at the API boundary and lower the advertised REAPI cap accordingly. Smallest patch surface, but a behavioral break for module uploads over 512 MiB.
  3. Dual pool: segment ring + standalone blobs with their own budget. ArtifactManifest.blob_path and every read path already handle the blob branch, so the read side is free. The work is on the write side (route size > MAX_SEGMENT_BYTES to a blob) plus a blob retention policy (LRU, size budget tied to the same disk-percent ceiling as the segment ring). Without that retention the spillover just relocates the unboundedness, so it’s the design choice that matters, not the wiring.

#1 is the cheapest “actually bounded” option; #3 is the most flexible. Either way, once segments are truly bounded the free-space check goes back to the original 1 GiB and segment_rotation_required_bytes can be deleted. Flagging for whoever picks up the followup — not blocking on this PR.