fix(kura): optimize artifact response streaming

GitHub issue · Closed

Metadata

Source

tuist/tuist #11124

Updated

Jun 24, 2026

Domains

Kura

Details

What changed

Added a same-port Linux accelerator for eligible public plaintext HTTP/1 artifact downloads.
The existing KURA_PORT listener peeks request headers with httparse before consuming bytes. Only known artifact GET routes that match the tenant, resolve to a local file-backed artifact, and pass extension access checks use the accelerated path.
The accelerator uses bounded blocking transfer workers with splice by default and sendfile available through KURA_ACCELERATED_FILE_SERVING_MODE.
Accelerated responses now keep HTTP/1.1 connections alive when safe. The accelerator only closes the socket when the client asks for it or when the request carries an unconsumed body. A reused connection can continue through the accelerator, and a later non-accelerable request is handed back to Axum/Hyper without consuming bytes.
Saturated accelerator capacity falls back to the normal Axum/Hyper path before request bytes are consumed, so the fast path is bounded without turning excess downloads into 503s.
Extension access checks on the accelerated path now run only after Kura knows the request can be served by the accelerator. Missing artifacts, inline artifacts, and file-open misses fall back to Axum without double-running extension auth.
Fallback HTTP artifact streams now keep public inflight accounting alive until the response body drops, matching the gRPC body-lifetime accounting model.
Replication bandwidth prioritization is now active by default with a 512 MiB/s per-node peer artifact body ceiling. The adaptive limiter backs replication off under public inflight load or elevated public request latency.
HTTPS, HTTP/2, non-GET requests, inline artifacts, unsupported routes, cold misses, non-Linux builds, and incomplete classifications keep using the existing Axum/Hyper path.
Kept the prior Kura hot-path optimizations in this PR: 1 MiB hot mmap chunks, concrete artifact readers, shared segment/blob handle caching, 512 KiB cold reader chunks, h2 tuning, multipart streaming, artifact egress metrics, and the reconciled bandwidth-prioritization work from #11115.
Added responsiveness telemetry so public latency pressure is sampled when a response is ready to start streaming rather than after a large body finishes. Large healthy downloads are still captured through inflight accounting, but they do not over-throttle replication by inflating latency.
Added Helm values and env wiring for the accelerator without adding any Service or container port.
Updated the Kura README and architecture docs to describe the same-port serving model and fallback behavior.
Removed the unrelated server/mix.lock dependency bump from the branch.
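
The peek-then-classify step in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it uses only the standard library instead of httparse, the `/artifacts/` prefix and the `Decision`, `peek_head`, and `classify` names are hypothetical, and tenant matching, artifact resolution, and extension access checks are left out. The point it shows is that `TcpStream::peek` copies bytes without removing them from the socket, so a fallback handler still sees the whole request.

```rust
use std::io;
use std::net::TcpStream;

/// Outcome of inspecting a peeked request head (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Decision {
    /// Looks like an artifact GET the fast path could try to serve.
    Accelerate { path: String, keep_alive: bool },
    /// Everything else is left untouched for the regular router.
    Fallback,
}

/// Hypothetical route prefix, used only for this illustration.
const ARTIFACT_PREFIX: &str = "/artifacts/";

/// Copies buffered bytes without consuming them from the socket.
pub fn peek_head(stream: &TcpStream, buf: &mut [u8]) -> io::Result<usize> {
    stream.peek(buf)
}

/// Classifies a peeked head. Incomplete, non-UTF-8 (for example TLS),
/// non-GET, and non-HTTP/1.1 heads all fall back.
pub fn classify(head: &[u8]) -> Decision {
    let Ok(text) = std::str::from_utf8(head) else {
        return Decision::Fallback;
    };
    // Require the full header block before deciding anything.
    let Some(end) = text.find("\r\n\r\n") else {
        return Decision::Fallback;
    };
    let mut lines = text[..end].split("\r\n");
    let mut parts = lines.next().unwrap_or("").split(' ');
    let (method, path, version) = (parts.next(), parts.next(), parts.next());
    if method != Some("GET") || version != Some("HTTP/1.1") {
        return Decision::Fallback;
    }
    let Some(path) = path.filter(|p| p.starts_with(ARTIFACT_PREFIX)) else {
        return Decision::Fallback;
    };
    // Keep the connection alive unless the client asks to close it or the
    // request declares a body that would be left unread on the socket.
    let mut keep_alive = true;
    for line in lines {
        let Some((name, value)) = line.split_once(':') else { continue };
        let value = value.trim().to_ascii_lowercase();
        match name.trim().to_ascii_lowercase().as_str() {
            "content-length" if value != "0" => keep_alive = false,
            "transfer-encoding" => keep_alive = false,
            "connection" if value.split(',').any(|t| t.trim() == "close") => keep_alive = false,
            _ => {}
        }
    }
    Decision::Accelerate { path: path.to_string(), keep_alive }
}
```

In this sketch a caller would peek into a fixed-size buffer, pass the filled prefix to `classify`, and only start reading from the socket once the result is `Accelerate` and the later checks pass.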

This PR intentionally does not include the earlier CLI/client-side streaming changes.

Why

Large artifact downloads can time out against Kura in customer environments. The old cache nodes get nginx X-Accel-Redirect style kernel file serving, while Kura previously served artifact bodies through Axum/Hyper only. The earlier mmap and chunking work reduced Hyper overhead, but it still could not fully exploit nginx-style Linux file transfer primitives.

The selected solution keeps one public port and one deployment shape. It only takes over requests after conservative classification, file-backed artifact resolution, and authorization, and it falls back to Hyper while the stream is still untouched whenever the fast path is unsafe or at capacity. That gives us nginx-like kernel transfer for the narrow HTTP/1 plaintext artifact case without duplicating the whole application router or changing TLS/h2 behavior.
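
The "fall back when at capacity" behavior can be illustrated with a non-blocking permit counter. This is a sketch under assumptions: `TransferSlots` and `Permit` are hypothetical names, and the real accelerator may bound its workers differently. What it shows is the shape of the decision: acquiring a slot either succeeds immediately or returns `None`, in which case the caller hands the untouched connection to the normal router instead of queueing or answering with a 503.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Fixed pool of accelerated-transfer slots (hypothetical type).
pub struct TransferSlots {
    in_use: AtomicUsize,
    max: usize,
}

/// Held for the duration of one accelerated transfer; frees the slot on drop.
pub struct Permit {
    slots: Arc<TransferSlots>,
}

impl TransferSlots {
    pub fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { in_use: AtomicUsize::new(0), max })
    }

    /// Never blocks: returns `None` when every slot is taken so the caller
    /// can fall back before consuming any request bytes.
    pub fn try_acquire(self: &Arc<Self>) -> Option<Permit> {
        let max = self.max;
        self.in_use
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                (n < max).then_some(n + 1)
            })
            .ok()
            .map(|_| Permit { slots: Arc::clone(self) })
    }

    pub fn in_use(&self) -> usize {
        self.in_use.load(Ordering::Acquire)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.slots.in_use.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Tying the release to `Drop` means a transfer that ends early, for example because the client disconnects, still returns its slot.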

Safety model

Same public port: no KURA_ACCELERATED_FILE_SERVING_PORT, no extra Service port, no new production testing path.
Fallback before side effects: parse misses, route misses, tenant mismatches, artifact misses, inline artifacts, file-open misses, non-Linux runtime, h2/TLS requests, saturated accelerator capacity, and non-accelerable reused requests go through the existing router before the accelerator consumes headers.
Authorization before success: extension access checks still run before the accelerator consumes headers or writes a success response for requests it can serve itself.
Body-lifetime public accounting: fallback HTTP artifact downloads hold public inflight capacity until the response body drops, so adaptive replication backoff sees long-running HTTPS/h2/fallback transfers.
Bounded resources: KURA_ACCELERATED_FILE_SERVING_MAX_CONCURRENT bounds concurrent accelerated transfers, KURA_ACCELERATED_FILE_SERVING_CHUNK_BYTES bounds each kernel transfer call, and idle keep-alive connections have a bounded timeout.
Runtime kill switch: KURA_ACCELERATED_FILE_SERVING_ENABLED=false restores the previous public Axum server path on the same port.
Runtime mode switch: KURA_ACCELERATED_FILE_SERVING_MODE=splice|sendfile lets us choose the better Linux primitive without changing the deployment shape.
Replication remains bounded: peer artifact body traffic is capped at 512 MiB/s per node by default and adaptively divided under public load. Setting KURA_REPLICATION_BANDWIDTH_LIMIT_BYTES_PER_SECOND=0 disables that throttle explicitly.
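
To make the interaction between body-lifetime accounting and replication backoff concrete, here is a small sketch. All names (`PublicInflight`, `InflightGuard`, `CountedBody`, `replication_budget`) are hypothetical, the body is modeled as a plain iterator rather than an async stream, and the `ceiling / (1 + inflight)` divisor is an assumed policy chosen for illustration; the actual limiter also reacts to public request latency. The part taken from the description is that a ceiling of 0 disables the throttle and that the guard must live as long as the body.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Shared count of public responses that are still streaming.
#[derive(Default)]
pub struct PublicInflight {
    count: AtomicU64,
}

/// Decrements the shared count when dropped.
pub struct InflightGuard {
    owner: Arc<PublicInflight>,
}

impl PublicInflight {
    pub fn enter(self: &Arc<Self>) -> InflightGuard {
        self.count.fetch_add(1, Ordering::AcqRel);
        InflightGuard { owner: Arc::clone(self) }
    }

    pub fn current(&self) -> u64 {
        self.count.load(Ordering::Acquire)
    }
}

impl Drop for InflightGuard {
    fn drop(&mut self) {
        self.owner.count.fetch_sub(1, Ordering::AcqRel);
    }
}

/// Response body that owns the guard, so the request stays counted until
/// the body itself is dropped, not merely until the handler returns.
pub struct CountedBody<I> {
    chunks: I,
    _guard: InflightGuard,
}

impl<I> CountedBody<I> {
    pub fn new(guard: InflightGuard, chunks: I) -> Self {
        Self { chunks, _guard: guard }
    }
}

impl<I: Iterator<Item = Vec<u8>>> Iterator for CountedBody<I> {
    type Item = Vec<u8>;
    fn next(&mut self) -> Option<Vec<u8>> {
        self.chunks.next()
    }
}

/// Assumed policy: split the ceiling across replication plus each active
/// public download. A ceiling of 0 means "no throttle" and yields `None`.
pub fn replication_budget(ceiling_bytes_per_sec: u64, public_inflight: u64) -> Option<u64> {
    if ceiling_bytes_per_sec == 0 {
        return None;
    }
    Some(ceiling_bytes_per_sec / (1 + public_inflight))
}
```

If the guard were dropped when the handler returned the response, `current()` would read 0 while a large download was still streaming, and the budget would stay at the full ceiling.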

Infrastructure check

The repo-managed Kura public path is not ingress-nginx:

Public Kura Services are Hetzner LoadBalancer Services in TCP passthrough mode.
TLS terminates in the Kura pod.
The Service uses externalTrafficPolicy=Local.
The public Service selector is pinned to one primary pod per region for read-your-writes consistency while replication is async.
There is no default kubernetes.io/ingress-bandwidth pod annotation, and tests assert that it stays absent.

So I did not find a repo-level bandwidth cap or nginx buffering layer in front of Kura. The important production implication is that public read throughput is intentionally limited to one primary pod per region, not all replicas. Fanning the public LB across every pod would trade consistency for throughput unless reads are reliably routed to a pod that has the write.

gRPC

For the Tuist module-cache endpoint, gRPC would be a protocol migration rather than a serving optimization. It still runs over HTTP/2, and protobuf bytes responses require copying each chunk into a Vec<u8>. It is useful for Bazel REAPI ByteStream clients, but it is not a shortcut to nginx-style sendfile for current module artifact downloads.

Validation

Latest validation for Marek’s review follow-up:

git diff --check
PATH="$HOME/.cargo/bin:$PATH" cargo fmt --manifest-path kura/Cargo.toml
PATH="$HOME/.cargo/bin:$PATH" cargo test --manifest-path kura/Cargo.toml --lib
PATH="$HOME/.cargo/bin:$PATH" cargo clippy --manifest-path kura/Cargo.toml --all-targets -- -D warnings

Previous validation on the branch also covered:

mise x rust@1.94.1 -- cargo fmt --manifest-path kura/Cargo.toml --check
mise x rust@1.94.1 -- cargo test --manifest-path kura/Cargo.toml
mise x rust@1.94.1 -- cargo clippy --manifest-path kura/Cargo.toml --all-targets -- -D warnings
helm template kura kura/ops/helm/kura >/tmp/kura-helm-render.yaml
docker build -f kura/Dockerfile -t tuist-kura-bench:current kura

Helm render confirmed only the existing ports are exposed:

containerPort: 4000
containerPort: 50051
containerPort: 7443

Current Docker benchmark

Latest local Docker benchmark used image tuist-kura-bench:current, built from this branch after the keep-alive improvements. It measured three artifact sizes through the same public KURA_PORT:

4 KiB Xcode CAS artifact for small-object behavior.
17,941,067-byte Xcode CAS artifact matching the reported timeout payload size.
64 MiB Gradle artifact for large-object behavior. The Xcode CAS route caps uploads at 25 MiB, so the large case uses Gradle, which allows up to 100 MiB.

artifact	route	concurrency	mean
4 KiB	Xcode CAS	1	7.6 ms +/- 0.7
4 KiB	Xcode CAS	8	25.8 ms +/- 3.7
4 KiB	Xcode CAS	20	45.9 ms +/- 2.9
17,941,067 B	Xcode CAS	1	13.7 ms +/- 2.2
17,941,067 B	Xcode CAS	8	37.2 ms +/- 3.3
17,941,067 B	Xcode CAS	20	85.3 ms +/- 2.8
64 MiB	Gradle	1	16.0 ms +/- 3.0
64 MiB	Gradle	8	78.7 ms +/- 4.3
64 MiB	Gradle	20	184.1 ms +/- 3.7
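
For reading the concurrency-1 rows as throughput, a small helper like the one below may be useful. It assumes the reported mean is the wall-clock time of a single complete download, which the table does not state explicitly; `mib_per_second` is a hypothetical name.

```rust
/// Converts a payload size and a wall-clock duration into MiB/s.
pub fn mib_per_second(total_bytes: u64, seconds: f64) -> f64 {
    (total_bytes as f64 / (1024.0 * 1024.0)) / seconds
}
```

Under that assumption, 64 MiB in 16.0 ms works out to about 4,000 MiB/s and 17,941,067 B in 13.7 ms to about 1,249 MiB/s. Both are local Docker figures, not network throughput.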

Replication stress

I also ran a local three-node Kura mesh from the same Docker image with static peers and ring size 3.

Precheck:

Uploaded a 4 KiB Xcode CAS artifact to node A and verified it was readable from nodes B and C.
Uploaded a 64 MiB Gradle artifact to node A and verified it was readable from nodes B and C.
All nodes reported ready=true, ring_members=3, fd_timeout_count=0, and memory_pressure_state=0 before stress.

Stress workload per hyperfine run:

120 small reads across all three nodes.
45 large 64 MiB reads across all three nodes.
40 new 4 KiB writes into node A.
4 new 64 MiB Gradle writes into node A.
The script only exits after the final small and large writes are readable from both peers.

Result:

Mean: 6.976 s +/- 1.517 s across 3 runs.
All nodes stayed ready after the run.
fd_timeout_count=0 on all nodes.
memory_pressure_state=0 on all nodes.
Node A briefly reported outbox_messages=3 immediately after the run, then drained to 0 within 5 seconds. Nodes B and C stayed at 0.
Idle memory after the run was about 155 MiB on node A, 146 MiB on node B, and 148 MiB on node C.

Local Docker cannot reproduce production NIC, LB, or real client h2 behavior, so this is a serving-path and local replication stress benchmark, not a replacement for an in-cluster read-only benchmark.

Follow-up benchmark needed

The next benchmark should be read-only and in-cluster, without changing cluster state:

One test through the public Kura host.
One test to the same Kura primary pod through a direct pod or node-local path.
Compare HTTP/1.1 and HTTP/2/TLS, and record whether affected clients negotiate h2.
Measure pod CPU, process RSS, network transmit, TCP retransmits, h2 stream concurrency, and artifact egress metrics.
Run the same artifact sizes and concurrency 1, 8, and 20 so the results line up with the local Docker table.

Comments

fortmarek Jun 8, 2026

Findings from review:

[P2] Extension auth runs twice for accelerator fallbacks. In authorize_and_open, the extension evaluate_access hook runs before the manifest/file-backed checks. If those checks return Fallback, serve_connection hands the untouched request to Axum, where the extension middleware runs auth again. Missing artifacts, inline artifacts, or open failures can therefore double audit/rate-limit side effects or even get a different final decision. I would move auth after the “can actually accelerate” checks, or pass the auth result into fallback.
[P2] Fallback HTTP downloads are not counted as public inflight while streaming. track_http_metrics drops its guard when it returns the Response, but artifact bodies keep streaming afterward via InstrumentedArtifactStream, which does not hold a guard. That means HTTPS/HTTP2 or other fallback large downloads can be active while public_inflight() is already zero, so replication bandwidth adaptation will not back off for exactly those public reads. The gRPC layer fixed this by tying the guard to body drop; fallback HTTP needs the same treatment.
[P2] Public-over-replication prioritization is wired but disabled by default. The Helm value sets config.replication.bandwidthLimitBytesPerSecond: 0, and the config default is also 0. With that value, BandwidthLimiter::new(...) returns None, so every replication throttling call site skips acquire(...). That also means publicLatencyTargetMs: 100 is currently inert. If the goal is to ensure public traffic is preferred over node-to-node replication, we need to ship a positive replication bandwidth cap, and ideally fix fallback HTTP inflight accounting first so the adaptive divisor sees all public downloads.
[P3] Unrelated server lockfile change. server/mix.lock bumps deep_merge in a Kura-only PR. Unless intentional, I would remove it to avoid changing server dependency resolution in this fix.

pepicrft Jun 8, 2026

Addressed Marek’s review feedback in 44e043e92c:

Accelerator fallback now happens before extension auth for artifact misses, inline artifacts, and file-open misses, so those requests only run auth in the Axum path.
Fallback HTTP artifact bodies now hold public inflight accounting until the body drops.
Replication bandwidth prioritization is active by default with a 512 MiB/s per-node peer artifact cap.
Removed the unrelated server/mix.lock bump from the branch.

Validation:

git diff --check
PATH="$HOME/.cargo/bin:$PATH" cargo fmt --manifest-path kura/Cargo.toml
PATH="$HOME/.cargo/bin:$PATH" cargo test --manifest-path kura/Cargo.toml --lib
PATH="$HOME/.cargo/bin:$PATH" cargo clippy --manifest-path kura/Cargo.toml --all-targets -- -D warnings

tuist-atlas[bot] Jun 9, 2026

The artifact response streaming optimization from this pull request is now available in kura@0.7.2. Update to this version to use the optimized artifact streaming path.

Docker image: ghcr.io/tuist/kura:0.7.2