What changed
- Added a same-port Linux accelerator for eligible public plaintext HTTP/1 artifact downloads.
- The existing KURA_PORT listener peeks request headers with httparse before consuming bytes. Only known artifact GET routes that match the tenant, resolve to a local file-backed artifact, and pass extension access checks use the accelerated path.
- The accelerator uses bounded blocking transfer workers with splice by default and sendfile available through KURA_ACCELERATED_FILE_SERVING_MODE.
- Accelerated responses now keep HTTP/1.1 connections alive when safe. The accelerator only closes the socket when the client asks for it or when the request carries an unconsumed body. A reused connection can continue through the accelerator, and a later non-accelerable request is handed back to Axum/Hyper without consuming bytes.
- Saturated accelerator capacity falls back to the normal Axum/Hyper path before request bytes are consumed, so the fast path is bounded without turning excess downloads into 503s.
- Extension access checks on the accelerated path now run only after Kura knows the request can be served by the accelerator. Missing artifacts, inline artifacts, and file-open misses fall back to Axum without double-running extension auth.
- Fallback HTTP artifact streams now keep public inflight accounting alive until the response body drops, matching the gRPC body-lifetime accounting model.
- Replication bandwidth prioritization is now active by default with a 512 MiB/s per-node peer artifact body ceiling. The adaptive limiter backs replication off under public inflight load or elevated public request latency.
- HTTPS, HTTP/2, non-GET requests, inline artifacts, unsupported routes, cold misses, non-Linux builds, and incomplete classifications keep using the existing Axum/Hyper path.
- Kept the prior Kura hot-path optimizations in this PR: 1 MiB hot mmap chunks, concrete artifact readers, shared segment/blob handle caching, 512 KiB cold reader chunks, h2 tuning, multipart streaming, artifact egress metrics, and the reconciled bandwidth-prioritization work from #11115.
- Added responsiveness telemetry so public latency pressure is sampled when a response is ready to start streaming rather than after a large body finishes. Large healthy downloads are still captured through inflight accounting, but they do not over-throttle replication by inflating latency.
- Added Helm values and env wiring for the accelerator without adding any Service or container port.
- Updated the Kura README and architecture docs to describe the same-port serving model and fallback behavior.
- Removed the unrelated server/mix.lock dependency bump from the branch.
This PR intentionally does not include the earlier CLI/client-side streaming changes.
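The classification step described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it hand-parses the peeked bytes with the standard library instead of httparse, accepts only HTTP/1.1 for brevity, and uses a hypothetical `/artifacts/` prefix because the real artifact routes are not listed here. Tenant matching, artifact resolution, and extension access checks would still run after this step.

```rust
/// Result of inspecting a peeked (not yet consumed) request head.
#[derive(Debug, PartialEq, Eq)]
pub enum Classification {
    /// Worth trying on the accelerated path; `keep_alive` says whether the
    /// socket may be reused after the response.
    Candidate { keep_alive: bool },
    /// Hand the untouched stream to the regular router.
    Fallback,
}

/// Hypothetical route prefix, used only for this sketch.
const ARTIFACT_PREFIX: &str = "/artifacts/";

pub fn classify(head: &[u8]) -> Classification {
    // Incomplete classification: no end-of-headers marker yet, so fall back.
    let Some(end) = head.windows(4).position(|w| w == b"\r\n\r\n") else {
        return Classification::Fallback;
    };
    let Ok(text) = std::str::from_utf8(&head[..end]) else {
        return Classification::Fallback;
    };
    let mut lines = text.split("\r\n");
    let mut parts = lines.next().unwrap_or("").split(' ');
    let (method, path, version) = (parts.next(), parts.next(), parts.next());
    if method != Some("GET") || version != Some("HTTP/1.1") || parts.next().is_some() {
        return Classification::Fallback;
    }
    if !path.is_some_and(|p| p.starts_with(ARTIFACT_PREFIX)) {
        return Classification::Fallback;
    }
    let mut keep_alive = true;
    for line in lines {
        let Some((name, value)) = line.split_once(':') else {
            return Classification::Fallback;
        };
        let value = value.trim();
        if name.eq_ignore_ascii_case("connection") && value.eq_ignore_ascii_case("close") {
            keep_alive = false; // the client asked for the socket to be closed
        }
        let has_body = (name.eq_ignore_ascii_case("content-length") && value != "0")
            || name.eq_ignore_ascii_case("transfer-encoding");
        if has_body {
            keep_alive = false; // an unconsumed body makes reuse unsafe
        }
    }
    Classification::Candidate { keep_alive }
}
```

Because the function only reads a byte slice, it can run on data obtained with a peek on the socket, which is what lets every `Fallback` case reach Axum/Hyper with the stream intact.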
Why
Large artifact downloads can time out against Kura in customer environments. The old cache nodes get nginx X-Accel-Redirect style kernel file serving, while Kura previously served artifact bodies through Axum/Hyper only. The earlier mmap and chunking work reduced Hyper overhead, but it still could not fully exploit nginx-style Linux file transfer primitives.
The selected solution keeps one public port and one deployment shape. It only takes over requests after conservative classification, file-backed artifact resolution, and authorization, and it falls back to Hyper while the stream is still untouched whenever the fast path is unsafe or at capacity. That gives us nginx-like kernel transfer for the narrow HTTP/1 plaintext artifact case without duplicating the whole application router or changing TLS/h2 behavior.
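The "fall back when at capacity" behavior can be pictured as a non-blocking slot pool. The sketch below is an assumption about the shape of such a bound, not the implementation in this PR; `TransferSlots` and `Permit` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Bounded pool of accelerated-transfer slots (hypothetical type).
pub struct TransferSlots {
    in_use: AtomicUsize,
    max: usize,
}

/// RAII permit: the slot is released when the permit is dropped.
pub struct Permit {
    slots: Arc<TransferSlots>,
}

impl TransferSlots {
    pub fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { in_use: AtomicUsize::new(0), max })
    }

    /// Non-blocking acquire. `None` means the pool is saturated and the
    /// caller should hand the untouched stream to the fallback path.
    pub fn try_acquire(self: &Arc<Self>) -> Option<Permit> {
        self.in_use
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                (n < self.max).then_some(n + 1)
            })
            .ok()
            .map(|_| Permit { slots: Arc::clone(self) })
    }

    /// Number of slots currently held.
    pub fn in_use(&self) -> usize {
        self.in_use.load(Ordering::Acquire)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.slots.in_use.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Since `try_acquire` never waits, a saturated pool costs one atomic operation before the request continues through the normal router, which is how excess downloads avoid becoming 503s.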
Safety model
- Same public port: no KURA_ACCELERATED_FILE_SERVING_PORT, no extra Service port, no new production testing path.
- Fallback before side effects: parse misses, route misses, tenant mismatches, artifact misses, inline artifacts, file-open misses, non-Linux runtime, h2/TLS requests, saturated accelerator capacity, and non-accelerable reused requests go through the existing router before the accelerator consumes headers.
- Authorization before success: extension access checks still run before the accelerator consumes headers or writes a success response for requests it can serve itself.
- Body-lifetime public accounting: fallback HTTP artifact downloads hold public inflight capacity until the response body drops, so adaptive replication backoff sees long-running HTTPS/h2/fallback transfers.
- Bounded resources: KURA_ACCELERATED_FILE_SERVING_MAX_CONCURRENT bounds concurrent accelerated transfers, KURA_ACCELERATED_FILE_SERVING_CHUNK_BYTES bounds each kernel transfer call, and idle keep-alive connections have a bounded timeout.
- Runtime kill switch: KURA_ACCELERATED_FILE_SERVING_ENABLED=false restores the previous public Axum server path on the same port.
- Runtime mode switch: KURA_ACCELERATED_FILE_SERVING_MODE=splice|sendfile lets us choose the better Linux primitive without changing the deployment shape.
- Replication remains bounded: peer artifact body traffic is capped at 512 MiB/s per node by default and adaptively divided under public load. Setting KURA_REPLICATION_BANDWIDTH_LIMIT_BYTES_PER_SECOND=0 disables that throttle explicitly.
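For reviewers who want to see how these knobs fit together, here is a rough sketch of reading them into one struct. The environment variable names, the splice default, and the 512 MiB/s replication default come from this description. The `enabled = true`, `max_concurrent = 64`, and 1 MiB `chunk_bytes` defaults, along with the `AcceleratorConfig` and `load_config` names, are placeholders I made up for the example.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Mode {
    Splice,
    Sendfile,
}

#[derive(Debug, PartialEq, Eq)]
pub struct AcceleratorConfig {
    pub enabled: bool,
    pub mode: Mode,
    pub max_concurrent: usize,
    pub chunk_bytes: usize,
    /// `None` means the replication throttle is disabled.
    pub replication_limit_bytes_per_second: Option<u64>,
}

const MIB: u64 = 1024 * 1024;

/// Parse an optional raw value, falling back to `default` when it is unset.
fn parse_or<T: std::str::FromStr>(raw: Option<String>, name: &str, default: T) -> Result<T, String> {
    match raw {
        None => Ok(default),
        Some(v) => v.trim().parse().map_err(|_| format!("invalid {name}: {v:?}")),
    }
}

/// `get` abstracts the environment lookup so the function is easy to test.
pub fn load_config(get: impl Fn(&str) -> Option<String>) -> Result<AcceleratorConfig, String> {
    let mode = match get("KURA_ACCELERATED_FILE_SERVING_MODE").as_deref() {
        None | Some("splice") => Mode::Splice, // splice is the default primitive
        Some("sendfile") => Mode::Sendfile,
        Some(other) => return Err(format!("unknown mode: {other:?}")),
    };
    let limit: u64 = parse_or(
        get("KURA_REPLICATION_BANDWIDTH_LIMIT_BYTES_PER_SECOND"),
        "replication limit",
        512 * MIB,
    )?;
    Ok(AcceleratorConfig {
        // Placeholder defaults below; the real defaults are not stated here.
        enabled: parse_or(get("KURA_ACCELERATED_FILE_SERVING_ENABLED"), "enabled flag", true)?,
        mode,
        max_concurrent: parse_or(get("KURA_ACCELERATED_FILE_SERVING_MAX_CONCURRENT"), "max concurrent", 64)?,
        chunk_bytes: parse_or(get("KURA_ACCELERATED_FILE_SERVING_CHUNK_BYTES"), "chunk bytes", 1 << 20)?,
        // A limit of 0 disables the throttle explicitly.
        replication_limit_bytes_per_second: (limit != 0).then_some(limit),
    })
}
```

In a binary this would be called as `load_config(|key| std::env::var(key).ok())`.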
Infrastructure check
The repo-managed Kura public path is not ingress-nginx:
- Public Kura Services are Hetzner LoadBalancer Services in TCP passthrough mode.
- TLS terminates in the Kura pod.
- The Service uses externalTrafficPolicy=Local.
- The public Service selector is pinned to one primary pod per region for read-your-writes consistency while replication is async.
- There is no default kubernetes.io/ingress-bandwidth pod annotation, and tests assert that it stays absent.
So I did not find a repo-level bandwidth cap or nginx buffering layer in front of Kura. The important production implication is that public read throughput is intentionally one primary pod per region, not all replicas. Fanning the public LB across every pod would trade consistency for throughput unless reads are strongly routed to a pod that has the write.
gRPC
For the Tuist module-cache endpoint, gRPC would be a protocol migration rather than a serving optimization. It still runs over HTTP/2, and protobuf bytes responses require copying each chunk into a Vec<u8>. It is useful for Bazel REAPI ByteStream clients, but it is not a shortcut to nginx-style sendfile for current module artifact downloads.
Validation
Latest validation for Marek’s review follow-up:
git diff --check
PATH="$HOME/.cargo/bin:$PATH" cargo fmt --manifest-path kura/Cargo.toml
PATH="$HOME/.cargo/bin:$PATH" cargo test --manifest-path kura/Cargo.toml --lib
PATH="$HOME/.cargo/bin:$PATH" cargo clippy --manifest-path kura/Cargo.toml --all-targets -- -D warnings
Previous validation on the branch also covered:
mise x rust@1.94.1 -- cargo fmt --manifest-path kura/Cargo.toml --check
mise x rust@1.94.1 -- cargo test --manifest-path kura/Cargo.toml
mise x rust@1.94.1 -- cargo clippy --manifest-path kura/Cargo.toml --all-targets -- -D warnings
helm template kura kura/ops/helm/kura >/tmp/kura-helm-render.yaml
docker build -f kura/Dockerfile -t tuist-kura-bench:current kura
Helm render confirmed only the existing ports are exposed:
containerPort: 4000
containerPort: 50051
containerPort: 7443
Current Docker benchmark
Latest local Docker benchmark used image tuist-kura-bench:current, built from this branch after the keep-alive improvements. It measured three artifact sizes through the same public KURA_PORT:
- 4 KiB Xcode CAS artifact for small-object behavior.
- 17,941,067-byte Xcode CAS artifact matching the reported timeout payload size.
- 64 MiB Gradle artifact for large-object behavior. The Xcode CAS route caps uploads at 25 MiB, so the large case uses Gradle, which allows up to 100 MiB.
| artifact | route | concurrency | mean |
| --- | --- | --- | --- |
| 4 KiB | Xcode CAS | 1 | 7.6 ms +/- 0.7 |
| 4 KiB | Xcode CAS | 8 | 25.8 ms +/- 3.7 |
| 4 KiB | Xcode CAS | 20 | 45.9 ms +/- 2.9 |
| 17,941,067 B | Xcode CAS | 1 | 13.7 ms +/- 2.2 |
| 17,941,067 B | Xcode CAS | 8 | 37.2 ms +/- 3.3 |
| 17,941,067 B | Xcode CAS | 20 | 85.3 ms +/- 2.8 |
| 64 MiB | Gradle | 1 | 16.0 ms +/- 3.0 |
| 64 MiB | Gradle | 8 | 78.7 ms +/- 4.3 |
| 64 MiB | Gradle | 20 | 184.1 ms +/- 3.7 |
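To put the concurrency 1 rows into throughput terms, a small helper like the one below can convert a size and a mean time into MiB/s. It assumes the reported mean is the wall time of a single complete download, which the table itself does not state, so treat the derived numbers as rough local-loopback figures only.

```rust
/// Throughput in MiB/s implied by moving `bytes` in `millis` milliseconds.
pub fn mib_per_second(bytes: u64, millis: f64) -> f64 {
    let mib = bytes as f64 / (1024.0 * 1024.0);
    mib / (millis / 1000.0)
}
```

Under that assumption, `mib_per_second(64 * 1024 * 1024, 16.0)` works out to about 4,000 MiB/s for the 64 MiB Gradle row, and `mib_per_second(17_941_067, 13.7)` to roughly 1,249 MiB/s for the timeout-sized Xcode CAS artifact.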
Replication stress
I also ran a local three-node Kura mesh from the same Docker image with static peers and ring size 3.
Precheck:
- Uploaded a 4 KiB Xcode CAS artifact to node A and verified it was readable from nodes B and C.
- Uploaded a 64 MiB Gradle artifact to node A and verified it was readable from nodes B and C.
- All nodes reported ready=true, ring_members=3, fd_timeout_count=0, and memory_pressure_state=0 before stress.
Stress workload per hyperfine run:
- 120 small reads across all three nodes.
- 45 large 64 MiB reads across all three nodes.
- 40 new 4 KiB writes into node A.
- 4 new 64 MiB Gradle writes into node A.
- The script only exits after the final small and large writes are readable from both peers.
Result:
- Mean: 6.976 s +/- 1.517 s across 3 runs.
- All nodes stayed ready after the run.
- fd_timeout_count=0 on all nodes.
- memory_pressure_state=0 on all nodes.
- Node A briefly reported outbox_messages=3 immediately after the run, then drained to 0 within 5 seconds. Nodes B and C stayed at 0.
- Idle memory after the run was about 155 MiB on node A, 146 MiB on node B, and 148 MiB on node C.
Local Docker cannot reproduce production NIC, LB, or real client h2 behavior, so this is a serving-path and local replication stress benchmark, not a replacement for an in-cluster read-only benchmark.
Follow-up benchmark needed
The next benchmark should be read-only and in-cluster, without changing cluster state:
- One test through the public Kura host.
- One test to the same Kura primary pod through a direct pod or node-local path.
- Compare HTTP/1.1 and HTTP/2/TLS, and record whether affected clients negotiate h2.
- Measure pod CPU, process RSS, network transmit, TCP retransmits, h2 stream concurrency, and artifact egress metrics.
- Run the same artifact sizes and concurrency 1, 8, and 20 so the results line up with the local Docker table.