fix(infra,kura): restore Kura gRPC upload throughput (gateway + runtime HTTP/2 windows, activity-based timeout)

GitHub issue · Closed

Open on GitHub

Metadata

Source

tuist/tuist #11266

Updated

Jun 24, 2026

Domains

Kura

Details

What changed

Restores large-blob upload throughput to Kura over gRPC (REAPI ByteStream.Write). The regression has two stacked ceilings — the gateway nginx in front of Kura, and Kura’s own gRPC server — so the fix has two layers, plus a CI/deploy change that made the second layer testable on staging.

1. Gateway nginx HTTP/2 windows — raised on every Kura gateway, in both render paths:

Dedicated gateways (kgw-<account-hash>-<region>): the KuraGateway controller ConfigMap (gatewayNginxConfigData).
Regional gateways (kura-eu-central / kura-us-east / kura-us-west): the platform chart’s ingress-nginx controller.config.

client-body-buffer-size: "4m"                 # WINDOW_UPDATE pacing while streaming upstream
http2-max-concurrent-streams: "32"            # bounds worst-case nginx memory per connection
http-snippet: "http2_body_preread_size 4m;"   # initial advertised request-body window

2. Kura gRPC runtime (kura/src/reapi/mod.rs) — so the Kura hop isn’t the next bottleneck once nginx is fixed:

Raise the tonic/hyper HTTP/2 windows from their 1 MiB defaults: initial_stream_window_size = 4 MiB, initial_connection_window_size = 16 MiB.
Keep max_connection_age at 300s (so connections still recycle every 5 min and rebalance onto fresh pods after a rolling deploy) but raise max_connection_age_grace 300s → 900s. The age-boundary GOAWAY only stops new streams — an in-flight upload runs until the grace expires — so the 300s grace was the actual bug that cut large uploads at age+grace=600s. 900s comfortably outlasts the largest observed upload (~680s), and worst-case connection lifetime stays ~20 min (vs the 2h a 3600+3600 pair would let a slow-but-active stream hold a connection + temp-file slot).
Add a per-chunk stall timeout on the upload path: each received chunk resets a 60s timer, so a connection with data actively flowing is never cut, but a stalled/vanished client is reclaimed promptly (returns DEADLINE_EXCEEDED and removes the temp file). This is the “kill only on inactivity, not on activity” behavior — tonic has no activity-based connection timeout, so it’s enforced at the stream level where the upload actually happens.

3. Build + deploy the Kura runtime image from Server Deployment (.github/workflows/server-deployment.yml):

The workflow already built the server and kura-controller images per commit, but the kura runtime image is otherwise only published by release.yml on push-to-main. So deploying a branch to staging shipped the chart (layer 1) while the kura binary stayed pinned to the latest released kura@ tag — making layer 2 impossible to validate pre-release. (Confirmed in practice: a staging deploy of this branch left the pods on ghcr.io/tuist/kura:0.10.1 and the 784MB blobs still died at exactly 300+300s.)
Now the build job also builds ghcr.io/tuist/kura:sha-<SHA> (from source, kura/Dockerfile, mirroring the controller step). The deploy job defaults kuraRuntime.image.tag to that per-commit image on staging, while canary/production keep pinning the resolved kura@ release tag so they never run an unreleased binary. An explicit kura_runtime_image_tag input still overrides everywhere.

Why

Large Bazel REAPI uploads to Kura were stuck in an infinite retry loop. The client’s binary gRPC log showed every ByteStream.Write capped at ~310KB/s = nginx’s default 64KB HTTP/2 body window ÷ the 193ms RTT. At that rate the 784MB librocksdb-sys artifacts need ~42 min, but streams are force-closed at ~600s (Kura’s max_connection_age + grace), so Bazel restarts from byte 0 forever: Write → 502 at 600s → QueryWriteStatus → NOT_FOUND → restart.

Root cause of the window cap: regression from #11192, which moved Kura gRPC from per-instance TCP-passthrough LBs (TLS terminated by tonic, 1MB windows ≈ 5.4MB/s/stream) behind ingress-nginx gateways. That PR’s analysis covered aggregate NIC/CPU but not HTTP/2 per-stream flow control, and its smoke test exercised HTTP roundtrips, not large REAPI writes — so it shipped unnoticed. Both nginx knobs must move together: http2_body_preread_size sets the initial window, client_body_buffer_size (ingress default 8k) paces the WINDOW_UPDATEs once nginx relays the body.

The kura-runtime layer was added after a first staging run showed the nginx fix alone plateaus: per-stream throughput rose ~20x but capped at ~1.2MB effective window (Kura’s own 1 MiB tonic window on the nginx→kura hop), and the 784MB blob still died at the hard 600s max_connection_age. Raising Kura’s windows lifts the plateau; replacing the age-based kill with an inactivity timeout means active uploads finish regardless of size.

Real-infra validation (staging)

Two staging phases, against grpcs://grpc.tuist-eu-central-1-staging.kura.tuist.dev (RTT ~209ms, client in Brazil, 300 Mbps uplink):

Phase A — gateway fix only (pods still on released 0.10.1): per-stream upload rose from the ~0.3 MB/s baseline to ~6 MB/s solo, but plateaued at ~1.2MB effective window and the two 784MB blobs were severed at connection-age 600.9s (= 0.10.1’s 300+300), at 79% / 88%. This is what motivated layers 2 and 3.

Phase B — full fix deployed (pods on ghcr.io/tuist/kura:sha-518f3597e34c, the branch binary, via change #3): the regression is gone.

Metric	baseline (pre-fix)	Phase A (nginx fix only)	Phase B (nginx + runtime)
784MB blob outcome	∞ retry loop	502 at ~88%, died at 600s	✅ 100% in 167–223s, single stream
big-blob per-stream	~0.31 MB/s	~1.15 MB/s (before dying)	3.5–4.7 MB/s
both giants combined	—	~2.2 MB/s (failed)	7.0 MB/s
aggregate upload phase	—	4.0 MB/s	5.9 MB/s

In Phase B all 337 Writes succeeded with zero QueryWriteStatus — both ~784MB RocksDB artifacts uploaded to 100% in a single stream (784,550,880 in 167s, 783,930,496 in 223s), no connection-age teardown, no restart-from-zero.

Warm-cache run (everything already cached): 36s end-to-end, 803 GetActionResult hits + 98 Reads (90.7 MB, incl. the single 90.6 MB artifact) all OK, zero uploads, no auth errors — the happy path is fast and clean.

Local e2e validation

kura/test/e2e/grpc-upload-throughput/ — a toxiproxy + nginx + kura docker-compose harness that runs one kura backend behind baseline-vs-patched nginx configs (generated from the live chart values) under identical WAN latency, and asserts the patched path is ≥4x faster. Measured (16MB payload): at 192ms RTT, baseline 0.32 MB/s vs patched 12.24 MB/s (38.7x); baseline reproduces the production ~310 KB/s exactly. Runs as the gateway-throughput shard of the Kura e2e workflow. A controller unit test (TestGatewayNginxConfigMatchesChart) keeps the dedicated and regional render paths in sync.

Note: the e2e exercises a single stream against a bare fast kura, so it validates the nginx window in isolation and over-predicts; the staging runs are what surfaced the kura-side plateau (hence layer 2) and proved the end-to-end fix (hence layer 3).

Verification

infra/kura-controller go test ./... passes (incl. the chart-vs-controller equality assertion).
helm template renders the three window keys into each regional gateway ConfigMap (anchored to one source).
kura/src/reapi/mod.rs changes compile (cargo check --lib) and pass cargo clippy.
server-deployment.yml lints clean (actionlint); staging deploy confirmed the pods rolled onto kura:sha-518f3597e34c.

Deferred follow-up

Credential-token TTL vs. long-lived sessions. Now that uploads complete and builds run longer, this is the leading remaining symptom: in the full cold build, 1,089 of 2,886 Reads failed UNAUTHENTICATED in a single burst ~414s (~7 min) into the build — a fixed-TTL token expiring mid-invocation, not a Kura fault. Bazel doesn’t retry them, so those cache reads are lost (build still completes, with a degraded cache-hit rate where the burst lands). The warm-cache build (36s, under the TTL) never hits it. Fix is outside the upload path — longer-lived REAPI tokens or a Bazel credential helper that refreshes proactively / on UNAUTHENTICATED — and is tracked separately.

🤖 Generated with Claude Code

Comments

fortmarek Jun 16, 2026

Codex review findings:

[P2] Staging re-deploys with image_tag can deploy a non-existent Kura runtime image. The build job is skipped when inputs.image_tag is set, but staging still defaults kuraRuntime.image.tag to $IMAGE_TAG in .github/workflows/server-deployment.yml. Since the new ghcr.io/tuist/kura:sha-* image is only built inside the skipped build job, a manual staging re-roll of a prebuilt server image can send Kura into ImagePullBackOff unless the caller also knows to pass kura_runtime_image_tag. I’d only use $IMAGE_TAG for staging when this run actually built the runtime image; otherwise fall back to the resolved kura@ tag unless explicitly overridden.
[P2] ByteStream.Write still leaks partial temp files on immediate stream errors. The timeout branch removes temp_path, but Ok(result) => match result? in kura/src/reapi/mod.rs returns early on transport/cancel errors without cleanup. That is exactly the “client vanished” class this change is trying to reclaim; those partial files count toward ensure_tmp_dir_capacity, so repeated cancelled uploads can exhaust the tmp budget. Handle Err(status) explicitly and remove the temp file before returning.
[P3] The throughput harness docs/output are stale after the runtime-window change in this same PR. kura/test/e2e/grpc-upload-throughput/client/main.go and the harness README still describe direct_kura as tonic’s default ~1 MiB window, but Kura now advertises 4 MiB. Future readers will misinterpret the control path and measured ceilings.

Validation: I ran go test ./... in infra/kura-controller from a PR-head snapshot and it passed. I did not run the Rust compile/e2e throughput suite locally.