Hive
fix(server): unblock recent bucket production deploy
GitHub issue · Closed
What changed
- Reworked the `20260515100000` ClickHouse migration so the 100/250/500/750 recent-run bucket tables are backfilled from the existing `test_case_runs_recent_per_case` aggregate instead of rescanning raw `test_case_runs` partitions.
- Kept the backfill chunked by project and retained an immediate per-project fallback if ClickHouse still reports memory pressure for a chunk.
- Mirrored the Kura introspection OAuth keys into `server-external-secrets` when the managed deployment uses the newer `kura-shared-secrets` path, preserving compatibility with old server ReplicaSets during failed or partial upgrades.
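The chunk-then-fallback control flow described in the second bullet can be sketched roughly as follows. This is a minimal Python illustration, not the migration's Elixir code: `MemoryPressureError`, `run_backfill`, and the default `chunk_size` are hypothetical stand-ins for whatever the migration actually uses to detect memory pressure and execute a backfill statement.

```python
from typing import Callable, Iterable, List, Sequence


class MemoryPressureError(Exception):
    """Hypothetical stand-in for a ClickHouse memory-limit failure."""


def chunked(items: Sequence[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def backfill(project_ids: Sequence[str],
             run_backfill: Callable[[List[str]], None],
             chunk_size: int = 50) -> List[List[str]]:
    """Backfill in project chunks; retry a failing chunk one project at a time.

    Returns the batches that were executed successfully, in order.
    """
    executed: List[List[str]] = []
    for chunk in chunked(project_ids, chunk_size):
        try:
            run_backfill(chunk)
            executed.append(chunk)
        except MemoryPressureError:
            # Immediate fallback: split the failing chunk into single projects.
            # A failure here is not caught, so a single oversized project
            # still surfaces as an error.
            for project_id in chunk:
                run_backfill([project_id])
                executed.append([project_id])
    return executed
```

Only the chunk that hit memory pressure is degraded to per-project statements; the other chunks keep the cheaper batched path.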
Why
The previous chunking fix changed the failure mode but did not unblock production: the migration continued scanning historical raw rows for each smaller bucket, so Helm hit its pre-upgrade hook timeout while the job was still backfilling. Because the migration drops and recreates the bucket tables at the start, every retry could restart from zero.
The same failed upgrade also exposed a secret compatibility gap. New server pods read Kura OAuth credentials from kura-shared-secrets, but old ReplicaSet pods still referenced server-external-secrets. The pre-upgrade ExternalSecret hook had stopped writing those legacy keys, so old pods could get stuck in CreateContainerConfigError while Helm was still trying to complete the release.
Impact
The migration now reshapes the already-maintained 1000-run per-case aggregate into smaller sorted bucket states, avoiding the raw historical table scan that was blowing through deploy time. The server ExternalSecret remains backward-compatible long enough for old pods to roll away cleanly.
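To illustrate why the smaller buckets can be derived without touching raw rows, here is a conceptual Python model. It assumes each run can be represented as a `(ran_at, run_id)` pair and that the per-case aggregate already retains the newest 1000 runs. The real migration operates on ClickHouse aggregate states, so the names and data shapes here are illustrative only.

```python
from typing import Dict, List, Tuple

# (ran_at, run_id) pairs; a hypothetical shape for one run entry.
Run = Tuple[int, str]

BUCKET_SIZES = (100, 250, 500, 750)


def derive_buckets(recent_runs: List[Run]) -> Dict[int, List[Run]]:
    """Derive each smaller bucket from one case's retained recent runs.

    `recent_runs` is assumed to hold at most the 1000 newest runs for a
    case, in any order. Each bucket keeps the newest N, sorted newest first.
    """
    newest_first = sorted(recent_runs, key=lambda run: run[0], reverse=True)
    return {size: newest_first[:size] for size in BUCKET_SIZES}
```

Because every bucket is a prefix of the same newest-first ordering, the input is bounded by the retained runs per case rather than by the full run history.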
Validation
- `helm template` for production and canary managed values, confirming `server-external-secrets` now contains `KURA_CONTROL_PLANE_CLIENT_ID` and `KURA_CONTROL_PLANE_CLIENT_SECRET` from `kura-introspection-oauth-client` while new server pods still reference `kura-shared-secrets`.
- `helm lint` for production and canary managed values.
- `clickhouse local` smoke test using the exact aggregate-state backfill query against sample data.
- `mix format --check-formatted priv/ingest_repo/migrations/20260515100000_create_test_case_runs_recent_buckets_per_case_mvs.exs` after fetching deps; local lockfile refresh was restored before committing.
- Elixir syntax parse and standard formatter check for the changed migration file.
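The rendered-manifest check could be automated along these lines. This is a hedged sketch: it assumes the `helm template` output has already been parsed into a Python dict (for example with a YAML loader) and that the ExternalSecret lists its keys individually under `spec.data[].secretKey` rather than through `dataFrom`. The function name `missing_secret_keys` is hypothetical.

```python
from typing import Dict, List, Set

REQUIRED_KEYS = {
    "KURA_CONTROL_PLANE_CLIENT_ID",
    "KURA_CONTROL_PLANE_CLIENT_SECRET",
}


def missing_secret_keys(external_secret: Dict,
                        required: Set[str] = REQUIRED_KEYS) -> Set[str]:
    """Return required keys the ExternalSecret does not declare in spec.data."""
    entries: List[Dict] = external_secret.get("spec", {}).get("data", [])
    declared = {entry.get("secretKey") for entry in entries}
    return set(required) - declared
```

An empty result for the `server-external-secrets` manifest means both legacy keys are declared; any names returned identify what old ReplicaSet pods would fail to resolve.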