Hive
feat(infra): provision Postgres in-cluster via CNPG with /ops/db UI
GitHub issue · Closed
Implements the Moving Postgres In-Cluster RFC. Moves the managed Postgres data plane from Supabase to CloudNativePG on the workload clusters, and rebuilds the operator workflow pgweb used to cover as a /ops/db LiveView inside the authenticated /ops pipeline.
Provisioning + the full Supabase→CNPG replication soak have been run end-to-end on staging (see “Validated on staging” below); the mode: cnpg cutover itself stays disabled and ships per-env behind the runbook.
What changed
CloudNativePG plumbing
- Adds
cloudnative-pgas a subchart ofinfra/helm/platform/. The operator ships on every workload cluster via the same path that brings up cert-manager, external-dns, and ESO. No standalone install step. - Adds a
platform-installjob to.github/workflows/server-deployment.yml(mirrorsobservability-install/tailscale-operator-install) so platform CRDs roll on every deploy. - Renders a CNPG
Cluster+ScheduledBackup+ ESO-synced Secrets (Tigris backup credentials, per-managed-role passwords) underinfra/helm/tuist/templates/postgresql-cnpg.yaml. Per-env values set size (staging 2×100 GiB, canary/production 3×250 GiB) and the env Tigris bucket. Daily base backup at 03:00 UTC (0 0 3 * * *— CNPG’s cron is 6-field, leading seconds) plus continuous WAL archiving. - Two independent rollout knobs:
postgresql.cnpg.enabled(defaulttrueon managed envs) renders the cluster so the next deploy brings CNPG up alongside Supabase for the soak. Server pods unaffected.postgresql.mode == "cnpg"flips server / processor / migration-jobDATABASE_URLonto the cluster. Staysexternaluntil each env’s soak passes.
PodMonitoris left off on managed clusters (both the operator subchart and the cluster CR): they scrape via Grafana Alloy annotation discovery, not prometheus-operator, so themonitoring.coreos.comCRD set isn’t installed. Wiring CNPG metrics into Alloy is a follow-up (see below).
/ops/db LiveView
Tuist.Ops.Databasebacking module, read-only by construction: the ad-hoc SQL runner gates on a SELECT/WITH/EXPLAIN/SHOW grammar AND runs insideBEGIN READ ONLYwith a 5sstatement_timeout. SELECT/WITH execute through a server-side cursor (DECLARE … FETCH FORWARD limit+1) so aSELECT * FROM huge_tablecan’t materialize the whole result set into the BEAM. Stats helpers callEctoPSQLExtras/pg_stat_*.TuistWeb.OpsDatabaseLiveat/ops/db(Overview / Tables / Indexes / Activity / Replication / Backups / SQL). The Tables list excludes Postgres-internal, CNPG-operator, and TimescaleDB schemas. SQL tab has Cmd/Ctrl+Enter submit + Markdown/JSON/CSV export.- The Backups tab pairs the Postgres-side WAL-archiving view (
pg_stat_archiver) with a base-backup inventory read from the Kubernetes API:Tuist.Ops.Database.list_base_backups/0lists the cluster’s CNPGBackupCRs (per-snapshot phase / method / completion / error — the signalpg_stat_archivercan’t surface, e.g. awalArchivingFailingsnapshot next tocompletedones). A namespaced Role grants the server ServiceAccountget/listonpostgresql.cnpg.io/{backups,scheduledbackups}in the release namespace (rendered only when the cluster is). Degrades gracefully — when no cluster is provisioned (dev, or a self-host on embedded/external Postgres)TUIST_CNPG_NAMESPACEis unset and the tab shows an empty state instead of erroring; once the cluster is enabled it lists backups even during the pre-cutover soak (the env is wired whenevercnpg.enabledis true, independent ofmode). TuistWeb.OpsDatabaseTableLiveat/ops/db/tables/:schema/:name: paginated preview with single-column sort andNoora.Filterchips. URL identifiers are whitelisted against the table’s real columns before splicing; values bind as$N.- UUID-shaped 16-byte binaries render as canonical UUIDs; other non-UTF-8 binaries as
\x…hex.
Legacy-table cleanup
drop_legacy_postgres_tablesmigration removes tables the Elixir app no longer uses, so the in-cluster schema starts clean:build_runs(build analytics moved to ClickHouse; the lone TimescaleDB hypertable on Supabase),artifacts+bundles(ClickHouse bundle analytics),runner_assignments(replaced by runner_claims/sessions), and the Rails-eraque_*+ar_internal_metadata.IF EXISTS, no live FKs into the set. Dropping the hypertable lets the CNPG schema run without the timescaledb extension.
Restore-validation drill
.github/workflows/cnpg-restore-drill.yml: weekly (+ on-demand per env) it bootstraps an isolated recoveryClusterfrom the barman object store, waits for healthy, asserts the restored schema matches the live cluster and the data is present and within 10% of live, then tears it down. ReadsdestinationPath/endpointURL/imageName/storage.sizefrom the live cluster spec so the recovery config can’t drift per env. Catches backup bitrot (credential/retention/version regressions) that “archiving is healthy” can’t. Postgres data-plane only; ClickHouse + object-storage DR is separate.- Proven end-to-end on staging — run 26632268236: the full workflow bootstrapped a recovery
Clusterfrom the backup, reached healthy, asserted the restored schema + data against live, and tore the recovery cluster down — every step green. (workflow_dispatchis exposed by a minimal placeholder of this workflow merged tomainin #10987 — GitHub only surfaces dispatch for workflows on the default branch; the full workflow lives here and was dispatched against this branch.) - A failed run on
main(the weekly scheduled drill) pings the notifications Slack channel vianotify-failure.yml— the sameworkflow_runlistener the server deploys use. Teardown isif: always(), so a failure still tears its recovery cluster down rather than stranding it.
Bootstrap SQL + runbook
infra/cnpg/:tuist-processor-grants.sql+tuist-ops-ro-grants.sql(per-table grantsmanaged.roles[].inRolescan’t express),README.md(bring-up + rotation), andMIGRATION.md— the Supabase→CNPG runbook, rewritten from the staging dry run (schema built by running Ecto migrations astuist_app, notpg_dump; publication scoped to the CNPG∩Supabase table intersection;REPLICA IDENTITY FULLon PK-less tables; seed-data and unlogged-table handling; 72/48/72h soak gates incl. a passing restore drill; 14-day reverse replication for rollback).
Why
- Declarative role management — CNPG
managed.roles[]reconcilestuist_processor+tuist_ops_rofrom Secrets, no out-of-band SQL. - Connection topology — direct session-mode connections to the cluster’s
-rwService instead of Supavisor. - GDPR data-handling surface — Postgres + ClickHouse become one audited substrate again.
Validated on staging
Ran the full provisioning + soak against staging, which is the strongest validation here and shook out a string of real bugs this PR now fixes:
- CNPG cluster healthy (primary + sync replica); WAL archiving to Tigris OK after clearing stale objects from a prior rolled-back attempt; on-demand base backup
completedand PITR window established. - Restore drill passed end-to-end: a recovery cluster built from the backup matched the live data exactly (schema + row counts), including very recent writes — so the backup is provably restorable, not just produced.
- Replication soak running: 55-table publication (CNPG∩Supabase, excluding oban / dead
build_runs/ unloggedque_lockers/ Supabase-onlyrunner_assignments), all tables streaming at zero lag. - Bugs surfaced + fixed along the way: PodMonitor needs prometheus-operator CRDs absent on managed clusters; CNPG’s 6-field cron made base backups run hourly not daily; staging CNPG memory request too high for the cpx22 workers; Supabase carries TimescaleDB + Supabase system schemas + Rails-era tables that a
pg_dump-based schema copy would drag onto vanilla CNPG; Ecto seed migrations collide with logical replication’s initial COPY;miseSLSA verification 404s on github-backend tools in the deploy workflow.
Cutover plan (still disabled)
mode: cnpg ships off. Per env: provisioning lands on deploy → 72/48/72h zero-lag soak (gated on a passing restore drill + healthy cluster) → a separate small PR flips mode to cnpg. Reverse replication stays alive 14 days as the rollback gate. Full walkthrough in infra/cnpg/MIGRATION.md.
Follow-ups (not in this PR)
- Grafana alert on
pg_stat_archiver.failed_count > 0/ archive staleness. The Backups tab is the manual signal today. Prerequisite: CNPG metrics aren’t scraped yet (PodMonitor is off — see above), so this needs the scrape path established first (annotation discovery into Alloy, or aPrometheusRuleonce the CRDs exist), then alert rules routed to the existing IRM destination.
How to test locally
/ops/db works against the dev Postgres unchanged — it uses Tuist.Repo, and Tuist.Authorization.Checks.ops_access/2 short-circuits to true in dev.
cd server && mise run dev
# log in as tuistrocks@tuist.dev / tuistrocks, then visit http://localhost:8977/ops/db
mix test test/tuist/ops/database_test.exs # 22 tests, 0 failures
Chart validation:
helm dependency update infra/helm/platform && helm template platform infra/helm/platform >/dev/null
for env in staging canary production; do
helm template tuist infra/helm/tuist \
-f infra/helm/tuist/values-managed-common.yaml \
-f infra/helm/tuist/values-managed-$env.yaml \
--set server.image.tag=test --set kuraController.image.tag=test \
--set macosFleet.image.tag=test --set runnersController.image.tag=test \
--set xcresultProcessor.image.tag=test >/dev/null && echo "$env OK"
done
The restore drill can be dry-run with gh workflow run cnpg-restore-drill.yml -f environment=staging once an env has a live CNPG cluster.
🤖 Generated with Claude Code