Hive Hive
Sign in

feat(infra): provision Postgres in-cluster via CNPG with /ops/db UI

GitHub issue · Closed

Metadata
Source
tuist/tuist #10942
Updated
Jun 24, 2026
Details

Implements the Moving Postgres In-Cluster RFC. Moves the managed Postgres data plane from Supabase to CloudNativePG on the workload clusters, and rebuilds the operator workflow pgweb used to cover as a /ops/db LiveView inside the authenticated /ops pipeline.

Provisioning + the full Supabase→CNPG replication soak have been run end-to-end on staging (see “Validated on staging” below); the mode: cnpg cutover itself stays disabled and ships per-env behind the runbook.

What changed

CloudNativePG plumbing

  • Adds cloudnative-pg as a subchart of infra/helm/platform/. The operator ships on every workload cluster via the same path that brings up cert-manager, external-dns, and ESO. No standalone install step.
  • Adds a platform-install job to .github/workflows/server-deployment.yml (mirrors observability-install / tailscale-operator-install) so platform CRDs roll on every deploy.
  • Renders a CNPG Cluster + ScheduledBackup + ESO-synced Secrets (Tigris backup credentials, per-managed-role passwords) under infra/helm/tuist/templates/postgresql-cnpg.yaml. Per-env values set size (staging 2×100 GiB, canary/production 3×250 GiB) and the env Tigris bucket. Daily base backup at 03:00 UTC (0 0 3 * * * — CNPG’s cron is 6-field, leading seconds) plus continuous WAL archiving.
  • Two independent rollout knobs:
    • postgresql.cnpg.enabled (default true on managed envs) renders the cluster so the next deploy brings CNPG up alongside Supabase for the soak. Server pods unaffected.
    • postgresql.mode == "cnpg" flips server / processor / migration-job DATABASE_URL onto the cluster. Stays external until each env’s soak passes.
  • PodMonitor is left off on managed clusters (both the operator subchart and the cluster CR): they scrape via Grafana Alloy annotation discovery, not prometheus-operator, so the monitoring.coreos.com CRD set isn’t installed. Wiring CNPG metrics into Alloy is a follow-up (see below).

/ops/db LiveView

  • Tuist.Ops.Database backing module, read-only by construction: the ad-hoc SQL runner gates on a SELECT/WITH/EXPLAIN/SHOW grammar AND runs inside BEGIN READ ONLY with a 5s statement_timeout. SELECT/WITH execute through a server-side cursor (DECLARE … FETCH FORWARD limit+1) so a SELECT * FROM huge_table can’t materialize the whole result set into the BEAM. Stats helpers call EctoPSQLExtras / pg_stat_*.
  • TuistWeb.OpsDatabaseLive at /ops/db (Overview / Tables / Indexes / Activity / Replication / Backups / SQL). The Tables list excludes Postgres-internal, CNPG-operator, and TimescaleDB schemas. SQL tab has Cmd/Ctrl+Enter submit + Markdown/JSON/CSV export.
  • The Backups tab pairs the Postgres-side WAL-archiving view (pg_stat_archiver) with a base-backup inventory read from the Kubernetes API: Tuist.Ops.Database.list_base_backups/0 lists the cluster’s CNPG Backup CRs (per-snapshot phase / method / completion / error — the signal pg_stat_archiver can’t surface, e.g. a walArchivingFailing snapshot next to completed ones). A namespaced Role grants the server ServiceAccount get/list on postgresql.cnpg.io/{backups,scheduledbackups} in the release namespace (rendered only when the cluster is). Degrades gracefully — when no cluster is provisioned (dev, or a self-host on embedded/external Postgres) TUIST_CNPG_NAMESPACE is unset and the tab shows an empty state instead of erroring; once the cluster is enabled it lists backups even during the pre-cutover soak (the env is wired whenever cnpg.enabled is true, independent of mode).
  • TuistWeb.OpsDatabaseTableLive at /ops/db/tables/:schema/:name: paginated preview with single-column sort and Noora.Filter chips. URL identifiers are whitelisted against the table’s real columns before splicing; values bind as $N.
  • UUID-shaped 16-byte binaries render as canonical UUIDs; other non-UTF-8 binaries as \x… hex.

Legacy-table cleanup

  • drop_legacy_postgres_tables migration removes tables the Elixir app no longer uses, so the in-cluster schema starts clean: build_runs (build analytics moved to ClickHouse; the lone TimescaleDB hypertable on Supabase), artifacts + bundles (ClickHouse bundle analytics), runner_assignments (replaced by runner_claims/sessions), and the Rails-era que_* + ar_internal_metadata. IF EXISTS, no live FKs into the set. Dropping the hypertable lets the CNPG schema run without the timescaledb extension.

Restore-validation drill

  • .github/workflows/cnpg-restore-drill.yml: weekly (+ on-demand per env) it bootstraps an isolated recovery Cluster from the barman object store, waits for healthy, asserts the restored schema matches the live cluster and the data is present and within 10% of live, then tears it down. Reads destinationPath / endpointURL / imageName / storage.size from the live cluster spec so the recovery config can’t drift per env. Catches backup bitrot (credential/retention/version regressions) that “archiving is healthy” can’t. Postgres data-plane only; ClickHouse + object-storage DR is separate.
  • Proven end-to-end on staging — run 26632268236: the full workflow bootstrapped a recovery Cluster from the backup, reached healthy, asserted the restored schema + data against live, and tore the recovery cluster down — every step green. (workflow_dispatch is exposed by a minimal placeholder of this workflow merged to main in #10987 — GitHub only surfaces dispatch for workflows on the default branch; the full workflow lives here and was dispatched against this branch.)
  • A failed run on main (the weekly scheduled drill) pings the notifications Slack channel via notify-failure.yml — the same workflow_run listener the server deploys use. Teardown is if: always(), so a failure still tears its recovery cluster down rather than stranding it.

Bootstrap SQL + runbook

  • infra/cnpg/: tuist-processor-grants.sql + tuist-ops-ro-grants.sql (per-table grants managed.roles[].inRoles can’t express), README.md (bring-up + rotation), and MIGRATION.md — the Supabase→CNPG runbook, rewritten from the staging dry run (schema built by running Ecto migrations as tuist_app, not pg_dump; publication scoped to the CNPG∩Supabase table intersection; REPLICA IDENTITY FULL on PK-less tables; seed-data and unlogged-table handling; 72/48/72h soak gates incl. a passing restore drill; 14-day reverse replication for rollback).

Why

  • Declarative role management — CNPG managed.roles[] reconciles tuist_processor + tuist_ops_ro from Secrets, no out-of-band SQL.
  • Connection topology — direct session-mode connections to the cluster’s -rw Service instead of Supavisor.
  • GDPR data-handling surface — Postgres + ClickHouse become one audited substrate again.

Validated on staging

Ran the full provisioning + soak against staging, which is the strongest validation here and shook out a string of real bugs this PR now fixes:

  • CNPG cluster healthy (primary + sync replica); WAL archiving to Tigris OK after clearing stale objects from a prior rolled-back attempt; on-demand base backup completed and PITR window established.
  • Restore drill passed end-to-end: a recovery cluster built from the backup matched the live data exactly (schema + row counts), including very recent writes — so the backup is provably restorable, not just produced.
  • Replication soak running: 55-table publication (CNPG∩Supabase, excluding oban / dead build_runs / unlogged que_lockers / Supabase-only runner_assignments), all tables streaming at zero lag.
  • Bugs surfaced + fixed along the way: PodMonitor needs prometheus-operator CRDs absent on managed clusters; CNPG’s 6-field cron made base backups run hourly not daily; staging CNPG memory request too high for the cpx22 workers; Supabase carries TimescaleDB + Supabase system schemas + Rails-era tables that a pg_dump-based schema copy would drag onto vanilla CNPG; Ecto seed migrations collide with logical replication’s initial COPY; mise SLSA verification 404s on github-backend tools in the deploy workflow.

Cutover plan (still disabled)

mode: cnpg ships off. Per env: provisioning lands on deploy → 72/48/72h zero-lag soak (gated on a passing restore drill + healthy cluster) → a separate small PR flips mode to cnpg. Reverse replication stays alive 14 days as the rollback gate. Full walkthrough in infra/cnpg/MIGRATION.md.

Follow-ups (not in this PR)

  • Grafana alert on pg_stat_archiver.failed_count > 0 / archive staleness. The Backups tab is the manual signal today. Prerequisite: CNPG metrics aren’t scraped yet (PodMonitor is off — see above), so this needs the scrape path established first (annotation discovery into Alloy, or a PrometheusRule once the CRDs exist), then alert rules routed to the existing IRM destination.

How to test locally

/ops/db works against the dev Postgres unchanged — it uses Tuist.Repo, and Tuist.Authorization.Checks.ops_access/2 short-circuits to true in dev.

cd server && mise run dev
# log in as tuistrocks@tuist.dev / tuistrocks, then visit http://localhost:8977/ops/db
mix test test/tuist/ops/database_test.exs # 22 tests, 0 failures

Chart validation:

helm dependency update infra/helm/platform && helm template platform infra/helm/platform >/dev/null
for env in staging canary production; do
helm template tuist infra/helm/tuist \
-f infra/helm/tuist/values-managed-common.yaml \
-f infra/helm/tuist/values-managed-$env.yaml \
--set server.image.tag=test --set kuraController.image.tag=test \
--set macosFleet.image.tag=test --set runnersController.image.tag=test \
--set xcresultProcessor.image.tag=test >/dev/null && echo "$env OK"
done

The restore drill can be dry-run with gh workflow run cnpg-restore-drill.yml -f environment=staging once an env has a live CNPG cluster.

🤖 Generated with Claude Code

Comments
TA
tuist-atlas[bot] Jun 5, 2026

The changes from this PR are now available in release xcresult-processor-image@0.11.0. Postgres is now provisioned in-cluster via CNPG with the /ops/db UI.