fix(server): stop canary 500s from web-pool/Oban contention on create_project
GitHub issue · Closed
What & why
Canary was returning HTTP 500s on DB-backed writes (POST /api/projects, i.e. tuist project create): ~999 DBConnection.ConnectionError events in 6h in the can Sentry environment, escalating after the Supabase-to-CNPG cutover (#11063). Production was unaffected.
Root cause
The 500s are app-side Tuist.Repo pool exhaustion, not a CNPG max_connections limit. Live evidence on canary:
- CNPG max_connections = 100, only ~25/100 backends in use (15 tuist_app + 10 tuist_processor), zero “too many clients” / “remaining connection slots” in 24h of Postgres logs. Postgres never refused a connection; sync replication healthy.
- The canary web tier is a single replica whose 15-connection pool is shared between request handling and the in-node Oban default queue (concurrency 10, base_queues in server/config/runtime.exs). AutomationScheduler fans out one AlertEvaluationWorker per enabled alert every ~1 min. Canary has ~2,555 enabled alerts (every project gets a default “Flaky test detection” alert via seed_default_alert, cadence 5m), so the default queue carries a continuous ~8.5 jobs/s. Under load the queue’s 10 slots consume most of the 15-pool, starving web checkouts, which time out after ~3.8s and surface as 500s.

POST /api/projects turned those timeouts into 500s because the controller only rescued Ecto.InvalidChangesetError, and the non-transactional create_project! (Repo.insert! then a hard {:ok, _} = seed_default_alert(...) match) could also leave an orphaned project behind.
Why canary-only (and why not latency): in-cluster CNPG is actually faster per-query than external Supabase (~0.2ms SELECT). Production is unaffected because it stays on Supabase and runs 2-5 web replicas, spreading the same per-alert load across independent pools. Canary’s single replica absorbs the whole alert workload on one 15-pool.
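The load figures in the root cause can be checked with a small sketch. It assumes each enabled alert is evaluated once per 5m cadence and that, in the worst case, every busy Oban slot holds one pool connection. The function names are hypothetical and do not come from the Tuist codebase.

```python
def jobs_per_second(enabled_alerts: int, cadence_seconds: int) -> float:
    """Steady-state enqueue rate if every alert runs once per cadence."""
    return enabled_alerts / cadence_seconds


def web_headroom(pool_size: int, queue_concurrency: int) -> int:
    """Connections left for web requests when every queue slot is busy.

    Assumption: each running job checks out at most one connection.
    """
    return pool_size - queue_concurrency


if __name__ == "__main__":
    rate = jobs_per_second(enabled_alerts=2555, cadence_seconds=5 * 60)
    print(f"default queue load: ~{rate:.1f} jobs/s")
    print(f"web headroom at pool 15: {web_headroom(15, 10)}")
    print(f"web headroom at pool 30: {web_headroom(30, 10)}")
```

With the numbers quoted above this gives roughly 8.5 jobs/s, and a saturated queue leaves 5 connections for web requests at pool 15 versus 20 at pool 30.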
Changes
Canary 500s fix
- fix(infra): canary TUIST_DATABASE_POOL_SIZE 15 → 30. The real constraint is the app pool vs its Oban consumers, not max_connections, so size the pool above the default queue’s concurrency (10) to keep web headroom. 30 web + 10 processor + overhead stays well under 100.
- fix(server): switch POST /api/projects to the transactional Projects.create_project/2 (Ecto.Multi: project insert + default-alert seed atomic). {:error, changeset} maps to 400 with the same message as before (API contract preserved); infra errors now surface as 5xx without orphaning a project.
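The all-or-nothing behavior of the server fix can be illustrated outside Elixir. The sketch below is not the PR's code: it swaps in Python's sqlite3 for Ecto.Multi and Postgres, and the table layout, function names, and seed_fails switch are hypothetical. It shows a validation failure returned as an error value (the 400 path) and an infrastructure failure propagating after rollback (the 5xx path), with no orphaned project in either case.

```python
import sqlite3


def new_database() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical projects/alerts tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE alerts (id INTEGER PRIMARY KEY, "
        "project_id INTEGER NOT NULL, name TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def create_project(conn: sqlite3.Connection, name: str, seed_fails: bool = False):
    """Insert a project and its default alert in one transaction.

    Returns ("ok", project_id), or ("error", message) on a constraint
    violation. Any other exception propagates after the rollback.
    """
    try:
        with conn:  # commits on success, rolls back if the block raises
            cur = conn.execute("INSERT INTO projects (name) VALUES (?)", (name,))
            project_id = cur.lastrowid
            if seed_fails:
                # Stand-in for a pool timeout or other infrastructure error.
                raise RuntimeError("simulated infrastructure failure")
            conn.execute(
                "INSERT INTO alerts (project_id, name) VALUES (?, ?)",
                (project_id, "Flaky test detection"),
            )
        return ("ok", project_id)
    except sqlite3.IntegrityError as exc:
        return ("error", str(exc))
```

Calling create_project twice with the same name returns an error tuple on the second call, while a call with seed_fails=True raises and leaves the projects table unchanged.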
CNPG production-cutover hardening
The investigation surfaced two forward-looking issues for when production cuts over to CNPG, so this PR also hardens the chart + runbook:
- fix(infra): explicit max_connections. CNPG ran at the operator default of 100. Production is the inverse of canary’s situation: its web tier autoscales to 5 replicas (5 × pool 15 = 75, plus processor, migration, and replication/monitoring/superuser overhead), which at peak/deploy-surge would exceed 100 and cause the Postgres-side “too many clients” refusals canary never hit. Set it explicitly (default 100, production 200), and document the budget formula in infra/cnpg/MIGRATION.md, including why a multi-replica env must not copy canary’s per-replica pool of 30 (5 × 30 = 150 > 100).
- fix(infra): CNPG Pooler (PgBouncer, transaction mode) for the processor. A transaction-mode pooler fronts the cluster for the processor, so its prepare: :unnamed connection shape stays constant across the cutover (it matches the processor’s existing Supabase Supavisor :6543 path). Gated on postgresql.cnpg.pooler.enabled: enabled on canary and staging (which already run on CNPG) so the production-post-cutover topology soaks there first; off on production until its own cutover. Activation is just the flag: with no custom certificate secrets, CNPG’s built-in PgBouncer integration manages auth itself (it creates the cnpg_pooler_pgbouncer role + user_search function in the postgres database on reconcile and issues the TLS cert), so there is no SQL bootstrap. The one caveat: don’t add custom cert secrets, or the built-in integration is disabled and you own auth.
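The max_connections budget described above can be sketched as a simple sum. The exact formula in infra/cnpg/MIGRATION.md is not reproduced here, so this is only a guess at its shape: the overhead of 15 connections and the single surge replica during a rolling deploy are assumed values, and the function names are hypothetical.

```python
def connection_budget(
    web_replicas: int,
    pool_per_replica: int,
    processor_pool: int,
    overhead: int,
) -> int:
    """Worst-case backend count: every pool fully checked out plus overhead.

    overhead covers migration, replication, monitoring, and superuser
    connections; its size here is an assumption, not a measured value.
    """
    return web_replicas * pool_per_replica + processor_pool + overhead


def fits(budget: int, max_connections: int) -> bool:
    """True when the worst-case budget stays within max_connections."""
    return budget <= max_connections


if __name__ == "__main__":
    OVERHEAD = 15  # assumed
    canary = connection_budget(1, 30, 10, OVERHEAD)
    # 5 replicas plus one assumed surge replica during a deploy.
    prod_surge = connection_budget(6, 15, 10, OVERHEAD)
    # Production copying canary's per-replica pool of 30, web tier alone.
    prod_pool_30 = connection_budget(5, 30, 0, 0)
    print(canary, fits(canary, 100))
    print(prod_surge, fits(prod_surge, 100), fits(prod_surge, 200))
    print(prod_pool_30, fits(prod_pool_30, 100))
```

Under these assumptions canary stays well inside 100, production at pool 15 overshoots 100 during a deploy surge but fits in 200, and five replicas at pool 30 exceed 100 from the web tier alone.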
How the topology differs from Supabase (and why)
On Supabase, both tiers go through Supavisor: the processor in transaction mode (:6543, prepare: :unnamed) and the web tier in session mode (*.pooler.supabase.com:5432, prepare: :named). This PR mirrors only the processor’s pooler and connects the web tier directly to the CNPG -rw Service instead of standing up a session-mode pooler in front of it.
That is a deliberate difference, not an oversight:
- The web tier runs Oban, whose PG notifier (LISTEN/NOTIFY) and Postgres-peer leader election (session advisory locks) do not survive transaction pooling. They do survive session pooling, so the constraint is transaction mode specifically — the web tier could be session-pooled, it just must never be transaction-pooled.
- Supabase routes the web tier through Supavisor session mode for Supabase-platform reasons — the direct Postgres endpoint is IPv4-inaccessible and Supabase wants a managed connection front door. Neither applies in-cluster.
- CNPG’s -rw Service is already the native session endpoint with primary failover (it re-points on promotion). A session-mode pooler in front of it would add a PgBouncer hop, change nothing about the connection budget (session pooling holds ~one backend per client connection, so max_connections: 200 is unchanged), and duplicate what -rw already provides.
- The application’s connection shape is identical whether the web tier hits -rw directly or a session pooler (both session mode, prepare: :named), so there is no cutover-shape risk from this difference. The processor is different: pooled-vs-direct flips named↔unnamed prepares, which is exactly why the processor is pooled.
So: processor pooled (transaction) to preserve its shape across cutover; web tier on -rw directly as the CNPG-native equivalent of Supabase’s session pooler.
Validation
- mix test test/tuist_web/controllers/api/projects_controller_test.exs: 32/32 pass with the controller change; the full server test suite also passes in CI; mix format + mix credo clean.
- helm template across staging/canary/production renders cleanly. Canary and staging produce the transaction-mode Pooler CR and route the processor to -pooler-rw with TUIST_DATABASE_POOLED=1; production renders no Pooler (off until cutover). max_connections renders 100 (staging/canary) and 200 (production); the synchronous block is preserved.
- The pooler’s live behavior validates on canary/staging post-merge; CNPG manages the PgBouncer auth automatically, so enabling it is a flag flip with no separate bootstrap.
How to test locally
cd server
mix test test/tuist_web/controllers/api/projects_controller_test.exs
helm template tuist infra/helm/tuist -f infra/helm/tuist/values-managed-common.yaml -f infra/helm/tuist/values-managed-canary.yaml --set server.image.tag=t --set kuraController.image.tag=t --set runnersController.image.tag=t --set processor.image.tag=t --set xcresultProcessor.image.tag=t | grep -E "kind: Pooler|pooler-rw|max_connections"