fix(server): stop canary 500s from web-pool/Oban contention on create_project

GitHub issue · Closed
Source: tuist/tuist #11107
Updated: Jun 24, 2026

What & why

Canary was returning HTTP 500s on DB-backed writes (POST /api/projects, i.e. tuist project create): ~999 DBConnection.ConnectionError events in 6h on the can Sentry env. The errors escalated after the Supabase to CNPG cutover (#11063). Production was unaffected.

Root cause

The 500s are app-side Tuist.Repo pool exhaustion, not a CNPG max_connections limit. Live evidence on canary:

  • CNPG max_connections = 100, only ~25/100 backends in use (15 tuist_app + 10 tuist_processor), zero “too many clients” / “remaining connection slots” in 24h of Postgres logs. Postgres never refused a connection; sync replication healthy.
  • The canary web tier is a single replica whose 15-connection pool is shared between request handling and the in-node Oban default queue (concurrency 10, base_queues in server/config/runtime.exs).
  • AutomationScheduler fans out one AlertEvaluationWorker per enabled alert every ~1 min. Canary has ~2,555 enabled alerts (every project gets a default “Flaky test detection” alert via seed_default_alert, cadence 5m), so the default queue carries a continuous ~8.5 jobs/s (2,555 alerts / 300 s). Under load the queue’s 10 slots consume most of the 15-connection pool, starving web checkouts, which time out after ~3.8s and surface as 500s.
  • POST /api/projects turned those timeouts into 500s because the controller only rescued Ecto.InvalidChangesetError, and the non-transactional create_project! (Repo.insert! then a hard {:ok, _} = seed_default_alert(...) match) could also leave an orphaned project behind.
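
The figures in the list above can be reproduced with a few lines of arithmetic. This is a minimal Python sketch, not code from the repository; it assumes the worst case in which every busy Oban slot holds one pool connection, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the figures quoted in the root-cause list.
ENABLED_ALERTS = 2555            # enabled alerts on canary (from the description)
ALERT_CADENCE_S = 5 * 60         # default alert cadence: 5 minutes
POOL_SIZE = 15                   # Tuist.Repo pool on the single web replica
OBAN_DEFAULT_CONCURRENCY = 10    # default queue slots sharing that pool


def steady_state_jobs_per_second(alerts: int, cadence_s: int) -> float:
    """Each alert is evaluated once per cadence, so load is alerts / cadence."""
    return alerts / cadence_s


def web_headroom(pool_size: int, queue_concurrency: int) -> int:
    """Connections left for requests if every queue slot holds one connection.

    Assumption: one connection per busy slot (worst case).
    """
    return pool_size - queue_concurrency


jobs_per_s = steady_state_jobs_per_second(ENABLED_ALERTS, ALERT_CADENCE_S)
print(f"{jobs_per_s:.1f} jobs/s")                          # 8.5 jobs/s
print(web_headroom(POOL_SIZE, OBAN_DEFAULT_CONCURRENCY))   # 5
print(web_headroom(30, OBAN_DEFAULT_CONCURRENCY))          # 20
```

With a pool of 15, a saturated default queue leaves only 5 connections for request handling; at 30 the same queue leaves 20.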

Why canary-only (and why not latency): in-cluster CNPG is actually faster per-query than external Supabase (~0.2ms SELECT). Production is unaffected because it stays on Supabase and runs 2-5 web replicas, spreading the same per-alert load across independent pools. Canary’s single replica absorbs the whole alert workload on one 15-pool.

Changes

Canary 500s fix

  1. fix(infra) — canary TUIST_DATABASE_POOL_SIZE 15 → 30. The real constraint is the app pool vs its Oban consumers, not max_connections, so size the pool above the default queue’s concurrency (10) to keep web headroom. 30 web + 10 processor + overhead stays well under 100.
  2. fix(server) — switch POST /api/projects to the transactional Projects.create_project/2 (Ecto.Multi: project insert + default-alert seed atomic). {:error, changeset} maps to 400 with the same message as before (API contract preserved); infra errors now surface as 5xx without orphaning a project.
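
The difference between the old and new create path is atomicity plus error mapping. The sketch below illustrates that idea in Python with an in-memory SQLite database; it is not the Elixir implementation, and the table layout, the `seed_fails` switch, and the returned status labels are hypothetical stand-ins (the success status in particular is not stated in the description).

```python
import sqlite3

DEFAULT_ALERT = "Flaky test detection"


def open_db() -> sqlite3.Connection:
    """Create a throwaway schema (hypothetical, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE alerts (id INTEGER PRIMARY KEY, "
        "project_id INTEGER NOT NULL, name TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def create_project(conn: sqlite3.Connection, name: str, seed_fails: bool = False) -> str:
    """Insert a project and its default alert in one transaction."""
    try:
        with conn:  # commits on success, rolls back if the body raises
            cur = conn.execute("INSERT INTO projects (name) VALUES (?)", (name,))
            if seed_fails:
                # Stand-in for an infrastructure error such as a pool timeout.
                raise ConnectionError("simulated failure while seeding the alert")
            conn.execute(
                "INSERT INTO alerts (project_id, name) VALUES (?, ?)",
                (cur.lastrowid, DEFAULT_ALERT),
            )
        return "ok"
    except sqlite3.IntegrityError:
        return "400"   # validation-style failure: client error
    except ConnectionError:
        return "5xx"   # infrastructure failure: server error, nothing persisted


def count(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = open_db()
print(create_project(conn, "app"))                     # ok
print(create_project(conn, "app"))                     # 400 (duplicate name)
print(create_project(conn, "other", seed_fails=True))  # 5xx
print(count(conn, "projects"), count(conn, "alerts"))  # 1 1
```

The failed third call leaves no orphaned project row, which is the property the non-transactional path could not guarantee.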

CNPG production-cutover hardening

The investigation surfaced two forward-looking issues for when production cuts over to CNPG, so this PR also hardens the chart + runbook:

  1. fix(infra) — explicit max_connections. CNPG ran at the operator default of 100. That is the inverse of canary’s situation: production’s web tier autoscales to 5 replicas (5 × pool 15 = 75, plus processor, migration, and replication/monitoring/superuser overhead), which at peak/deploy-surge would exceed 100 and cause the Postgres-side too many clients refusals canary never hit. Set it explicitly (default 100, production 200), and document the budget formula in infra/cnpg/MIGRATION.md, including why a multi-replica env must not copy canary’s per-replica pool of 30 (5 × 30 = 150 > 100).
  2. fix(infra) — CNPG Pooler (PgBouncer, transaction mode) for the processor. This adds a transaction-mode pooler that fronts the cluster for the processor, so its prepare: :unnamed connection shape stays constant across the cutover (it matches the processor’s existing Supabase Supavisor :6543 path). It is gated on postgresql.cnpg.pooler.enabled: enabled on canary and staging (which already run on CNPG) so the production-post-cutover topology soaks there first, and off on production until its own cutover. Activation is just the flag: with no custom certificate secrets, CNPG’s built-in PgBouncer integration manages auth itself (it creates the cnpg_pooler_pgbouncer role and the user_search function in the postgres database on reconcile and issues the TLS cert), so there is no SQL bootstrap. The one caveat: don’t add custom cert secrets, or the built-in integration is disabled and you own auth.
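
The connection budget behind item 1 can be sketched as follows. This is an illustrative Python calculation, not the formula text from infra/cnpg/MIGRATION.md; the overhead figure of 10 and the single extra replica during a rolling deploy are assumptions chosen for the example.

```python
def connection_budget(web_replicas: int, web_pool: int,
                      processor_pool: int, overhead: int) -> int:
    """Worst-case backends: every pool fully checked out, plus fixed overhead."""
    return web_replicas * web_pool + processor_pool + overhead


def fits(budget: int, max_connections: int) -> bool:
    return budget <= max_connections


PROCESSOR_POOL = 10   # tuist_processor backends observed on canary
OVERHEAD = 10         # ASSUMPTION: migration, replication, monitoring, superuser

# Canary after this PR: one replica with a pool of 30.
canary = connection_budget(1, 30, PROCESSOR_POOL, OVERHEAD)

# Production at full autoscale, and with one extra pod during a rolling
# deploy (the surge size is an assumption).
prod_peak = connection_budget(5, 15, PROCESSOR_POOL, OVERHEAD)
prod_surge = connection_budget(6, 15, PROCESSOR_POOL, OVERHEAD)

# Copying canary's pool of 30 to five replicas, web tier alone.
copied_pool = connection_budget(5, 30, 0, 0)

print(canary, fits(canary, 100))            # 50 True
print(prod_peak, fits(prod_peak, 100))      # 95 True
print(prod_surge, fits(prod_surge, 100))    # 110 False
print(prod_surge, fits(prod_surge, 200))    # 110 True
print(copied_pool, fits(copied_pool, 100))  # 150 False
```

Under these assumptions production sits close to the default limit at steady peak and crosses it during a deploy surge, while a limit of 200 leaves room for both.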

How the topology differs from Supabase (and why)

On Supabase, both tiers go through Supavisor: the processor in transaction mode (:6543, prepare: :unnamed) and the web tier in session mode (*.pooler.supabase.com:5432, prepare: :named). This PR mirrors only the processor’s pooler and connects the web tier directly to the CNPG -rw Service instead of standing up a session-mode pooler in front of it.

That is a deliberate difference, not an oversight:

  • The web tier runs Oban, whose PG notifier (LISTEN/NOTIFY) and Postgres-peer leader election (session advisory locks) do not survive transaction pooling. They do survive session pooling, so the constraint is transaction mode specifically — the web tier could be session-pooled, it just must never be transaction-pooled.
  • Supabase routes the web tier through Supavisor session mode for Supabase-platform reasons — the direct Postgres endpoint is IPv4-inaccessible and Supabase wants a managed connection front door. Neither applies in-cluster.
  • CNPG’s -rw Service is already the native session endpoint with primary failover (it re-points on promotion). A session-mode pooler in front of it would add a PgBouncer hop, change nothing about the connection budget (session pooling holds ~one backend per client connection, so max_connections: 200 is unchanged), and duplicate what -rw already provides.
  • The application’s connection shape is identical whether the web tier hits -rw directly or a session pooler (both session mode, prepare: :named), so there is no cutover-shape risk from this difference — unlike the processor, where pooled-vs-direct flips named↔unnamed prepares, which is exactly why the processor is pooled.

So: processor pooled (transaction) to preserve its shape across cutover; web tier on -rw directly as the CNPG-native equivalent of Supabase’s session pooler.
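
The rule behind this split can be written down as a small compatibility check. The Python sketch below is a model of the reasoning above, not project code; the `Mode` enum, the tier table, and the assumption that the processor needs no session-level state (implied by it already running in transaction mode) are illustrative.

```python
from enum import Enum


class Mode(Enum):
    DIRECT = "direct"            # straight to the -rw Service
    SESSION = "session"          # session-mode pooler
    TRANSACTION = "transaction"  # transaction-mode pooler


def prepare_mode(mode: Mode) -> str:
    """Transaction pooling requires unnamed prepares; the others keep named."""
    return "unnamed" if mode is Mode.TRANSACTION else "named"


def allowed_modes(needs_session_state: bool) -> set:
    """LISTEN/NOTIFY and session advisory locks rule out transaction pooling."""
    if needs_session_state:
        return {Mode.DIRECT, Mode.SESSION}
    return {Mode.DIRECT, Mode.SESSION, Mode.TRANSACTION}


# (needs session state?, mode on Supabase, mode on CNPG after this PR)
TIERS = {
    "web": (True, Mode.SESSION, Mode.DIRECT),
    "processor": (False, Mode.TRANSACTION, Mode.TRANSACTION),
}

for tier, (needs_state, before, after) in TIERS.items():
    assert after in allowed_modes(needs_state)
    shape_changed = prepare_mode(before) != prepare_mode(after)
    print(tier, prepare_mode(before), "->", prepare_mode(after),
          "shape changed" if shape_changed else "shape preserved")
# web named -> named shape preserved
# processor unnamed -> unnamed shape preserved
```

Connecting the processor directly instead would flip its prepares from unnamed to named, which is the cutover-shape change the pooler avoids.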

Validation

  • mix test test/tuist_web/controllers/api/projects_controller_test.exs — 32/32 pass with the controller change; the full server test suite also passes in CI; mix format + mix credo clean.
  • helm template across staging/canary/production renders cleanly. Canary and staging produce the transaction-mode Pooler CR and route the processor to -pooler-rw with TUIST_DATABASE_POOLED=1; production renders no Pooler (off until cutover). max_connections renders 100 (staging/canary) and 200 (production); the synchronous block is preserved.
  • The pooler’s live behavior will be validated on canary/staging post-merge; CNPG manages the PgBouncer auth automatically, so enabling it is a flag flip with no separate bootstrap.
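
The render checks above could also be scripted instead of read by eye. The following Python sketch scans helm template output for the same three markers the grep in the local test looks for; the sample manifest is a hypothetical, heavily abbreviated fragment, and the exact rendering of max_connections in the real chart is an assumption.

```python
import re


def summarize_render(manifest: str) -> dict:
    """Extract the three facts checked during validation from rendered YAML."""
    max_conn = re.search(r'max_connections:\s*"?(\d+)"?', manifest)
    return {
        "has_pooler": bool(re.search(r"^kind:\s*Pooler\s*$", manifest, re.MULTILINE)),
        "routes_to_pooler": "-pooler-rw" in manifest,
        "max_connections": int(max_conn.group(1)) if max_conn else None,
    }


# HYPOTHETICAL sample, not real chart output.
CANARY_SAMPLE = """\
kind: Cluster
spec:
  postgresql:
    parameters:
      max_connections: "100"
---
kind: Pooler
spec:
  pgbouncer:
    poolMode: transaction
---
env:
  - name: DATABASE_HOST
    value: example-pooler-rw
"""

PRODUCTION_SAMPLE = """\
kind: Cluster
spec:
  postgresql:
    parameters:
      max_connections: "200"
"""

print(summarize_render(CANARY_SAMPLE))
print(summarize_render(PRODUCTION_SAMPLE))
```

In practice the input would be the captured output of helm template for each environment's values files.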

How to test locally

# Run from the repository root.
(cd server && mix test test/tuist_web/controllers/api/projects_controller_test.exs)
helm template tuist infra/helm/tuist \
  -f infra/helm/tuist/values-managed-common.yaml \
  -f infra/helm/tuist/values-managed-canary.yaml \
  --set server.image.tag=t \
  --set kuraController.image.tag=t \
  --set runnersController.image.tag=t \
  --set processor.image.tag=t \
  --set xcresultProcessor.image.tag=t \
  | grep -E "kind: Pooler|pooler-rw|max_connections"