Hive Hive
Sign in

fix(server): bound oban_jobs growth (root cause of prod DB saturation / stuck deploys)

GitHub issue · Closed

Metadata
Source
tuist/tuist #11336
Updated
Jun 24, 2026
Details

Draft. Covers the root-cause fix (Oban retention), the dashboard-signal corrections, and the query-latency observability that was missing during the incident. The one-time prune of the existing backlog has already been run (see “Operational remediation”).

Summary

The hosted server’s Postgres primary has been progressively overloaded since ~June 11, culminating today in a production deploy that stalled for over an hour. The root cause is the oban_jobs table growing to ~7.4M rows, which turned routine Oban queries into multi-million-row sequential scans and amplified write I/O on the table’s GIN indexes. This PR bounds that growth.

Symptom

A Server Production Deployment run sat with Tag fleet and runtime image releases and Publish releases stuck in queued on the tuist-linux self-hosted label, with no runner ever assigned. The whole fleet was affected: every tuist-linux / tuist-macos job across all repos queued, while GitHub-hosted ubuntu-latest jobs ran fine.

Investigation (how we got from “stuck deploy” to “oban_jobs”)

  1. The self-hosted runners are healthy and polling /api/internal/runners/dispatch, but the server had no work to hand them: its dispatch queue read 0 while GitHub showed many queued jobs.
  2. The workflow_job(queued) GitHub webhook is the only path that records a dispatchable job. The handler enqueues a DispatchWorker via Oban.insert (a Postgres write). When that write is slow or fails, the controller returns non-2xx (or the request times out), GitHub records a failed delivery, and no job is recorded.
  3. The only recovery, WebhookRedeliveryWorker, runs every 5 minutes with a 15-minute lookback, and only re-fires deliveries GitHub recorded as failed. A queued delivery that keeps failing past that window is orphaned permanently: the every-minute OrphanedRunnersWorker can’t help, because it only reconciles rows already in status='running', and a never-delivered queued created no row to reconcile. That is why the deploy’s jobs never dispatched — and, structurally, why even a single dropped queued delivery (not only the mass drops this incident caused) latches a run into a stuck state regardless of DB health. See follow-up 1.
  4. The webhook write was failing because the web app’s Ecto pool (about 15 connections per replica) was backing up behind slow queries (DBConnection ... dropped from queue after ~3.5s). This is an application-pool checkout timeout, not a Postgres max_connections rejection.
  5. The slow queries trace to oban_jobs.

Root cause

Oban.Plugins.Pruner was configured with max_age: 7 days. At the platform’s job-creation rate that retains ~7.4M rows (99.98% in completed state; AlertEvaluationWorker, enqueued per test-run ingestion, is ~65% of them). On that table:

  • Reads: ~3.1M sequential scans reading ~4.72 trillion rows, about 94% of all rows read in the database. As the table grows each scan reads more, hence the steady climb in read load.
  • Writes: ~22.5M updates with 0% HOT (Oban indexes state, so every state transition is a non-HOT update that rewrites all four indexes, including a 1.5 GB args GIN and a 738 MB state index). oban_jobs is ~94% of all row writes in the database, and the per-write I/O grows as the indexes bloat.

This single cause explains the read climb, the write/transaction-rate climb, the connection-pool exhaustion, and the dropped webhooks.

The Postgres connection ceiling is NOT the cause: the tuist cluster runs at ~44% of max_connections (89 / 200). The “94%” connection gauge in the CNPG dashboard aggregates all CNPG clusters in the cluster, which is misleading when filtered to one.

How the database and replicas work currently (for context)

  • The hosted server uses an in-cluster CloudNativePG cluster, tuist-tuist-pg, with three instances: tuist-tuist-pg-1/2/3.
  • pg-1 is currently the primary. pg-2 and pg-3 are streaming replicas with sync_state = quorum, i.e. replication is synchronous: every commit on the primary waits for a replica to flush before returning. Under the heavy oban_jobs write load this amplifies commit latency (and is why the operational prune below has to be paced against flush_lag).
  • The application is primary-only. There is a single Tuist.Repo with no read-replica configuration, so pg-2/pg-3 serve no application reads; they are HA standbys. (The processor role connects through the -pooler-rw pooler; the web role connects to the primary directly. Neither reads from a replica.)
  • Around June 14 ~11:00 the primary role moved from pg-3 to pg-1 (a failover; pg-3 was recreated as a replica at that time). Because the app only ever talks to whichever instance is primary, all load simply followed the role to pg-1. The failover was most likely triggered by the load, and it did not help, because promoting a new primary does not shrink oban_jobs. This is the step-change visible in the “pooler latency since June 14” and “activity moved off pg-3” observations.

What this PR changes

  1. Oban.Plugins.Pruner max_age 7 days to 2 hours (higher per-run limit), and validates alert cadence to Ns/Nm/Nh and ≤ 1h in the Alert changeset. Completed jobs only need to outlive the AutomationScheduler per-alert dedup window (it won’t re-enqueue an evaluation that ran within the alert’s cadence). That dedup keys off completed Oban rows, so the prune floor must exceed the max cadence; capping cadence at 1h with a 2h floor (2x margin) makes it safe and also rejects malformed cadences that previously fell back silently to 300s. All 3707 alerts already use the default 5m; this holds oban_jobs at tens of thousands of rows instead of millions.
  2. Drops the unused oban_jobs_meta_index GIN index (zero index scans in production), removing pure write amplification from the hottest table. Done CONCURRENTLY.
  3. Corrects the CNPG dashboard Connection Usage gauge: it summed cnpg_backends_total across every cluster and all three instances over a single instance’s max_connections, so with the cluster filter on “All” it read ~94% during this incident while the tuist cluster was actually ~44% of its 200 ceiling. Now computed per instance (and surfaced as the max), so the signal is correct regardless of selection. This is the misleading gauge that initially made the connection ceiling look like the cause.
  4. Scopes the CNPG Transaction Rate and Row Write Rate panels to the primary instance (instance_role="primary"). They summed rate() across all three instances, and the reset-prone replicas (pg-2 restarted, pg-3 recreated at the June-14 failover) made rate() spike over wide time ranges, so the panels read orders of magnitude high (~8M commits/s while the database’s measured rate was ~330/s). The same over-aggregation also affects the Database Size stat and Connections count (summed across the 3 instances), left as a follow-up to keep this scoped.
  5. Wires up per-query latency observability (behind postgresql.cnpg.queryStats.enabled, on in managed envs). pg_stat_statements was already preloaded “for per-query latency observability” but the extension was never created and nothing exported it, so the latency signal that would have made this incident obvious never existed. Creates the extension (automatically on fresh clusters via the Cluster CR’s bootstrap.initdb.postInitSQL; a one-time infra/cnpg/pg-stat-statements.sql for the already-bootstrapped envs) and adds a custom CNPG monitoring query exporting cnpg_tuist_query_stats_* (calls / total_exec_time_ms / rows, bucketed read/write/other, 3 series).
  6. Adds Query Latency (read/write) and Query Throughput (read/write) dashboard panels off those metrics, primary-scoped.
  7. Decouples the Slack report window from oban_jobs. ReportWorker derived “time since last report” from the last completed ReportWorker row; under the new retention that row is pruned (reports fire days apart), collapsing the window to a fixed 24h. Now a nullable projects.last_reported_at is stamped on each successful send and read for the next window — surviving pruning, and only a successful send advances it (preserving the prior discarded-doesn’t-advance behavior). Flagged in review.

Extension provisioning for #5

shared_preload_libraries already includes pg_stat_statements, so no Postgres restart is needed. Fresh/rebuilt clusters create the extension automatically via bootstrap.initdb.postInitSQL (gated on queryStats.enabled). The already-bootstrapped envs (prod/staging/canary) need a one-time run of infra/cnpg/pg-stat-statements.sql against postgres (per infra/cnpg/README.md); until then the new panels show “No data” (harmless), and the extension persists afterward (including across physical restores). Fully declarative reconciliation on existing clusters would require a CNPG operator upgrade to >= 1.26 (Database.spec.extensions); this operator’s Database CRD lacks that field. Verify the exact metric name (cnpg_tuist_query_stats_*) once the exporter picks up the query and adjust the panel exprs if CNPG’s prefix differs.

Recommended follow-ups (not in this PR)

  1. Close the dropped-webhook recovery gap, so a missed queued delivery can’t permanently orphan a run. This PR fixes the trigger that dropped webhooks here (primary pressure from oban_jobs), but the recovery path that’s supposed to catch a dropped delivery has a structural hole, and it’s worth closing independently — any future primary-pressure event, or even a one-off GitHub-side delivery miss, can still strand a run. Today there are two recovery paths and a queued workflow job can fall between them:
    • WebhookRedeliveryWorker (every 5 min, 15-min lookback, @lookback_minutes) only re-fires deliveries GitHub recorded as failed, and only if recovery lands within 15 min of the failure. A delivery that keeps failing past that window (e.g. against a degraded primary) ages out and is never retried.
    • OrphanedRunnersWorker (every min) only reconciles rows already in status='running'. A queued delivery that never landed created no runner_jobs row, so there is nothing to reconcile. This makes orphaning a latch: a single permanently-failed queued delivery strands that run forever, independent of DB health. The durable fix is a level-triggered reconciler for the queued edge, mirroring OrphanedRunnersWorker on the inbound side: periodically ask GitHub for workflow jobs that are actually queued against our self-hosted labels (scoped to installations with recent runner activity), diff against CH runner_jobs, and record_queued + enqueue dispatch for any GitHub-queued job we have no dispatchable row for. Being level-triggered, it is bounded by neither the 15-min window nor whether GitHub logged a failed delivery. Cheaper stopgaps in the meantime: widen @lookback_minutes (GUID dedup already prevents double-redelivery, so this is low-risk but still time-bounded); make the workflow_job webhook endpoint resilient under primary pressure so queued deliveries don’t return non-2xx in the first place (decouple the DispatchWorker enqueue from the request path / fast-ack, then record durably async); and alert on tuist_runners_recovery_count{kind="redelivered"} > 0 and on any reconciler-detected gap, so a silent miss pages instead of waiting for a customer bug report.
  2. Lifting the 1h cadence cap (now enforced in the changeset) requires decoupling AutomationScheduler’s dedup from completed Oban rows onto a cursor on the alert (mirroring the existing last_scoped_evaluation_inserted_at used by the ingestion path); then the prune floor no longer bounds cadence. Low urgency (all alerts are 5m).
  3. Reduce AlertEvaluationWorker enqueue volume (the largest single contributor to the churn).
  4. Physically reclaim the bloated args GIN and heap. The operational prune plus plain VACUUM frees space for reuse but does not shrink the files, so the write-amplification relief is partial until a REINDEX CONCURRENTLY (or pg_repack).
  5. device_codes is at 100% dead tuples and has never been autovacuumed; investigate separately.

Operational remediation (done)

The existing backlog was pruned via break-glass: a batched, lag-gated delete of completed/cancelled/discarded jobs older than 1 hour, paced against replica flush_lag so it never degraded live commit latency on the synchronous cluster. Result: ~6.39M rows removed, oban_jobs down from ~7.45M to ~40k live rows. Autovacuum reclaimed the dead tuples and refreshed planner stats concurrently, so read amplification fell from a lifetime ~240:1 to ~19:1 (rows returned vs fetched) and the queries stopped doing million-row scans. The physical files (~5.9 GB, incl. the 1.5 GB GIN) are reclaimed for reuse but not shrunk; a REINDEX CONCURRENTLY / pg_repack to return the disk is a follow-up (no longer a perf concern, since the live set is now ~40k rows). Pruning completed jobs re-fires long-cadence AutomationScheduler evaluations once (bounded, self-healing).

Runs already orphaned during the incident do not self-recover (their queued webhooks are past the 15-minute redelivery window, and nothing reconciles a never-delivered queued — see follow-up 1); they need a fresh push / re-run.

Validation

  • Prune verified directly against the primary: oban_jobs 7.45M → ~40k live rows; measured commit rate ~330/s and read amplification ~19:1 afterward (confirming the dashboard’s inflated ~8M “transaction rate” was a metrics artifact, not real load).
  • mix format clean on the Elixir changes.
  • Dashboard JSON validated (round-trip byte-stable; panels parse; non-overlapping gridPos).
  • Query-latency wiring rendered with helm template against values-managed-production.yaml; the rendered ConfigMap’s embedded query YAML parses and the cluster’s customQueriesConfigMap resolves to the same ConfigMap name.
  • The live cnpg_tuist_query_stats_* metric cannot be exercised until the extension is created and the chart deploys (see the one-time step above).
  • Cadence validation exercised via a no-DB changeset eval: 5m/30s/60m/1h valid; 2h/90m rejected (>1h); abc/5/5d rejected (malformed); default valid. The drop-meta-index migration compiles. The full ExUnit suite could not run locally due to a pre-existing from-scratch ecto.migrate failure (an earlier migration’s users does not exist, unrelated to this change — these commits are the newest migration and run last); CI exercises it.
Comments
E
esnunes Jun 17, 2026

File: server/lib/tuist/slack/workers/report_worker.ex:155 Issue: get_last_report_time/1 silently breaks under 2h retention. It reads the last completed ReportWorker job from oban_jobs to compute the Slack report window. Reports fire >= 24h apart and report_days_of_week allows weekly cadences, so the prior completed job is always pruned at 2h -> returns nil -> Reports.report falls back to a fixed 24h window.

F
fortmarek Jun 17, 2026

get_last_report_time/1 silently breaks under 2h retention

Fixed in 32da293d0de. Decoupled the report window from oban_jobs: added a nullable projects.last_reported_at, stamped by ReportWorker on each successful send and read for the next window, so it survives the 2h prune instead of collapsing to a fixed 24h. Only a successful send advances it, preserving the prior “a discarded/failed attempt does not move the window” behavior. Verified locally (report_worker + alert suites, 56 tests green).