fix(server): harden GitHub webhook dispatch + add deliveries inspector task

GitHub issue · Closed

Metadata
Source: tuist/tuist #10878
Updated: Jun 24, 2026
Domains: Compute
Details

Hardens the workflow_job webhook dispatch path against the failure mode that caused the 2026-05-19 incident, and adds a small operator tool to inspect (and redeliver) failed webhooks.

What broke on 2026-05-19

PostgreSQL SSL disconnects → request handlers hung on PG checkouts → HTTP body-read timeouts spiked 77x → backend connections reset to ingress → 3.78 eps of 502s for ~30 min. Some workflow_job.queued deliveries never reached the controller (visible only in the GitHub App’s delivery log); the ones that did were uniformly marked outcome=ignored because the synchronous K8s LIST inside match_pool stalled with everything else and returned {:error, :no_pools}. Net effect: jobs queued in GitHub stayed queued because Tuist never persisted them.

Three changes

  1. Async dispatch via Oban. The controller now enqueues a Tuist.Runners.Workers.DispatchWorker job onto the existing :webhooks queue and returns 200 as soon as the signature is verified. PG, K8s, and CH calls are off the request path. Job uniqueness keyed on delivery_guid collapses GitHub retries of the same payload into a single job.
  2. Cache the K8s LIST and account lookup. RunnerPool CRs are operator-managed and runner_max_concurrent is re-checked in PG at claim time, so a short cache is safe. Pools: 30s TTL, success-only. Accounts: 60s TTL, successful lookups only.
  3. Split the :ignored telemetry outcome. Each ignore branch carries its own atom (no_account / runners_disabled / no_matching_pool / no_pools / ambiguous_pool) and that atom flows through to tuist_runners_webhook_count_total. A no_pools spike will now be a distinct, alertable signal; the previous flat ignored outcome masked the apiserver outage as user-caused.
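
To illustrate changes 2 and 3 together, here is a minimal Python sketch (not the Elixir code in this PR) of a success-only TTL cache feeding a matcher that reports a distinct outcome per branch. The class and function names, the pool shape (`labels`), the use of `ConnectionError` for a failed LIST, and the `matched` outcome are all assumptions made for the example; only the outcome names `no_pools`, `no_matching_pool`, and `ambiguous_pool` come from the description above.

```python
import time
from collections import Counter
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class SuccessOnlyTTLCache(Generic[T]):
    """Caches the last successful fetch for ttl_seconds; failures are never stored."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._expires_at = float("-inf")

    def get(self, fetch: Callable[[], T]) -> T:
        now = self._clock()
        if now < self._expires_at:
            return self._value  # fresh hit: the upstream is not called
        value = fetch()  # may raise; in that case nothing is cached
        self._value = value
        self._expires_at = now + self._ttl
        return value


# Stand-in for a labelled counter such as tuist_runners_webhook_count_total.
outcomes: Counter = Counter()


def match_pool(cache: SuccessOnlyTTLCache, list_pools: Callable[[], list], label: str) -> str:
    """Returns one outcome per branch instead of a flat 'ignored'."""
    try:
        pools = cache.get(list_pools)
    except ConnectionError:
        # Assumption: a failed LIST surfaces as no_pools, as in the incident.
        outcome = "no_pools"
    else:
        matching = [p for p in pools if label in p["labels"]]
        if not matching:
            outcome = "no_matching_pool"
        elif len(matching) > 1:
            outcome = "ambiguous_pool"
        else:
            outcome = "matched"  # hypothetical success label
    outcomes[outcome] += 1
    return outcome
```

Because a failed fetch is never stored, a single apiserver blip cannot pin `no_pools` for the whole TTL, and the per-branch counter keeps an infrastructure failure separate from a genuine label mismatch.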

Mise task: runner:recent-deliveries

mise run runner:recent-deliveries -- --env production --since 2026-05-19T15:20:00Z pulls the App credentials from 1Password, mints an RS256 JWT, and queries GET /app/hook/deliveries. Filters by --since / --workflow-job-id; --redeliver <id> reissues a failed delivery without going to the GitHub UI. Used during the incident investigation; checked in so it’s reusable.
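
As a rough illustration of the --since filtering, the Python sketch below selects non-2xx deliveries at or after a cutoff. It assumes each delivery is a dict with the `delivered_at` (ISO 8601, trailing `Z`) and `status_code` fields that GET /app/hook/deliveries returns; the helper names and the "non-2xx means failed" rule are assumptions for the example, not the task's actual logic.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # Normalize a trailing "Z" so datetime.fromisoformat accepts it on Python < 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def failed_since(deliveries: list, since: str) -> list:
    """Keeps deliveries at or after `since` whose status code is not 2xx."""
    cutoff = parse_ts(since)
    return [
        d
        for d in deliveries
        if parse_ts(d["delivered_at"]) >= cutoff
        and not (200 <= d["status_code"] < 300)
    ]
```

The `id` of each returned delivery is presumably the value that --redeliver expects.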

How to test locally

Server changes:

cd server
mix test test/tuist/runners/dispatch_test.exs \
test/tuist/runners/workers/dispatch_worker_test.exs \
test/tuist_web/controllers/webhooks/github_controller_test.exs
mix credo lib/tuist/runners/dispatch.ex lib/tuist/runners/workers/dispatch_worker.ex lib/tuist_web/controllers/webhooks/github_controller.ex

Mise task (requires op signed into tuist.1password.com):

mise run runner:recent-deliveries -- --env production --since 2026-05-19T15:20:00Z --limit 500 \
| jq 'group_by({event, action, status_code}) | map({event: .[0].event, action: .[0].action, status_code: .[0].status_code, n: length}) | sort_by(-.n)'

Prints a histogram of webhook outcomes for the given window.
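
For anyone less familiar with jq, the same grouping can be sketched in Python. This assumes, as the jq filter implies, that the task prints a JSON array of objects with `event`, `action`, and `status_code` keys; the function name is made up for the example.

```python
from collections import Counter


def histogram(deliveries: list) -> list:
    """Counts deliveries per (event, action, status_code), largest group first."""
    counts = Counter(
        (d.get("event"), d.get("action"), d.get("status_code")) for d in deliveries
    )
    rows = [
        {"event": event, "action": action, "status_code": status_code, "n": n}
        for (event, action, status_code), n in counts.items()
    ]
    # Mirrors sort_by(-.n): descending by count.
    return sorted(rows, key=lambda row: -row["n"])
```

Feeding it the parsed output of the task (for example via `json.load(sys.stdin)`) should give the same rows as the jq filter, though the order of groups with equal counts may differ.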
