fix(server): harden GitHub webhook dispatch + add deliveries inspector task
GitHub issue · Closed
Hardens the workflow_job webhook dispatch path against the failure mode that caused the 2026-05-19 incident, and adds a small operator tool to inspect (and redeliver) failed webhooks.
What broke on 2026-05-19
PostgreSQL SSL disconnects → request handlers hung on PG checkouts → HTTP body-read timeouts spiked 77x → backend connections reset to ingress → 3.78 eps of 502s for ~30 min. Some workflow_job.queued deliveries never reached the controller (visible only in the GitHub App’s delivery log); the ones that did were uniformly marked outcome=ignored because the synchronous K8s LIST inside match_pool stalled with everything else and returned {:error, :no_pools}. Net effect: jobs queued in GitHub stayed queued because Tuist never persisted them.
Three changes
- Async dispatch via Oban. Controller now enqueues a Tuist.Runners.Workers.DispatchWorker onto the existing :webhooks queue and 200s the moment the signature is verified. PG + K8s + CH are off the request path. unique by delivery_guid collapses GitHub retries of the same payload.
- Cache the K8s LIST and account lookup. RunnerPool CRs are operator-managed and runner_max_concurrent is re-checked in PG at claim time, so a short cache is safe. Pools: 30s TTL, success-only. Accounts: 60s TTL, successful lookups only.
- Split the :ignored telemetry outcome. Each ignore branch carries its own atom (no_account / runners_disabled / no_matching_pool / no_pools / ambiguous_pool) and that atom flows through to tuist_runners_webhook_count_total. A no_pools spike will now be a distinct, alertable signal; the previous flat ignored masked the apiserver outage as user-caused.
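To illustrate how the success-only cache and the split outcomes interact, here is a minimal Python sketch. It is not the Elixir code in this PR: the class name, the `dispatch` helper, the label-subset matching rule, and the `dispatched` outcome are all assumptions made for the example. The point it shows is that a failed LIST is never cached and is counted under its own `no_pools` label.

```python
import time
from collections import Counter


class SuccessOnlyTTLCache:
    """Caches successful lookups for ttl_seconds; errors are never stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def fetch(self, key, loader):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return ("ok", hit[1])
        # loader is expected to return ("ok", value) or ("error", reason)
        result = loader()
        if result[0] == "ok":
            self._entries[key] = (now + self.ttl, result[1])
        return result


# Stand-in for the tuist_runners_webhook_count_total counter.
outcomes = Counter()


def dispatch(cache, list_pools, labels):
    """Hypothetical dispatch step: returns and counts one outcome label."""
    status, payload = cache.fetch("pools", list_pools)
    if status == "error":
        outcome = "no_pools"  # infrastructure failure, not user-caused
    else:
        # Assumed matching rule: a pool matches if it carries every label.
        matches = [p for p in payload if set(labels) <= set(p["labels"])]
        if not matches:
            outcome = "no_matching_pool"
        elif len(matches) > 1:
            outcome = "ambiguous_pool"
        else:
            outcome = "dispatched"  # hypothetical success label
    outcomes[outcome] += 1
    return outcome
```

Because errors bypass the cache, the first successful LIST after an apiserver outage takes effect immediately instead of waiting out a TTL.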
Mise task: runner:recent-deliveries
mise run runner:recent-deliveries -- --env production --since 2026-05-19T15:20:00Z pulls the App credentials from 1Password, mints an RS256 JWT, and queries GET /app/hook/deliveries. Filters by --since / --workflow-job-id; --redeliver <id> reissues a failed delivery without going to the GitHub UI. Used during the incident investigation; checked in so it’s reusable.
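For readers unfamiliar with GitHub App authentication, the following Python sketch shows the shape of the JWT and the two requests involved. It is an illustration, not the task's actual code: `app_jwt`, `deliveries_request`, and the caller-supplied `sign` function are hypothetical names, and RSA signing is left to the caller because the Python standard library has no RS256 implementation. The redelivery path used here is GitHub's `POST /app/hook/deliveries/{delivery_id}/attempts` endpoint.

```python
import base64
import json
import time


def b64url(data: bytes) -> str:
    """Base64url without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def app_jwt(app_id, sign, now=None):
    """Build a GitHub App JWT; `sign` maps the signing input to RS256 bytes."""
    now = int(time.time()) if now is None else int(now)
    header = {"alg": "RS256", "typ": "JWT"}
    claims = {
        "iat": now - 60,   # backdated to tolerate clock drift
        "exp": now + 540,  # app JWTs may be valid for at most 10 minutes
        "iss": str(app_id),
    }
    segments = [
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    ]
    signing_input = ".".join(segments)
    signature = sign(signing_input.encode("ascii"))
    return signing_input + "." + b64url(signature)


def deliveries_request(jwt, redeliver_id=None):
    """Return (method, url, headers) for listing or redelivering."""
    headers = {
        "Authorization": f"Bearer {jwt}",
        "Accept": "application/vnd.github+json",
    }
    base = "https://api.github.com/app/hook/deliveries"
    if redeliver_id is None:
        return ("GET", base, headers)
    return ("POST", f"{base}/{redeliver_id}/attempts", headers)
```

In practice `sign` would wrap an RSA library or an `openssl` call holding the App's private key; the sketch only fixes the claims and the request shape.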
How to test locally
Server changes:
cd server
mix test test/tuist/runners/dispatch_test.exs \
test/tuist/runners/workers/dispatch_worker_test.exs \
test/tuist_web/controllers/webhooks/github_controller_test.exs
mix credo lib/tuist/runners/dispatch.ex lib/tuist/runners/workers/dispatch_worker.ex lib/tuist_web/controllers/webhooks/github_controller.ex
Mise task (requires op signed into tuist.1password.com):
mise run runner:recent-deliveries -- --env production --since 2026-05-19T15:20:00Z --limit 500 \
| jq 'group_by({event, action, status_code}) | map({event: .[0].event, action: .[0].action, status_code: .[0].status_code, n: length}) | sort_by(-.n)'
Prints a histogram of webhook outcomes for the given window.
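For anyone who prefers not to read jq, the same grouping can be sketched in Python. This assumes, as the jq expression does, that the task emits a JSON array of objects with event, action, and status_code fields; the function name is hypothetical and the ordering of rows with equal counts may differ from jq's.

```python
from collections import Counter


def outcome_histogram(deliveries):
    """Count deliveries per (event, action, status_code), largest first."""
    counts = Counter(
        (d.get("event"), d.get("action"), d.get("status_code"))
        for d in deliveries
    )
    rows = [
        {"event": event, "action": action, "status_code": code, "n": n}
        for (event, action, code), n in counts.items()
    ]
    # Mirrors jq's sort_by(-.n): most frequent combination first.
    return sorted(rows, key=lambda row: -row["n"])
```

A large bucket of workflow_job / queued rows with a 502 status code would correspond to the deliveries that never reached the controller.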