feat(server): recover failed webhook deliveries via GitHub’s redelivery API
Summary
Adds Tuist.Runners.Workers.WebhookRedeliveryWorker, an Oban cron that asks GitHub to redeliver any workflow_job webhook deliveries that failed within a recent window. Closes a P0 correctness gap in the runners architecture where a never-delivered webhook leaves the customer’s workflow_job stuck in GitHub’s queue indefinitely.
This PR went through one design pivot before landing — see the PR commits for the evolution. The shipped design uses GitHub’s App-level webhook delivery log + redelivery endpoints; the earlier design enumerated workflow_runs/jobs per-repo and didn’t scale to large installations.
Why
The workflow_job: queued webhook is currently the only path that inserts a row into ClickHouse runner_jobs. Failure modes that bypass it:
- GitHub provider outage exhausting webhook retries
- Our endpoint returning 5xx during a deploy window
- Signature verification failure
- Replay-after-deploy where GitHub stopped retrying before our endpoint came back
The existing OrphanedRunnersWorker only reconciles rows that already exist in ClickHouse with status='running'. The never-delivered case has no recovery today: the customer’s workflow_job sits in GitHub’s queue indefinitely with no row, no metric, and no alert on our side.
What changed
Mirrors GitHub’s documented pattern:
- GET /app/hook/deliveries?status=failure (App-wide, paginated) across a 15-min lookback window.
- Filter to event="workflow_job".
- Group attempts by guid (constant across the original delivery and any redelivery attempts of the same logical event).
- For each GUID whose attempts include no successful one, POST /app/hook/deliveries/{id}/attempts on the most recent attempt. GitHub re-fires the event through our normal webhook URL.
The redelivery hits the same handle_workflow_job path as a fresh delivery, going through DispatchWorker → Dispatch.handle_webhook → Jobs.enqueue. No separate recovery codepath, no payload reconstruction, no risk of recovery-side enqueue diverging from webhook-side enqueue.
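The filter, group, and select steps above can be sketched as a pure function. This is a minimal illustration, not the worker's actual code: the module and function names are hypothetical, the maps carry only the fields used here (id, guid, event, status, delivered_at), delivered_at is assumed to be already parsed into a DateTime, and the failure status string is sample data.

```elixir
defmodule RedeliverySelectionSketch do
  # Hypothetical helper: returns the delivery ids that should be redelivered.
  def select_redelivery_ids(deliveries) do
    deliveries
    # Only workflow_job events are in scope.
    |> Enum.filter(&(&1.event == "workflow_job"))
    # All attempts for one logical event share a guid.
    |> Enum.group_by(& &1.guid)
    # Skip any guid that already has a successful attempt.
    |> Enum.reject(fn {_guid, attempts} ->
      Enum.any?(attempts, &(&1.status == "OK"))
    end)
    # Redeliver from the newest attempt of each remaining guid.
    |> Enum.map(fn {_guid, attempts} ->
      newest = Enum.max_by(attempts, & &1.delivered_at, DateTime)
      newest.id
    end)
    |> Enum.sort()
  end
end

failed = "Invalid HTTP Response: 503"

sample = [
  %{id: 1, guid: "a", event: "workflow_job", status: failed, delivered_at: ~U[2024-01-01 10:00:00Z]},
  %{id: 2, guid: "a", event: "workflow_job", status: "OK", delivered_at: ~U[2024-01-01 10:05:00Z]},
  %{id: 3, guid: "b", event: "workflow_job", status: failed, delivered_at: ~U[2024-01-01 10:01:00Z]},
  %{id: 4, guid: "b", event: "workflow_job", status: failed, delivered_at: ~U[2024-01-01 10:06:00Z]},
  %{id: 5, guid: "c", event: "push", status: failed, delivered_at: ~U[2024-01-01 10:02:00Z]}
]

# guid "a" already succeeded, guid "c" is not a workflow_job event,
# so only the newest attempt of guid "b" is selected.
IO.inspect(RedeliverySelectionSketch.select_redelivery_ids(sample))
# => [4]
```

Keeping the selection free of HTTP calls makes the dedup rule easy to unit-test separately from the GitHub client.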
Why this shape vs the alternatives
- Per-repo enumeration (the design we explored first): O(repos × installations) API calls per cycle. A customer with hundreds of repos exhausts the 5000 req/hr installation rate-limit fast. Also requires walking ?status=queued AND ?status=in_progress to catch matrix/needs:-downstream jobs, and paginating jobs to handle matrix expansion. Lots of code, lots of edge cases.
- Long-poll listener (the ARC v2 model): Uses GitHub-internal /_apis/distributedtask/... endpoints. Not a public-stable surface; would also require per-tenant persistent processes. Doesn’t fit our shape.
- Webhook delivery log + redelivery: O(failures) per cycle App-wide, independent of repo count or installation count. Single codepath. GitHub-blessed pattern.
Design decisions
- Cadence: 5 min. Diverges from GitHub’s documented 6h example (which assumes the script runs as a GitHub Actions workflow). For a runners product, customer-visible queue stalls past minutes are unacceptable. 5 min with a 15-min lookback gives 3x overlap on persistent failures.
- GUID-based dedup, not workflow_job_id. The list response doesn’t include workflow_job_id (would require a GET .../{id} per delivery), but it does include guid. GitHub’s pattern is to check Enum.any?(attempts, & &1.status == "OK") per GUID: a successful redelivery from the previous cycle shows up as a new delivery with the same GUID and status="OK", so subsequent cycles naturally skip it.
- Workflow_job event filter. The App’s delivery log also includes push, pull_request, etc. The runners worker only redelivers workflow_job events; other event recovery is out of scope.
- Most-recent-attempt redelivery. When multiple failed attempts share a GUID, redeliver from the newest. GitHub preserves the request metadata of that specific attempt.
- Transient failures log + continue. Both the list call and the per-redelivery call can return errors (rate-limit, 5xx, 422). Worker logs and continues; next 5-min cycle retries naturally.
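The log-and-continue behaviour in the last bullet could look roughly like the sketch below. The module name, the function name, the injected redeliver_fun (a stand-in for the HTTP call to POST /app/hook/deliveries/{id}/attempts), and the shape of the error tuple are all assumptions made for illustration; the real worker may structure this differently.

```elixir
defmodule RedeliverLoopSketch do
  require Logger

  # Hypothetical loop: tries every id, never aborts on a single failure,
  # and returns counts that could feed a telemetry counter.
  def redeliver_all(ids, redeliver_fun) do
    Enum.reduce(ids, %{redelivered: 0, failed: 0}, fn id, acc ->
      case redeliver_fun.(id) do
        :ok ->
          Map.update!(acc, :redelivered, &(&1 + 1))

        {:error, reason} ->
          # Rate limits, 5xx and 422 are all treated the same way here:
          # log, move on, and let the next cycle pick the delivery up again.
          Logger.warning("redelivery failed for delivery #{id}: #{inspect(reason)}")
          Map.update!(acc, :failed, &(&1 + 1))
      end
    end)
  end
end

# Stubbed transport: delivery 2 fails with a 422, the others succeed.
stub = fn
  2 -> {:error, {:http_status, 422}}
  _ -> :ok
end

IO.inspect(RedeliverLoopSketch.redeliver_all([1, 2, 3], stub))
# => %{failed: 1, redelivered: 2}
```

Because a failed GUID still has no successful attempt, it stays eligible in the next cycle for as long as it remains inside the lookback window, so no explicit retry state is needed.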
Test plan
10 tests in test/tuist/runners/workers/webhook_redelivery_worker_test.exs:
- No-op when no failed deliveries
- Redelivers a failed workflow_job
- GUID dedup: skip when any attempt for a GUID has status="OK"
- Skips non-workflow_job events (push / PR / etc.)
- Redelivers the most-recent attempt when multiple failed attempts share a GUID
- Paginates through delivery pages
- Stops paginating once a page contains pre-threshold entries (cost cap)
- Skips on transient list-deliveries failure
- Skips on 422 from redelivery endpoint
- Emits tuist_runners_recovery_count{kind="redelivered"} telemetry
Plus existing runners suite (test/tuist/runners/, 77 tests) — green.
mix credo clean. mix format applied.
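For reviewers, the pagination cost cap and the transient list failure covered by the tests above might be sketched as follows. This is an illustration under stated assumptions, not the shipped code: fetch_page is a hypothetical function from cursor to {:ok, entries, next_cursor} or {:error, reason}, entries are assumed to arrive newest-first, and delivered_at is assumed to be a DateTime.

```elixir
defmodule DeliveryPaginationSketch do
  # Collects entries at or after `threshold`, and stops fetching as soon as a
  # page contains anything older (or there is no next cursor).
  def collect_recent(fetch_page, threshold, cursor \\ nil, acc \\ []) do
    case fetch_page.(cursor) do
      {:ok, entries, next_cursor} ->
        {recent, older} =
          Enum.split_with(entries, fn entry ->
            DateTime.compare(entry.delivered_at, threshold) != :lt
          end)

        acc = acc ++ recent

        if older != [] or is_nil(next_cursor) do
          {:ok, acc}
        else
          collect_recent(fetch_page, threshold, next_cursor, acc)
        end

      {:error, reason} ->
        # Transient list failure: give up for this cycle.
        {:error, reason}
    end
  end
end

# Stubbed pages; the third page must never be requested.
pages = fn
  nil ->
    {:ok,
     [
       %{id: 1, delivered_at: ~U[2024-01-01 10:10:00Z]},
       %{id: 2, delivered_at: ~U[2024-01-01 10:05:00Z]}
     ], "c2"}

  "c2" ->
    {:ok,
     [
       %{id: 3, delivered_at: ~U[2024-01-01 10:00:00Z]},
       %{id: 4, delivered_at: ~U[2024-01-01 09:30:00Z]}
     ], "c3"}

  "c3" ->
    raise "page beyond the threshold should not be fetched"
end

{:ok, entries} = DeliveryPaginationSketch.collect_recent(pages, ~U[2024-01-01 09:45:00Z])
IO.inspect(Enum.map(entries, & &1.id))
# => [1, 2, 3]
```

Stopping at the first page that crosses the threshold keeps the per-cycle API cost proportional to recent failures instead of the full delivery log.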
What’s not in scope here
- Other event types (e.g. missed installation.created). Could be added by relaxing the event filter, but each event has different downstream semantics worth case-by-case review.
- workflow_job.completed recovery via local CH update. OrphanedRunnersWorker already handles this for rows in status='running'. A completed redelivery would also flow through and hit the same mark_completed path; belt-and-braces but not strictly necessary.
- Retention window beyond GitHub’s default. GitHub keeps deliveries for ~30 days (per their docs UX, not formally guaranteed in the REST docs). 15-min lookback is well inside that.