feat(server): recover failed webhook deliveries via GitHub’s redelivery API

GitHub issue · Closed

Metadata

Source: tuist/tuist #10909
Updated: Jun 24, 2026
Domains: Compute

Details

Summary

Adds Tuist.Runners.Workers.WebhookRedeliveryWorker, an Oban cron worker that asks GitHub to redeliver any workflow_job webhook deliveries that failed within a recent window. This closes a P0 correctness gap in the runners architecture, where a never-delivered webhook leaves the customer’s workflow_job stuck in GitHub’s queue indefinitely.

This PR went through one design pivot before landing; see the PR commits for the evolution. The shipped design uses GitHub’s App-level webhook delivery log and redelivery endpoints. The earlier design enumerated workflow_runs and jobs per repo and didn’t scale to large installations.

Why

The workflow_job: queued webhook is currently the only path that inserts a row into ClickHouse runner_jobs. Failure modes that bypass it:

GitHub provider outage exhausting webhook retries
Our endpoint returning 5xx during a deploy window
Signature verification failure
Replay-after-deploy where GitHub stopped retrying before our endpoint came back

The existing OrphanedRunnersWorker only reconciles rows that already exist in ClickHouse with status='running'. The never-delivered case has no recovery today: the customer’s workflow_job sits in GitHub’s queue indefinitely, with no row, no metric, and no alert on our side.

What changed

Mirrors GitHub’s documented pattern:

1. GET /app/hook/deliveries?status=failure (App-wide, paginated) across a 15-min lookback window.
2. Filter to event="workflow_job".
3. Group attempts by guid (constant across the original delivery and any redelivery attempts of the same logical event).
4. For each GUID whose attempts include no successful one, POST /app/hook/deliveries/{id}/attempts on the most recent attempt. GitHub then re-fires the event through our normal webhook URL.
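
The selection logic in the steps above can be sketched as a pure function. This is a minimal illustration in Python, not the Elixir worker itself: select_redeliveries and parse_ts are hypothetical names, the input is assumed to be already limited to the lookback window, and the sample status strings are made up. The field names (id, guid, event, status, delivered_at) follow GitHub’s delivery list response.

```python
from collections import defaultdict
from datetime import datetime


def parse_ts(value):
    # Timestamps end in "Z"; normalise so fromisoformat accepts them on older Pythons.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_redeliveries(deliveries):
    """Return one delivery id per workflow_job GUID that has no successful attempt."""
    by_guid = defaultdict(list)
    for d in deliveries:
        if d["event"] == "workflow_job":  # other event types are out of scope
            by_guid[d["guid"]].append(d)

    selected = []
    for attempts in by_guid.values():
        # Any successful attempt means the event already reached the endpoint.
        if any(a["status"] == "OK" for a in attempts):
            continue
        # Redeliver from the most recent failed attempt.
        newest = max(attempts, key=lambda a: parse_ts(a["delivered_at"]))
        selected.append(newest["id"])
    return sorted(selected)


# Illustrative data only.
sample = [
    {"id": 1, "guid": "a", "event": "workflow_job", "status": "failed",
     "delivered_at": "2026-06-24T10:00:00Z"},
    {"id": 2, "guid": "a", "event": "workflow_job", "status": "failed",
     "delivered_at": "2026-06-24T10:01:00Z"},
    {"id": 3, "guid": "b", "event": "workflow_job", "status": "failed",
     "delivered_at": "2026-06-24T10:02:00Z"},
    {"id": 4, "guid": "b", "event": "workflow_job", "status": "OK",
     "delivered_at": "2026-06-24T10:03:00Z"},
    {"id": 5, "guid": "c", "event": "push", "status": "failed",
     "delivered_at": "2026-06-24T10:04:00Z"},
]
print(select_redeliveries(sample))  # [2]
```

Only delivery 2 is selected: GUID "a" has two failed attempts and the newer one wins, GUID "b" already has a successful attempt, and GUID "c" is not a workflow_job event.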

The redelivery hits the same handle_workflow_job path as a fresh delivery, going through DispatchWorker → Dispatch.handle_webhook → Jobs.enqueue. No separate recovery codepath, no payload reconstruction, no risk of recovery-side enqueue diverging from webhook-side enqueue.

Why this shape vs the alternatives

Per-repo enumeration (the design we explored first): O(repos × installations) API calls per cycle. A customer with hundreds of repos exhausts the 5000 req/hr installation rate-limit fast. Also requires walking ?status=queued AND ?status=in_progress to catch matrix/needs:-downstream jobs, and paginating jobs to handle matrix expansion. Lots of code, lots of edge cases.
Long-poll listener (the ARC v2 model): Uses GitHub-internal /_apis/distributedtask/... endpoints. Not a public-stable surface; would also require per-tenant persistent processes. Doesn’t fit our shape.
Webhook delivery log + redelivery: O(failures) per cycle App-wide, independent of repo count or installation count. Single codepath. GitHub-blessed pattern.
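
To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. The numbers are illustrative assumptions, not measurements: it assumes the per-repo design would have run on the same 5-minute cadence, counts only two list calls per repo (status=queued and status=in_progress) and ignores job pagination, so it is a lower bound. The function names are hypothetical.

```python
CYCLES_PER_HOUR = 60 // 5            # assumed: same 5-minute cadence for both designs
INSTALLATION_LIMIT_PER_HOUR = 5000   # rate limit quoted in the text above


def per_repo_calls_per_hour(repos, list_calls_per_repo=2):
    # Lower bound: one list call each for queued and in_progress runs per repo,
    # ignoring the extra calls needed to page through each run's jobs.
    return repos * list_calls_per_repo * CYCLES_PER_HOUR


def delivery_log_calls_per_hour(pages_per_cycle, failed_guids_per_cycle):
    # One list call per page of the delivery log plus one POST per failed GUID.
    return (pages_per_cycle + failed_guids_per_cycle) * CYCLES_PER_HOUR


print(per_repo_calls_per_hour(300))        # 7200, above the 5000/hr limit
print(delivery_log_calls_per_hour(2, 5))   # 84, independent of repo count
```

Under these assumptions a 300-repo customer already exceeds the hourly budget with the per-repo design, while the delivery-log design’s cost depends only on how many deliveries failed.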

Design decisions

Cadence: 5 min. This diverges from GitHub’s documented 6h example (which assumes the script runs as a GitHub Actions workflow). For a runners product, customer-visible queue stalls lasting more than a few minutes are unacceptable. With a 15-min lookback and a 5-min cadence, each failed delivery stays inside the window for three consecutive cycles, which gives 3x overlap on persistent failures.
GUID-based dedup, not workflow_job_id. The list response doesn’t include workflow_job_id (getting it would require a GET .../{id} per delivery), but it does include guid. Following GitHub’s pattern, the worker checks per GUID whether any attempt succeeded, i.e. Enum.any?(attempts, & &1.status == "OK"). A successful redelivery from the previous cycle shows up as a new delivery with the same GUID and status="OK", so subsequent cycles naturally skip it.
Workflow_job event filter. The App’s delivery log also includes push, pull_request, etc. The runners worker only redelivers workflow_job events; other event recovery is out of scope.
Most-recent-attempt redelivery. When multiple failed attempts share a GUID, redeliver from the newest — GitHub preserves the request metadata of that specific attempt.
Transient failures: log and continue. Both the list call and the per-redelivery call can return errors (rate-limit, 5xx, 422). The worker logs the error and continues; the next 5-min cycle retries naturally.
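
The log-and-continue behaviour described in the last item could look roughly like the following. This is a hedged Python sketch rather than the worker’s Elixir code: ApiError, run_cycle, and the injected list_failed and redeliver callables are all hypothetical stand-ins for the GitHub client calls.

```python
import logging

logger = logging.getLogger("webhook_redelivery")


class ApiError(Exception):
    """Hypothetical error for a non-2xx GitHub response (rate limit, 5xx, 422)."""

    def __init__(self, status):
        super().__init__(f"GitHub API returned {status}")
        self.status = status


def run_cycle(list_failed, redeliver):
    """Run one cycle and return how many redeliveries were requested successfully."""
    try:
        delivery_ids = list_failed()
    except ApiError as err:
        # List call failed: skip this cycle; the next one scans an overlapping window.
        logger.warning("list deliveries failed: %s", err)
        return 0

    redelivered = 0
    for delivery_id in delivery_ids:
        try:
            redeliver(delivery_id)
            redelivered += 1
        except ApiError as err:
            # One failed redelivery must not block the rest of the batch.
            logger.warning("redelivery of %s failed: %s", delivery_id, err)
    return redelivered


def flaky_redeliver(delivery_id):
    # Simulates a 422 for a single delivery.
    if delivery_id == 2:
        raise ApiError(422)


def failing_list():
    # Simulates a transient failure of the list call.
    raise ApiError(503)


print(run_cycle(lambda: [1, 2, 3], flaky_redeliver))  # 2
print(run_cycle(failing_list, flaky_redeliver))       # 0
```

Because nothing is persisted between cycles, a skipped or partially failed cycle needs no special recovery: the overlapping lookback window picks the same deliveries up again.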

Test plan

10 tests in test/tuist/runners/workers/webhook_redelivery_worker_test.exs:

No-op when no failed deliveries
Redelivers a failed workflow_job
GUID dedup: skip when any attempt for a GUID has status="OK"
Skips non-workflow_job events (push / PR / etc.)
Redelivers the most-recent attempt when multiple failed attempts share a GUID
Paginates through delivery pages
Stops paginating once a page contains pre-threshold entries (cost cap)
Skips on transient list-deliveries failure
Skips on 422 from redelivery endpoint
Emits tuist_runners_recovery_count{kind="redelivered"} telemetry

Plus existing runners suite (test/tuist/runners/, 77 tests) — green. mix credo clean. mix format applied.
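
The pagination cost cap exercised by the tests above can be illustrated with a small sketch. It is written in Python for illustration and is not the worker’s implementation: collect_recent and fetch_page are hypothetical names, fetch_page is assumed to return a (deliveries, next_cursor) pair, and pages are assumed to arrive newest first.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    # Timestamps end in "Z"; normalise so fromisoformat accepts them on older Pythons.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def collect_recent(fetch_page, now, lookback=timedelta(minutes=15)):
    """Collect deliveries inside the lookback window, stopping once a page reaches past it."""
    cutoff = now - lookback
    recent, cursor = [], None
    while True:
        page, cursor = fetch_page(cursor)
        recent.extend(d for d in page if parse_ts(d["delivered_at"]) >= cutoff)
        # A pre-cutoff entry means every later page is older still (newest-first assumption).
        reached_cutoff = any(parse_ts(d["delivered_at"]) < cutoff for d in page)
        if reached_cutoff or cursor is None:
            return recent


# Illustrative pages keyed by cursor; the first request uses cursor None.
now = datetime(2026, 6, 24, 12, 0, tzinfo=timezone.utc)
pages = {
    None: ([{"id": 3, "delivered_at": "2026-06-24T11:58:00Z"}], "p2"),
    "p2": ([{"id": 2, "delivered_at": "2026-06-24T11:50:00Z"},
            {"id": 1, "delivered_at": "2026-06-24T11:30:00Z"}], "p3"),
    "p3": ([{"id": 0, "delivered_at": "2026-06-24T11:00:00Z"}], None),
}
calls = []


def fetch_page(cursor):
    calls.append(cursor)
    return pages[cursor]


ids = [d["id"] for d in collect_recent(fetch_page, now)]
print(ids)    # [3, 2]
print(calls)  # [None, 'p2'] (page "p3" is never fetched)
```

Stopping at the first page that contains a pre-threshold entry bounds the number of list calls per cycle by the failure volume inside the window rather than by the full size of the delivery log.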

What’s not in scope here

Other event types (e.g. missed installation.created). Could be added by relaxing the event filter, but each event has different downstream semantics worth case-by-case review.
workflow_job.completed recovery via local CH update. OrphanedRunnersWorker already handles this for rows in status='running'. A completed redelivery would also flow through and hit the same mark_completed path — belt-and-braces but not strictly necessary.
Retention window beyond GitHub’s default. GitHub keeps deliveries for ~30 days (per their docs UX, not formally guaranteed in the REST docs). 15-min lookback is well inside that.
