
feat(server): recover failed webhook deliveries via GitHub’s redelivery API

GitHub issue · Closed

Metadata
Source: tuist/tuist #10909
Updated: Jun 24, 2026
Domains: Compute
Details

Summary

Adds Tuist.Runners.Workers.WebhookRedeliveryWorker, an Oban cron worker that asks GitHub to redeliver any workflow_job webhook deliveries that failed within a recent window. This closes a P0 correctness gap in the runners architecture, where a never-delivered webhook leaves the customer’s workflow_job stuck in GitHub’s queue indefinitely.

This PR went through one design pivot before landing — see the PR commits for the evolution. The shipped design uses GitHub’s App-level webhook delivery log + redelivery endpoints; the earlier design enumerated workflow_runs/jobs per-repo and didn’t scale to large installations.

Why

The workflow_job: queued webhook is currently the only path that inserts a row into ClickHouse runner_jobs. Failure modes that bypass it:

  • GitHub provider outage exhausting webhook retries
  • Our endpoint returning 5xx during a deploy window
  • Signature verification failure
  • Replay-after-deploy where GitHub stopped retrying before our endpoint came back

The existing OrphanedRunnersWorker only reconciles rows that already exist in ClickHouse (CH) with status='running'. The never-delivered case has no recovery today: the customer’s workflow_job sits in GitHub’s queue indefinitely with no row, no metric, and no alert on our side.

What changed

Mirrors GitHub’s documented pattern:

  1. GET /app/hook/deliveries?status=failure (App-wide, paginated) across a 15-min lookback window.
  2. Filter to event="workflow_job".
  3. Group attempts by guid (constant across original delivery and any redelivery attempts of the same logical event).
  4. For each GUID whose attempts include no successful one, POST /app/hook/deliveries/{id}/attempts on the most recent attempt — GitHub re-fires through our normal webhook URL.
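
Steps 2 to 4 above can be sketched as a pure function over the delivery list. This is a minimal illustration, not the worker's actual code: the module and function names are hypothetical, the map keys follow the fields GitHub returns in its delivery list (id, guid, event, status, delivered_at), and it assumes the input holds every attempt seen in the lookback window, successful ones included.

```elixir
defmodule RedeliverySketch do
  @moduledoc false

  # Hypothetical helper. Returns the delivery ids that should be redelivered,
  # sorted so the result is deterministic.
  def ids_to_redeliver(deliveries) do
    deliveries
    # Step 2: only workflow_job events are in scope.
    |> Enum.filter(&(&1["event"] == "workflow_job"))
    # Step 3: the guid is shared by the original delivery and its redeliveries.
    |> Enum.group_by(& &1["guid"])
    # Step 4a: skip any GUID that already has a successful attempt.
    |> Enum.reject(fn {_guid, attempts} ->
      Enum.any?(attempts, &(&1["status"] == "OK"))
    end)
    # Step 4b: pick the most recent attempt. Comparing the strings is enough
    # as long as all timestamps are ISO 8601 in UTC with the same precision.
    |> Enum.map(fn {_guid, attempts} ->
      attempts |> Enum.max_by(& &1["delivered_at"]) |> Map.fetch!("id")
    end)
    |> Enum.sort()
  end
end
```

Each returned id would then be passed to POST /app/hook/deliveries/{id}/attempts.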

The redelivery hits the same handle_workflow_job path as a fresh delivery, going through DispatchWorkerDispatch.handle_webhookJobs.enqueue. No separate recovery codepath, no payload reconstruction, no risk of recovery-side enqueue diverging from webhook-side enqueue.

Why this shape vs the alternatives

  • Per-repo enumeration (the design we explored first): O(repos × installations) API calls per cycle. A customer with hundreds of repos exhausts the 5000 req/hr installation rate-limit fast. Also requires walking ?status=queued AND ?status=in_progress to catch matrix/needs:-downstream jobs, and paginating jobs to handle matrix expansion. Lots of code, lots of edge cases.
  • Long-poll listener (the ARC v2 model): Uses GitHub-internal /_apis/distributedtask/... endpoints. Not a public-stable surface; would also require per-tenant persistent processes. Doesn’t fit our shape.
  • Webhook delivery log + redelivery: O(failures) per cycle App-wide, independent of repo count or installation count. Single codepath. GitHub-blessed pattern.

Design decisions

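  • Cadence: 5 min. Diverges from GitHub’s documented 6h example (which assumes the script runs as a GitHub Actions workflow). For a runners product, customer-visible queue stalls lasting more than a few minutes are unacceptable. A 5-min cadence with a 15-min lookback means each failed delivery falls inside three consecutive cycles (15 / 5 = 3), so a persistent failure gets three redelivery attempts before it ages out of the window.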
  • GUID-based dedup, not workflow_job_id. The list response doesn’t include workflow_job_id (would require a GET .../{id} per delivery), but it does include guid. GitHub’s pattern is to check Enum.any?(attempts, & &1.status == "OK") per GUID — a successful redelivery from the previous cycle shows up as a new delivery with the same GUID and status="OK", so subsequent cycles naturally skip it.
  • Workflow_job event filter. The App’s delivery log also includes push, pull_request, etc. The runners worker only redelivers workflow_job events; other event recovery is out of scope.
  • Most-recent-attempt redelivery. When multiple failed attempts share a GUID, redeliver from the newest — GitHub preserves the request metadata of that specific attempt.
  • Transient failures log + continue. Both the list call and the per-redelivery call can return errors (rate-limit, 5xx, 422). Worker logs and continues; next 5-min cycle retries naturally.
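
For reference, a 5-minute cadence is typically wired up through Oban's Cron plugin. The fragment below is a sketch only: it assumes the OTP app is named :tuist and that Oban is configured under that key, and the real repo may register the worker elsewhere or with extra options.

```elixir
import Config

# Assumption: app name :tuist; adjust to wherever Oban is actually configured.
config :tuist, Oban,
  plugins: [
    {Oban.Plugins.Cron,
     crontab: [
       # Run every 5 minutes. The 15-minute lookback lives in the worker,
       # so each failure is visible to three consecutive runs.
       {"*/5 * * * *", Tuist.Runners.Workers.WebhookRedeliveryWorker}
     ]}
  ]
```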

Test plan

10 tests in test/tuist/runners/workers/webhook_redelivery_worker_test.exs:

  • No-op when no failed deliveries
  • Redelivers a failed workflow_job
  • GUID dedup: skip when any attempt for a GUID has status="OK"
  • Skips non-workflow_job events (push / PR / etc.)
  • Redelivers the most-recent attempt when multiple failed attempts share a GUID
  • Paginates through delivery pages
  • Stops paginating once a page contains pre-threshold entries (cost cap)
  • Skips on transient list-deliveries failure
  • Skips on 422 from redelivery endpoint
  • Emits tuist_runners_recovery_count{kind="redelivered"} telemetry
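
The pagination cost cap covered by the tests above could look roughly like the sketch below. It is illustrative rather than the shipped implementation: the module name, the injected fetch_page function, and its return shape are all hypothetical stand-ins for the GitHub client call.

```elixir
defmodule PaginationSketch do
  @moduledoc false

  # `fetch_page` is a hypothetical function taking a cursor (nil for the first
  # page) and returning {:ok, deliveries, next_cursor} or {:error, reason}.
  # `cutoff` is a DateTime marking the start of the lookback window.
  def collect_recent(fetch_page, cutoff, cursor \\ nil, acc \\ []) do
    case fetch_page.(cursor) do
      {:ok, deliveries, next_cursor} ->
        {recent, older} = Enum.split_with(deliveries, &recent?(&1, cutoff))
        acc = acc ++ recent

        # Deliveries come newest first, so one pre-cutoff entry means every
        # later page is out of the window and not worth fetching.
        if older == [] and next_cursor != nil do
          collect_recent(fetch_page, cutoff, next_cursor, acc)
        else
          {:ok, acc}
        end

      {:error, reason} ->
        # Transient failure: the caller logs and waits for the next cycle.
        {:error, reason}
    end
  end

  defp recent?(%{"delivered_at" => timestamp}, cutoff) do
    {:ok, delivered_at, _offset} = DateTime.from_iso8601(timestamp)
    DateTime.compare(delivered_at, cutoff) != :lt
  end
end
```

Injecting fetch_page keeps the function testable without HTTP stubs, which is one way to exercise the pagination and cost-cap cases in isolation.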

Plus existing runners suite (test/tuist/runners/, 77 tests) — green. mix credo clean. mix format applied.

What’s not in scope here

  • Other event types (e.g. missed installation.created). Could be added by relaxing the event filter, but each event has different downstream semantics worth case-by-case review.
  • workflow_job.completed recovery via local ClickHouse update. OrphanedRunnersWorker already handles this for rows in status='running'. A redelivered completed event would also flow through the normal webhook path and hit the same mark_completed path, which is belt-and-braces but not strictly necessary.
  • Retention window beyond GitHub’s default. GitHub keeps deliveries for ~30 days (per their docs UX, not formally guaranteed in the REST docs). 15-min lookback is well inside that.