feat(server): reap runner jobs stranded in the queued phase

Metadata

Source: tuist/tuist #11455
Updated: Jun 24, 2026
Domains: Compute

Details

What changed

Adds StaleQueuedJobsWorker, a recovery worker that resolves runner_jobs rows stranded in the queued phase. It runs on the hosted-only crontab every 5 minutes.

A workflow_job recorded in the ClickHouse runner_jobs table advances through queued → claimed → running → completed, one INSERT per transition. Recovery previously existed only for the later states:

- StaleClaimsWorker reaps Postgres claimed rows after 5 minutes.
- OrphanedRunnersWorker reaps ClickHouse running rows by cross-checking GitHub.
- WebhookRedeliveryWorker re-requests failed webhook deliveries, but only within a 15-minute window.

Nothing reaped the queued state.

Why

A row stranded in queued is invisible to every existing recovery path: it has no Postgres claim (so StaleClaimsWorker can’t see it) and never reaches running (so OrphanedRunnersWorker can’t see it). The only thing that moves a queued row to a terminal state is the workflow_job.completed webhook. When that webhook never fires (GitHub keeps the job queued on its side because no self-hosted runner ever accepts it) or is lost past the 15-minute redelivery window, the row sits at queued on the dashboard indefinitely.

This surfaced as jobs showing Queued for 28 days in the Runners dashboard, with GitHub itself also reporting the same jobs as queued (never picked up). The root cause is that the lifecycle design implicitly trusted GitHub to always deliver a terminal completed event; when that assumption breaks, the row is orphaned forever.

How the fix works

StaleQueuedJobsWorker lists queued rows older than 1h (Jobs.list_stale_queued/1) and, per row, cross-checks GitHub the same way OrphanedRunnersWorker does (GET /repos/{owner}/{repo}/actions/jobs/{id}):

| GitHub status | Action |
| --- | --- |
| `completed` | Reconcile: mark completed with GitHub’s real conclusion (we missed the webhook) |
| `404` (pruned) | Complete: it cannot be live |
| `in_progress` | Leave it: a runner accepted it; it will fire `completed` within GitHub’s per-job limit |
| `queued` | Leave it: still legitimately pending, unless past the hard backstop |
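
The lookup step behind this table can be sketched as a small classifier. This is an illustrative Python sketch, not the worker's actual code: the names `job_url` and `classify`, the string labels, and the `unverifiable` bucket for non-200 responses or unexpected statuses are assumptions made for the example. The endpoint path is the one quoted above, and `status` and `conclusion` are fields of GitHub's workflow job response.

```python
import json
from typing import Optional, Tuple


def job_url(owner: str, repo: str, job_id: int) -> str:
    """Build the GitHub REST URL for a single workflow job."""
    return f"https://api.github.com/repos/{owner}/{repo}/actions/jobs/{job_id}"


def classify(http_status: int, body: str) -> Tuple[str, Optional[str]]:
    """Map a raw HTTP response to (lookup_result, conclusion).

    lookup_result is one of: completed, in_progress, queued,
    not_found, unverifiable (hypothetical labels for this sketch).
    """
    if http_status == 404:
        # The job was pruned on GitHub's side, so it cannot be live.
        return ("not_found", None)
    if http_status != 200:
        # Assumption: rate limits, 5xx, and similar mean "could not verify".
        return ("unverifiable", None)
    try:
        payload = json.loads(body)
    except ValueError:
        return ("unverifiable", None)
    if not isinstance(payload, dict):
        return ("unverifiable", None)

    status = payload.get("status")
    if status == "completed":
        # Carry GitHub's real conclusion (success, failure, ...) forward.
        return ("completed", payload.get("conclusion"))
    if status in ("queued", "in_progress"):
        return (status, None)
    # Assumption: any other status is treated as not verifiable.
    return ("unverifiable", None)
```

Keeping the classification separate from the HTTP call makes each row of the table testable without network access.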

Hard backstop at 24h: a row queued for longer than 24h is force-completed with conclusion "stale" if GitHub still reports it as queued, if it cannot be looked up at all (legacy pre-profiles rows with an empty repository), or if its status cannot be verified (GitHub API down). Past that age nothing will ever move the row, so the guarantee that a job cannot stay stuck in queued holds even when GitHub never resolves it. in_progress is the only state that is never force-completed, because it is provably live. Before recording the terminal state in ClickHouse, the worker defensively frees any leaked Postgres claim (Postgres first), matching the webhook completion path.
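
Taken together, the table and the backstop amount to a pure decision over a row's age and its lookup result. The following Python sketch illustrates that decision under stated assumptions; it is not the worker's implementation. The function name `decide`, the `leave`/`complete` action labels, the lookup labels, and the strict "older than" comparisons at the 1h and 24h boundaries are hypothetical. The source also does not say which conclusion is recorded for a pruned (404) job, so the sketch returns `None` there rather than guessing.

```python
from datetime import timedelta
from typing import Optional, Tuple

STALE_AFTER = timedelta(hours=1)     # rows younger than this are not candidates
HARD_BACKSTOP = timedelta(hours=24)  # past this, unresolved rows are force-completed


def decide(
    age: timedelta,
    repository: str,
    lookup: Optional[str],
    conclusion: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Return (action, conclusion) for one queued row.

    lookup is one of: completed, not_found, in_progress, queued,
    unverifiable, or None when no lookup was possible.
    """
    if age <= STALE_AFTER:
        return ("leave", None)  # not yet a stale candidate

    past_backstop = age > HARD_BACKSTOP

    if not repository:
        # Legacy row with no repository: GitHub cannot be asked at all.
        return ("complete", "stale") if past_backstop else ("leave", None)

    if lookup == "completed":
        return ("complete", conclusion)  # reconcile the missed webhook
    if lookup == "not_found":
        return ("complete", None)  # conclusion unspecified in the source
    if lookup == "in_progress":
        return ("leave", None)  # provably live, never force-completed

    # queued, unverifiable, or no lookup result: only the backstop acts.
    return ("complete", "stale") if past_backstop else ("leave", None)
```

For example, `decide(timedelta(hours=25), "tuist/tuist", "queued")` yields `("complete", "stale")`, while the same row with an `in_progress` lookup is left alone regardless of age.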

Why a GitHub cross-check rather than a pure time-based reaper

Cross-checking yields the correct conclusion for the recoverable case (a missed completed webhook shows success/failure rather than a synthetic stale), and it mirrors the existing OrphanedRunnersWorker philosophy that GitHub’s status is the source of truth. The time-based backstop is retained purely as the airtight guarantee for the unrecoverable case.

Impact

No job can remain in the queued phase indefinitely. Genuinely pending jobs (account at cap, pool scaling) are not examined during their first hour in queued, and after that they are acted on only when GitHub reports them as terminal or the 24h backstop fires. Steady-state GitHub API cost is near zero because the candidate set is normally empty.

How to test locally

```sh
cd server
mix test test/tuist/runners/workers/stale_queued_jobs_worker_test.exs \
         test/tuist/runners/jobs_test.exs \
         test/tuist/oban/runtime_config_test.exs
```

Validation run: 64 tests pass (worker branch coverage including the 24h backstop reap, the lookup-failure backstop, and the no-repository path; list_stale_queued/1; and crontab membership), credo clean, format clean.
