fix(capi-scaleway, server): unstick orphaned running workflow_jobs end-to-end

GitHub issue · Closed

Source: tuist/tuist #10828
Updated: Jun 24, 2026
Domains: Compute

Two fixes targeting the same failure mode that left job 76348428905 stuck in GitHub’s queue for 100+ minutes on 2026-05-16:

  1. Ship the existing pf-table-busy fix in macos-host-bootstrap to production by flipping the capi-scaleway cliff to path-based filtering.
  2. Server-side recovery for the case where the Pod claims a JIT but the runner agent never registers with GitHub.

Background — what happened to shard 0

Investigation pinned the cause to a degraded node (`m789d`) in the runner fleet. The pod was scheduled and the dispatch endpoint successfully transitioned the row to `status='running'`, but the runner container never started: no `Pulling` / `Pulled` / `Created` / `Started` events from the kubelet, no runner registration with GitHub.

Why the node was degraded: every host in the `mndbc` fleet was looping on `BootstrapFailed` (2,300+ retries) with:

```
/etc/pf.anchors/tuist.runners:11: cannot define table vm_sources: Resource busy
/etc/pf.anchors/tuist.runners:12: cannot define table blocked_dst: Resource busy
pfctl: Syntax error in config file: pf rules not loaded
```

The fix for this exists in bootstrap.go:745 (commit 0eabd38e6b) — `pfctl -a tuist.runners -F all` before reload — but the operator image in production is pinned to `af8204b1` (May 12), pre-fix.
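The flush-then-reload order can be sketched as follows. This is only an illustration of the sequence, not the code in bootstrap.go: the `pfctl -a tuist.runners -F all` command comes from the description above, while the reload command (`-f` with the anchor file), the `reload_anchor` function and the injectable `run` parameter are assumptions made for this sketch.

```python
import subprocess
from typing import Callable, List, Sequence

ANCHOR = "tuist.runners"
ANCHOR_FILE = "/etc/pf.anchors/tuist.runners"

Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    # Raises CalledProcessError if pfctl exits non-zero.
    subprocess.run(list(cmd), check=True)


def reload_anchor(run: Runner = _run) -> List[List[str]]:
    """Flush the anchor, then reload its rules. Returns the commands issued."""
    commands = [
        # From the PR text: flush everything under the anchor first, so the
        # tables left over from the previous load no longer report "Resource busy".
        ["pfctl", "-a", ANCHOR, "-F", "all"],
        # Assumed reload step: load the anchor's rules from its file.
        ["pfctl", "-a", ANCHOR, "-f", ANCHOR_FILE],
    ]
    for cmd in commands:
        run(cmd)
    return commands
```

Passing a fake `run` (for example `recorded.append`) makes the ordering checkable without touching pf on the host.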

Half 1 — `refactor(capi-scaleway)`: trigger the operator-image release

#10798 was squashed as `fix(infra):`. The capi-scaleway cliff filtered on `^fix(capi-scaleway)`, so the squashed commit didn’t match and the operator-image release-on-bump never fired.

Switch to type-only parsers. Path filtering via `--include-path "infra/cluster-api-provider-scaleway-applesilicon/**/*"` plus `--include-path "infra/macos-host-bootstrap/**/*"` (both already wired up in release.yml) covers every change. Mirrors the runners-controller / runner-image / xcresult-processor cliffs from #10824.

This commit itself touches the capi-scaleway include-path with a `refactor:` prefix, so the next release.yml run releases a new operator image baking in the existing pf fix — unsticking the looping nodes.
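A rough model of the two filtering strategies shows why the squashed commit was missed and why the path-based variant catches it. Everything here is a simplification for illustration: the commit subjects and file paths are hypothetical, the scope regex is written with escaped parentheses (an assumption about the real parser), the set of accepted types is assumed, and a plain prefix check stands in for real glob matching.

```python
import re
from typing import List

# Old behaviour: the parser only accepts commits scoped to capi-scaleway.
OLD_PARSER = re.compile(r"^fix\(capi-scaleway\)")

# New behaviour: the parser looks at the commit type only (assumed type set).
NEW_PARSER = re.compile(r"^(fix|feat|refactor)(\(|:)")

# Stand-in for the two --include-path globs named above.
INCLUDE_PREFIXES = (
    "infra/cluster-api-provider-scaleway-applesilicon/",
    "infra/macos-host-bootstrap/",
)


def old_filter(subject: str) -> bool:
    """Scope-based: release only if the subject carries the exact scope."""
    return bool(OLD_PARSER.search(subject))


def new_filter(subject: str, changed_paths: List[str]) -> bool:
    """Path-based: release if the commit touches an included path and has a known type."""
    touches_included = any(p.startswith(INCLUDE_PREFIXES) for p in changed_paths)
    return touches_included and bool(NEW_PARSER.search(subject))


# Hypothetical squashed commit shaped like #10798.
subject = "fix(infra): flush pf tables before reload"
paths = ["infra/macos-host-bootstrap/bootstrap.go"]

print(old_filter(subject))          # False: scope is "infra", not "capi-scaleway"
print(new_filter(subject, paths))   # True: the changed path decides, not the scope
```

With path filtering, the scope a squash merge happens to pick no longer matters; only the files the commit touches do.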

Half 2 — `feat(server)`: `OrphanedRunnersWorker`

Even with the infra fix deployed, the architecture had a gap: there’s no recovery path for “Postgres/ClickHouse (PG/CH) says `running`, the JIT was minted, but the runner never registered with GitHub”. The existing `StaleClaimsWorker` only handles `lifecycle_state='claimed'` rows; running rows are explicitly excluded because real running builds hold the slot for hours and reaping at the 5-min threshold would kill live work.

The signal that distinguishes a real running build from an orphaned mint is the GitHub-side status of the workflow_job. If GitHub still reports the job as `queued` after we’ve transitioned through `claimed → running` locally, the runner never came up.

New worker:

  1. Lists `runner_jobs FINAL` rows in `status='running'` with `started_at` older than 5 min.
  2. For each, `GET /repos/{owner}/{repo}/actions/jobs/{id}` on the org’s GitHub App installation.
  3. If GH returns `queued` → re-queue in CH (`record_queued`) + release the PG claim. Another Pod picks it up.
  4. If `in_progress` → real running build, leave alone.
  5. If `completed` → out of scope; webhook redelivery handles it.
  6. Transient lookup failure → log + retry next tick.

Recovery order mirrors `StaleClaimsWorker` (CH-first; crash between CH and PG leaves the row safely re-queueable on the next tick).
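The decision table and the CH-first ordering above can be sketched in a few lines. This is a language-neutral illustration in Python, not the Elixir worker itself: `RunningJob`, `sweep` and the three injected callbacks (`lookup_status`, `requeue_in_ch`, `release_pg_claim`) are hypothetical names standing in for the ClickHouse query, the GitHub API call, `record_queued` and the PG claim release.

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List

log = logging.getLogger("orphaned_runners")

ORPHAN_THRESHOLD = timedelta(minutes=5)


@dataclass(frozen=True)
class RunningJob:
    job_id: int
    started_at: datetime


def sweep(
    rows: Iterable[RunningJob],
    now: datetime,
    lookup_status: Callable[[int], str],
    requeue_in_ch: Callable[[int], None],
    release_pg_claim: Callable[[int], None],
) -> List[int]:
    """One tick: re-queue every running row that GitHub still reports as queued."""
    recovered = []
    for job in rows:
        if now - job.started_at <= ORPHAN_THRESHOLD:
            continue  # too young to be a candidate
        try:
            status = lookup_status(job.job_id)
        except Exception as exc:  # transient lookup failure
            log.warning("lookup failed for job %s: %s", job.job_id, exc)
            continue  # retry next tick
        if status != "queued":
            continue  # in_progress: live build; completed: left to webhook redelivery
        # CH first, then PG: a crash between the two leaves the row
        # re-queueable on the next tick.
        requeue_in_ch(job.job_id)
        release_pg_claim(job.job_id)
        recovered.append(job.job_id)
    return recovered
```

Because the side effects are injected, each of the cases listed above can be exercised with plain fakes, and the call order (lookup, then CH, then PG) is directly observable.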

GitHub API cost: one call per candidate per minute. Steady-state orphaned candidates are zero; even with 5 concurrent real running builds, each polled once a minute, that’s 5 × 60 = 300 calls/hr/installation, well under the 5,000/hr app-token limit.
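The budget arithmetic is easy to check. The helper names below are made up for this sketch, and the one-minute tick and 5,000/hr limit are the figures quoted above.

```python
APP_TOKEN_LIMIT_PER_HOUR = 5000  # limit quoted in the description
TICK_SECONDS = 60                # one sweep per minute


def calls_per_hour(candidates: int, tick_seconds: int = TICK_SECONDS) -> int:
    """GitHub lookups per hour when every candidate is polled once per tick."""
    return candidates * (3600 // tick_seconds)


def max_candidates(limit: int = APP_TOKEN_LIMIT_PER_HOUR,
                   tick_seconds: int = TICK_SECONDS) -> int:
    """Largest candidate count that stays within the hourly limit."""
    return limit // (3600 // tick_seconds)


print(calls_per_hour(5))   # 300
print(max_candidates())    # 83
```

Under these assumptions the worker alone would need more than 83 long-running rows per installation before it exhausted the budget, ignoring any other API traffic on the same token.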

Tests

`mix test test/tuist/runners/workers/orphaned_runners_worker_test.exs` covers:

  • GH `queued` → re-queue + release (the shard 0 case)
  • GH `in_progress` → no-op (real running build)
  • GH `completed` → no-op (webhook redelivery handles)
  • Transient HTTP 502 → no-op + retry next tick
  • Empty candidate list → no GH calls made

5 passing, 0 failing.

How to test

  • On merge, release.yml fires `release-capi-scaleway` and bumps the operator-image tag in the chart.
  • Next server deploy applies the new operator image; the existing bootstrap loop on `mndbc` fleet hosts recovers within one retry cycle (`pfctl -F` flushes the persist tables, reload succeeds).
  • `mix test test/tuist/runners/workers/orphaned_runners_worker_test.exs` clean.
  • Once deployed: the next time the failure mode hits, the new worker should re-queue the affected workflow_job within a minute instead of stranding it for hours.

🤖 Generated with Claude Code
