fix(server): scale flaky-alert rolling recovery query so alerts don’t wedge

GitHub issue · Closed

Metadata

Source

tuist/tuist #11462

Updated

Jun 24, 2026

Details

Purpose

Recovery-enabled rolling flaky-test alerts were silently going dormant: they stopped marking, muting, and recovering tests entirely, while the rest of the automation engine kept running. The highest-volume alerts (those quarantining the most tests) were the ones that died.

Root cause

Rolling-window alerts that have a baseline are not cadence-scheduled; they run only through the scoped/incremental evaluation path, and only recovery_enabled rolling alerts execute batch_runs_since_trigger. That query had two problems that both scale with the number of quarantined tests:

It used where: r.test_case_id in ^ids, the IN ^list form that overflows ClickHouse’s request limits for large sets. This is exactly why Tests.test_case_ids_with_successful_default_branch_run batches with an Array(UUID) parameter instead.
It streamed every run since the oldest candidate’s triggered_at back to the app to count in Elixir — millions of rows for a long-muted, high-frequency test.

When the query raised, the scoped cursor (last_scoped_evaluation_inserted_at) — which is advanced only after all chunks process successfully — never moved. The next tick then re-fetched an ever-growing window since the last success, raised again, and so on: a permanent, worsening wedge. Because only recovery-enabled rolling alerts hit this query, low-volume and recovery-disabled alerts were unaffected, which matched the observed pattern (high-recovery-count alerts dead, everything else healthy).
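The wedge can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the worker's code: `simulate`, `rows_per_tick`, and `row_limit` are hypothetical names, and a plain row ceiling stands in for whatever limit makes the real query raise. The one property it models is the one described above: the cursor moves only after a tick succeeds.

```python
def simulate(rows_per_tick: list[int], row_limit: int) -> list[bool]:
    """Return, per tick, whether the evaluation succeeded.

    `cursor` plays the role of last_scoped_evaluation_inserted_at: it marks
    the first tick whose rows have not been processed yet.
    """
    cursor = 0
    outcomes = []
    for now in range(1, len(rows_per_tick) + 1):
        # Everything inserted since the last successful tick is re-fetched.
        window = sum(rows_per_tick[cursor:now])
        if window > row_limit:
            # The query raises, so the cursor stays where it was.
            outcomes.append(False)
            continue
        # Advance only after the whole window was processed successfully.
        cursor = now
        outcomes.append(True)
    return outcomes


# One oversized tick (1200 rows against a limit of 1000) is enough: every
# later tick re-fetches it plus whatever arrived since, so it never succeeds.
print(simulate([400, 400, 1200, 100, 100], row_limit=1000))
```

In this model the failing window grows from 1200 to 1300 to 1400 rows, which is the "permanent, worsening" shape described above.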

What changed

batch_runs_since_trigger now:

Aggregates the run count inside ClickHouse (GROUP BY test_case_id, one row per candidate) instead of streaming raw runs back to count in Elixir.
Applies each candidate’s own triggered_at cutoff exactly, via toUnixTimestamp64Micro(ran_at) > arrayElement(cutoffs, indexOf(ids, test_case_id)) where cutoffs is positionally aligned with the id list.
Processes candidates in batches (@recovery_candidate_batch_size) using the Array(UUID) fragment + multipart: true, so the parameter and scan stay within ClickHouse’s request limits regardless of how many tests an alert has quarantined.
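The three properties above can be modelled in a few lines of Python. This is a hedged sketch of the logic only, not the Elixir or SQL in the PR: `runs_since_trigger`, `BATCH_SIZE`, and the tuple shapes are hypothetical, timestamps are plain integers, and the in-memory loop stands in for the aggregation that the real query performs inside ClickHouse.

```python
from collections import Counter
from typing import Iterator

BATCH_SIZE = 2  # hypothetical; the real value is @recovery_candidate_batch_size


def batches(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def runs_since_trigger(
    candidates: list[tuple[str, int]],  # (test_case_id, triggered_at)
    runs: list[tuple[str, int]],        # (test_case_id, ran_at)
    batch_size: int = BATCH_SIZE,
) -> dict[str, int]:
    """Count, per candidate, the runs strictly after that candidate's own trigger."""
    counts: dict[str, int] = {}
    for batch in batches(candidates, batch_size):
        ids = [test_case_id for test_case_id, _ in batch]
        cutoffs = [triggered_at for _, triggered_at in batch]  # aligned with ids
        batch_counts: Counter = Counter()
        for test_case_id, ran_at in runs:
            if test_case_id not in ids:
                continue
            # Mirrors arrayElement(cutoffs, indexOf(ids, test_case_id)).
            if ran_at > cutoffs[ids.index(test_case_id)]:
                batch_counts[test_case_id] += 1
        # One entry per candidate with matching runs, like GROUP BY test_case_id.
        counts.update(batch_counts)
    return counts


candidates = [("a", 100), ("b", 200), ("c", 50)]
runs = [("a", 150), ("a", 90), ("b", 150), ("b", 250), ("b", 300), ("c", 60), ("d", 500)]
print(runs_since_trigger(candidates, runs))
```

In this sketch a candidate with no qualifying runs is simply absent from the result, so a caller would treat a missing key as zero. The batch size changes how many ids travel in one request but not the counts themselves.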

Recovery semantics are unchanged (the per-candidate count is still exact); only the scale characteristics change. With the recovery query bounded, the cursor advances normally and an already-wedged alert self-heals on its next tick, so no cursor-side change is needed.

How to test locally

cd server && mix test test/tuist/automations/workers/alert_evaluation_worker_test.exs

This includes a new real-ClickHouse integration test, "rolling recovery counts only runs after each candidate's own trigger", that seeds test_case_runs and executes the rewritten query end to end. It asserts the per-candidate cutoff is honored: a candidate whose extra runs fall after the batch’s earliest trigger but before its own trigger is correctly not counted, so it stays below the rolling window. The other recovery tests mock the repo, so this is the only test that exercises the actual SQL.
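The scenario that test guards against can be shown with made-up numbers. The sketch below is illustrative only: the ids, timestamps, and the `ROLLING_WINDOW` value are assumptions, and `count_after` is a hypothetical helper, not a function from the codebase. It compares a single shared cutoff (the batch's earliest trigger) with each candidate's own cutoff.

```python
ROLLING_WINDOW = 3  # hypothetical number of runs needed to fill the window


def count_after(runs: list[tuple[str, int]], test_case_id: str, cutoff: int) -> int:
    """Count runs of one test case strictly after `cutoff`."""
    return sum(1 for tc, ran_at in runs if tc == test_case_id and ran_at > cutoff)


# Two candidates in the same batch, triggered at different times.
triggers = {"early": 100, "late": 300}
runs = [
    ("early", 150), ("early", 200),
    ("late", 150), ("late", 250),  # after the batch's earliest trigger, before "late"'s own
    ("late", 350),                 # the only run after "late"'s own trigger
]

earliest = min(triggers.values())
shared = {tc: count_after(runs, tc, earliest) for tc in triggers}
exact = {tc: count_after(runs, tc, cutoff) for tc, cutoff in triggers.items()}

print(shared)  # counts with one cutoff for the whole batch
print(exact)   # counts with each candidate's own cutoff
```

With the shared cutoff, "late" would reach the window of 3 on the strength of runs that predate its own trigger. With its own cutoff it has 1 run and stays below the window, which is the behavior the integration test asserts.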

Verifying impact after deploy

The previously-dormant high-recovery alerts should resume producing triggered/recovered events in automation_alert_events, and chronically flaky tests should get re-muted while they remain flaky.
