fix(server): scale flaky-alert rolling recovery query so alerts don’t wedge
GitHub issue · Closed
Purpose
Recovery-enabled rolling flaky-test alerts were silently going dormant: they stopped marking, muting, and recovering tests entirely, while the rest of the automation engine kept running. The highest-volume alerts (those quarantining the most tests) were the ones that died.
Root cause
Rolling-window alerts that have a baseline are not cadence-scheduled; they run only through the scoped/incremental evaluation path, and only recovery_enabled rolling alerts execute batch_runs_since_trigger. That query had two problems that both scale with the number of quarantined tests:
- It used where: r.test_case_id in ^ids, the IN ^list form that overflows ClickHouse’s request limits for large sets. This is exactly why Tests.test_case_ids_with_successful_default_branch_run batches with an Array(UUID) parameter instead.
- It streamed every run since the oldest candidate’s triggered_at back to the app to count in Elixir: millions of rows for a long-muted, high-frequency test.
When the query raised, the scoped cursor (last_scoped_evaluation_inserted_at) — which is advanced only after all chunks process successfully — never moved. The next tick then re-fetched an ever-growing window since the last success, raised again, and so on: a permanent, worsening wedge. Because only recovery-enabled rolling alerts hit this query, low-volume and recovery-disabled alerts were unaffected, which matched the observed pattern (high-recovery-count alerts dead, everything else healthy).
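The wedge described above can be modelled with a minimal sketch. Everything here is hypothetical: WedgeSketch, tick/3, and the integer timestamps are illustrative stand-ins, and the failing query is modelled as an error tuple rather than a raised exception. The only behaviour taken from the description is that the cursor moves only after a fully successful pass.

```elixir
defmodule WedgeSketch do
  # Hypothetical model of one scoped-evaluation tick.
  # `cursor` stands in for last_scoped_evaluation_inserted_at, `events` are
  # integer insertion timestamps, and `run_query` stands in for the recovery query.
  def tick(cursor, events, run_query) do
    # Everything inserted since the last successful pass is re-fetched.
    window = Enum.filter(events, fn inserted_at -> inserted_at > cursor end)

    case run_query.(window) do
      # Success: the cursor advances to the newest event seen.
      :ok -> {:advanced, Enum.max([cursor | window]), length(window)}
      # Failure: the cursor stays put, so the next window only grows.
      {:error, _reason} -> {:stuck, cursor, length(window)}
    end
  end
end

always_raises = fn _window -> {:error, :request_too_large} end
bounded = fn _window -> :ok end

# Two failing ticks: the cursor never moves while the window grows from 3 to 5.
{:stuck, 0, 3} = WedgeSketch.tick(0, [1, 2, 3], always_raises)
{:stuck, 0, 5} = WedgeSketch.tick(0, [1, 2, 3, 4, 5], always_raises)

# Once the query is bounded, a single tick catches up and the cursor advances.
{:advanced, 5, 5} = WedgeSketch.tick(0, [1, 2, 3, 4, 5], bounded)
```

The last line mirrors why a wedged alert can recover without touching the cursor logic: the first successful pass consumes the whole backlog.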
What changed
batch_runs_since_trigger now:
- Aggregates the run count inside ClickHouse (GROUP BY test_case_id, one row per candidate) instead of streaming raw runs back to count in Elixir.
- Applies each candidate’s own triggered_at cutoff exactly, via toUnixTimestamp64Micro(ran_at) > arrayElement(cutoffs, indexOf(ids, test_case_id)), where cutoffs is positionally aligned with the id list.
- Processes candidates in batches (@recovery_candidate_batch_size) using the Array(UUID) fragment + multipart: true, so the parameter and scan stay within ClickHouse’s request limits regardless of how many tests an alert has quarantined.
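As a rough illustration of the batching and the positional alignment between ids and cutoffs, here is an in-memory sketch. It is not the actual implementation: RecoveryBatchSketch, the batch size of 2, the string ids, and the tuple shapes are all assumptions, and counts/2 only imitates in plain Elixir what the aggregated ClickHouse query is described as returning.

```elixir
defmodule RecoveryBatchSketch do
  # Hypothetical batch size; the real value of @recovery_candidate_batch_size
  # is not stated in the description.
  @batch_size 2

  # candidates: [{test_case_id, triggered_at_micros}]
  # Returns one {ids, cutoffs} pair per batch, where the cutoff at position i
  # belongs to the id at position i.
  def batches(candidates) do
    candidates
    |> Enum.chunk_every(@batch_size)
    |> Enum.map(&Enum.unzip/1)
  end

  # In-memory stand-in for the aggregated query: one entry per candidate,
  # counting only runs strictly after that candidate's own cutoff.
  # runs: [{test_case_id, ran_at_micros}]
  def counts({ids, cutoffs}, runs) do
    cutoff_by_id = Map.new(Enum.zip(ids, cutoffs))

    runs
    |> Enum.filter(fn {id, ran_at} ->
      Map.has_key?(cutoff_by_id, id) and ran_at > Map.fetch!(cutoff_by_id, id)
    end)
    |> Enum.frequencies_by(fn {id, _ran_at} -> id end)
  end
end

candidates = [{"a", 100}, {"b", 200}, {"c", 300}]
[first_batch, _second_batch] = RecoveryBatchSketch.batches(candidates)
# first_batch is {["a", "b"], [100, 200]}

runs = [{"a", 150}, {"a", 100}, {"b", 150}, {"b", 250}]
%{"a" => 1, "b" => 1} = RecoveryBatchSketch.counts(first_batch, runs)
```

Like a GROUP BY, the sketch produces no entry for a candidate with zero qualifying runs, so a caller would need to treat a missing id as a count of 0.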
Recovery semantics are unchanged (the per-candidate count is still exact); only the scale characteristics change. With the recovery query bounded, the cursor advances normally and an already-wedged alert self-heals on its next tick, so no cursor-side change is needed.
How to test locally
cd server && mix test test/tuist/automations/workers/alert_evaluation_worker_test.exs
This includes a new real-ClickHouse integration test ("rolling recovery counts only runs after each candidate's own trigger") that seeds test_case_runs and executes the rewritten query end to end. It asserts that the per-candidate cutoff is honored: a candidate whose extra runs fall after the batch’s earliest trigger but before its own trigger does not have those runs counted, so it stays below the rolling window. The other recovery tests mock the repo, so this is the only test that exercises the actual SQL.
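The scenario the integration test guards against can be shown with made-up numbers. The sketch below is not the test itself: CutoffSketch, the ids, and the timestamps are invented for illustration, and it only contrasts counting against the batch's earliest trigger with counting against the candidate's own trigger.

```elixir
defmodule CutoffSketch do
  # Counts runs for `id` that happened strictly after `cutoff`.
  # runs: [{test_case_id, ran_at}]
  def count_after(runs, id, cutoff) do
    Enum.count(runs, fn {run_id, ran_at} -> run_id == id and ran_at > cutoff end)
  end
end

# Two candidates in one batch: "early" triggered at 100, "late" at 500.
triggers = %{"early" => 100, "late" => 500}
earliest = triggers |> Map.values() |> Enum.min()

# "late" has two runs between the batch's earliest trigger and its own
# trigger, and one run after its own trigger.
runs = [{"late", 200}, {"late", 300}, {"late", 600}, {"early", 150}]

# Counting against the shared earliest trigger includes the two extra runs.
shared = CutoffSketch.count_after(runs, "late", earliest)

# Counting against the candidate's own trigger excludes them.
own = CutoffSketch.count_after(runs, "late", triggers["late"])

{3, 1} = {shared, own}
```

With a hypothetical rolling window of 3 runs, the shared cutoff would let "late" look recovered, while its own cutoff keeps it below the window.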
Verifying impact after deploy
The previously-dormant high-recovery alerts should resume producing triggered/recovered events in automation_alert_events, and chronically flaky tests should get re-muted while they remain flaky.