fix(server): optimize flaky alert rolling evaluations
GitHub issue · Closed
Summary
This keeps the max rolling window at 1,000, but removes the expensive full-project ClickHouse scan from the steady-state path for flaky-test alerts.
- Add bucketed recent-run aggregate tables for 100, 250, 500, and 750 run windows, while keeping the existing 1,000-run aggregate as the fallback for large windows.
- Route arbitrary rolling-window integers to the smallest bucket that can satisfy them, then still slice to the exact configured value. For example, 247 reads the 250 bucket and slices to 247; 983 reads the existing 1,000 bucket and slices to 983.
- Replace the transient pending-test-case table with a compact cursor on `automation_alerts.last_scoped_evaluation_inserted_at`.
- After test case run ingestion, enqueue one unique delayed Oban job per enabled flaky alert, debounced until after the ClickHouse flush interval. The job does not store test case IDs in Postgres.
- Once an alert has a baseline, the worker queries `test_case_runs_by_inserted_at` for distinct test case IDs inserted since the alert cursor, with a small overlap window, then evaluates those IDs in 1,000-ID chunks through the existing scoped monitor path.
- Initial baselines still run as full evaluations. Monitor definition changes reset the baseline, so changed definitions get one fresh full pass before returning to scoped incremental evaluation.
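The bucket routing described above can be sketched in a few lines. This is an illustrative Python sketch, not the code in this PR: the names `BUCKETS`, `select_bucket`, and `rolling_slice` are hypothetical, and it assumes each bucket holds a test case's runs newest-first.

```python
from bisect import bisect_left

# Bucket sizes from the summary; the 1,000 bucket is the existing aggregate.
BUCKETS = (100, 250, 500, 750, 1000)


def select_bucket(window: int) -> int:
    """Return the smallest bucket size that can cover `window` runs."""
    if not 1 <= window <= BUCKETS[-1]:
        raise ValueError(f"window must be between 1 and {BUCKETS[-1]}")
    # bisect_left finds the first bucket that is >= window.
    return BUCKETS[bisect_left(BUCKETS, window)]


def rolling_slice(bucket_runs: list, window: int) -> list:
    """Trim a bucket's runs (assumed newest-first) to the exact window."""
    return bucket_runs[:window]
```

With these definitions, `select_bucket(247)` returns `250` and `select_bucket(983)` returns `1000`, matching the examples in the list above. The slice step is what keeps the result exact for the configured integer.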
Why
The slow production query was paying two costs repeatedly:
- It read oversized 1,000-entry aggregate states even for small windows like Last 75.
- It scanned every test case in the project on each scheduled rolling evaluation, even though only test cases with newly ingested runs can change after a baseline exists.
The bucketed aggregates reduce state size for small and medium rolling windows. The cursor-based scoped evaluator handles the larger steady-state issue by turning “evaluate this whole project again” into “discover which test cases changed since the last alert evaluation, then evaluate only those cases.”
We explored a Postgres pending-ID table, but dropped it in favor of the cursor approach. The cursor avoids writing one pending row per affected test case for every ingestion burst, relies on Oban for per-alert debounce, and uses the existing ClickHouse `test_case_runs_by_inserted_at` materialized view to discover changed IDs cheaply.
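As a rough illustration of the cursor-based discovery step, the Python sketch below uses an in-memory list of `(test_case_id, inserted_at)` rows as a stand-in for the ClickHouse view. The function names and the 30-second `OVERLAP` are assumptions; the PR only states that the overlap window is small and that IDs are evaluated in 1,000-ID chunks.

```python
from datetime import datetime, timedelta

CHUNK_SIZE = 1000                # chunk size stated in the summary
OVERLAP = timedelta(seconds=30)  # hypothetical value for the overlap window


def changed_ids(rows, cursor: datetime) -> list:
    """Return sorted distinct test case IDs inserted since cursor - OVERLAP.

    `rows` is an iterable of (test_case_id, inserted_at) pairs.
    """
    since = cursor - OVERLAP
    return sorted({tc_id for tc_id, inserted_at in rows if inserted_at >= since})


def chunked(ids: list, size: int = CHUNK_SIZE):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

Reading slightly behind the cursor means some IDs may be evaluated twice. That is harmless here because the monitor query, not the cursor query, computes the alert result for each ID.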
Expected Query Performance
The production example was a Last 75 rolling query that read and merged 1,000-entry states for about 300k project test cases: about 1.47 GiB read, about 4 GiB memory, and about 1.4-1.65s runtime.
For full baseline evaluations, the bucketed aggregate state should scale roughly with the bucket size:
| Configured window | Query route | Exact result? | Expected full-project query impact |
|---|---|---|---|
| Last 50 | 100-run bucket, slice to 50 | Yes | Reads about 10% of the old aggregate state, about 150 MiB instead of 1.47 GiB. |
| Last 75 | 100-run bucket, slice to 75 | Yes | Same 10x state-size reduction. This is the production alert case and should move from the 1.4-1.65s range to low hundreds of ms if fixed ClickHouse overhead does not dominate. |
| Last 100 | 100-run bucket | Yes | Same 10x state-size reduction, with no post-bucket overread. |
| Last 247 | 250-run bucket, slice to 247 | Yes | Reads about 25% of the old aggregate state, about 375 MiB instead of 1.47 GiB. |
| Last 250 | 250-run bucket | Yes | Same roughly 4x reduction, with no post-bucket overread. |
| Last 251 | 500-run bucket, slice to 251 | Yes | Reads about 50% of the old aggregate state, about 750 MiB instead of 1.47 GiB. |
| Last 500 | 500-run bucket | Yes | Same roughly 2x reduction, with no post-bucket overread. |
| Last 501 | 750-run bucket, slice to 501 | Yes | Reads about 75% of the old aggregate state, about 1.1 GiB instead of 1.47 GiB. |
| Last 750 | 750-run bucket | Yes | Same roughly 1.3x reduction, with no post-bucket overread. |
| Last 751-1,000, for example Last 983 | Existing 1,000-run table, slice to configured window | Yes | No bucket-level I/O win for a full baseline query; it still needs the 1,000-entry state. The steady-state win comes from scoped evaluation. |
For established alerts, query cost should scale with affected test cases instead of all project test cases:
| Example | Approximate state read vs old full-project scan |
|---|---|
| Last 75, 5k affected cases out of 300k | About 1.47 GiB * 10% * 5k / 300k, roughly 2.5 MiB of aggregate state plus fixed overhead. |
| Last 75, 50k affected cases out of 300k | Roughly 25 MiB of aggregate state plus fixed overhead. |
| Last 983, 5k affected cases out of 300k | About 1.47 GiB * 5k / 300k, roughly 25 MiB plus fixed overhead. |
| Last 983, 50k affected cases out of 300k | Roughly 250 MiB plus fixed overhead. |
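The estimates in both tables follow the same proportional model: the old full-project read, scaled by bucket size over 1,000 and by affected cases over total cases. A small Python sketch of that arithmetic, with hypothetical names and ignoring fixed ClickHouse overhead:

```python
OLD_FULL_SCAN_MIB = 1.47 * 1024  # about 1.47 GiB read by the old Last 75 query
BUCKETS = (100, 250, 500, 750, 1000)


def estimated_state_mib(window: int, affected_cases: int,
                        total_cases: int = 300_000) -> float:
    """Estimate aggregate state read in MiB under the proportional model."""
    if not 1 <= window <= BUCKETS[-1]:
        raise ValueError("window out of range")
    # Smallest bucket that can cover the configured window.
    bucket = next(b for b in BUCKETS if b >= window)
    return OLD_FULL_SCAN_MIB * (bucket / 1000) * (affected_cases / total_cases)
```

For example, `estimated_state_mib(75, 5_000)` gives roughly 2.5 MiB and `estimated_state_mib(983, 50_000)` roughly 250 MiB. Passing `affected_cases=300_000` reproduces the full-baseline rows of the first table.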
In the worst case, if a burst touches most of the project, the scoped job can approach the old project-wide cost for high windows like Last 983. The important difference is that this cost is tied to actual ingestion volume and debounced per alert, instead of being paid by a project-wide scan on every scheduled evaluation even when nothing changed.
Accuracy
The configured rolling window remains exact. Bucket selection only chooses the minimum stored aggregate that can cover the requested window; the SQL still slices to the configured integer before computing flakiness or flaky-run count.
The cursor query does not decide alert state. It only identifies test cases with new runs since the last scoped pass. The existing monitor query still computes the exact alert result for those IDs, and transitions are diffed against active alert events for the same scoped IDs so unrelated active alerts are not recovered accidentally.
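The scoped diffing rule can be illustrated with plain sets. This Python sketch is an assumption about the shape of the logic, not the PR's code; `scoped_transitions` and its parameters are hypothetical names.

```python
def scoped_transitions(scoped_ids, firing_ids, active_ids):
    """Compute alert transitions for one scoped pass.

    scoped_ids: test case IDs evaluated in this pass.
    firing_ids: IDs the monitor query reports as meeting the alert condition.
    active_ids: IDs with an active alert event, across the whole project.
    Returns (to_trigger, to_recover), both restricted to the scoped IDs.
    """
    scoped = set(scoped_ids)
    firing = set(firing_ids) & scoped
    # Only active events for IDs in this pass take part in the diff.
    active_in_scope = set(active_ids) & scoped
    to_trigger = firing - active_in_scope
    to_recover = active_in_scope - firing
    return to_trigger, to_recover
```

An active alert for an ID outside `scoped_ids` never appears in `to_recover`, which is the property that prevents unrelated alerts from being recovered by a scoped pass.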
Validation
- `MIX_ENV=test mix ecto.reset`
- `MIX_ENV=test mix test test/tuist/automations_test.exs:210` -> direct local ClickHouse coverage for the `test_case_runs_by_inserted_at` cursor query
- `MIX_ENV=test mix test test/tuist/automations_test.exs test/tuist/automations/workers/alert_evaluation_worker_test.exs test/tuist/automations/workers/automation_scheduler_test.exs test/tuist/automations/monitors/flaky_tests_monitor_test.exs test/tuist/automations/alerts/alert_test.exs` -> 104 tests, 0 failures
- `MIX_ENV=test mix credo` -> no issues
- `git diff --check`
- Earlier validation on this PR: local ClickHouse schema check for the 100, 250, 500, 750, and 1,000 recent-run aggregate tables