fix(server): optimize flaky alert rolling evaluations

Source: tuist/tuist #10816
Updated: Jun 24, 2026

Summary

This keeps the maximum rolling window at 1,000 runs, but removes the expensive full-project ClickHouse scan from the steady-state path for flaky-test alerts. The changes are:

- Add bucketed recent-run aggregate tables for 100, 250, 500, and 750 run windows, while keeping the existing 1,000-run aggregate as the fallback for large windows.
- Route arbitrary rolling-window integers to the smallest bucket that can satisfy them, then still slice to the exact configured value. For example, 247 reads the 250 bucket and slices to 247; 983 reads the existing 1,000 bucket and slices to 983.
- Replace the transient pending-test-case table with a compact cursor on `automation_alerts.last_scoped_evaluation_inserted_at`.
- After test case run ingestion, enqueue one unique delayed Oban job per enabled flaky alert, debounced until after the ClickHouse flush interval. The job does not store test case IDs in Postgres.
- Once an alert has a baseline, the worker queries `test_case_runs_by_inserted_at` for distinct test case IDs inserted since the alert cursor, with a small overlap window, then evaluates those IDs in 1,000-ID chunks through the existing scoped monitor path.
- Initial baselines still run as full evaluations. Monitor definition changes reset the baseline, so changed definitions get one fresh full pass before returning to scoped incremental evaluation.
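
The routing rule described above can be sketched in a few lines. This is an illustrative Python sketch, not the server's Elixir implementation; the names `BUCKETS`, `select_bucket`, and `route` are hypothetical, and only the bucket sizes and the 1,000 maximum come from this description.

```python
# Bucket sizes named in the summary; 1,000 is the existing fallback aggregate.
BUCKETS = (100, 250, 500, 750, 1000)
MAX_WINDOW = 1000


def select_bucket(window: int) -> int:
    """Return the smallest bucket size that can satisfy `window`."""
    if not 1 <= window <= MAX_WINDOW:
        raise ValueError(f"rolling window must be in 1..{MAX_WINDOW}, got {window}")
    for size in BUCKETS:
        if window <= size:
            return size
    raise AssertionError("unreachable: MAX_WINDOW is the largest bucket")


def route(window: int) -> tuple[int, int]:
    """Return (bucket to read, number of runs to slice to)."""
    return select_bucket(window), window


# Examples taken from the summary text:
# route(247) -> (250, 247)
# route(983) -> (1000, 983)
```

The slice length is always the configured window itself, so the bucket choice affects only how much aggregate state is read.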

Why

The slow production query was paying two costs repeatedly:

- It read oversized 1,000-entry aggregate states even for small windows like Last 75.
- It scanned every test case in the project on each scheduled rolling evaluation, even though only test cases with newly ingested runs can change after a baseline exists.

The bucketed aggregates reduce state size for small and medium rolling windows. The cursor-based scoped evaluator handles the larger steady-state issue by turning “evaluate this whole project again” into “discover which test cases changed since the last alert evaluation, then evaluate only those cases.”

We explored a Postgres pending-ID table, but dropped it in favor of the cursor approach. The cursor avoids writing one pending row per affected test case for every ingestion burst, relies on Oban for per-alert debounce, and uses the existing ClickHouse test_case_runs_by_inserted_at materialized view to discover changed IDs cheaply.
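
As a rough illustration of the cursor approach, the sketch below models the discovery step in plain Python over an in-memory list of `(test_case_id, inserted_at)` rows. The real worker queries ClickHouse; the function names and the 30-second `OVERLAP` value are assumptions made for the example (the description only says the overlap is small). The 1,000-ID chunk size is the one stated above.

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator

CHUNK_SIZE = 1000                 # chunk size stated in the summary
OVERLAP = timedelta(seconds=30)   # hypothetical value; the source only says "small"


def changed_ids_since(
    rows: Iterable[tuple[str, datetime]],
    cursor: datetime,
    overlap: timedelta = OVERLAP,
) -> list[str]:
    """Distinct test case IDs with runs inserted at or after (cursor - overlap)."""
    since = cursor - overlap
    return sorted({tc_id for tc_id, inserted_at in rows if inserted_at >= since})


def chunked(ids: list, size: int = CHUNK_SIZE) -> Iterator[list]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def evaluate_scoped(rows, cursor, evaluate_chunk) -> int:
    """Run `evaluate_chunk` on each chunk of changed IDs; return the chunk count."""
    count = 0
    for chunk in chunked(changed_ids_since(rows, cursor)):
        evaluate_chunk(chunk)  # stands in for the existing scoped monitor path
        count += 1
    return count
```

Nothing is written per affected test case in this model: the only state carried between passes is the cursor timestamp, which is the property the paragraph above relies on.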

Expected Query Performance

The production example was a Last 75 rolling query that read and merged 1,000-entry states for about 300k project test cases: about 1.47 GiB read, about 4 GiB memory, and about 1.4-1.65s runtime.

For full baseline evaluations, the bucketed aggregate state should scale roughly with the bucket size:

| Configured window | Query route | Exact result? | Expected full-project query impact |
| --- | --- | --- | --- |
| Last 50 | 100-run bucket, slice to 50 | Yes | Reads about 10% of the old aggregate state, about 150 MiB instead of 1.47 GiB. |
| Last 75 | 100-run bucket, slice to 75 | Yes | Same 10x state-size reduction. This is the production alert case and should move from the 1.4-1.65s range to low hundreds of ms if fixed ClickHouse overhead does not dominate. |
| Last 100 | 100-run bucket | Yes | Same 10x state-size reduction, with no post-bucket overread. |
| Last 247 | 250-run bucket, slice to 247 | Yes | Reads about 25% of the old aggregate state, about 375 MiB instead of 1.47 GiB. |
| Last 250 | 250-run bucket | Yes | Same roughly 4x reduction, with no post-bucket overread. |
| Last 251 | 500-run bucket, slice to 251 | Yes | Reads about 50% of the old aggregate state, about 750 MiB instead of 1.47 GiB. |
| Last 500 | 500-run bucket | Yes | Same roughly 2x reduction, with no post-bucket overread. |
| Last 501 | 750-run bucket, slice to 501 | Yes | Reads about 75% of the old aggregate state, about 1.1 GiB instead of 1.47 GiB. |
| Last 750 | 750-run bucket | Yes | Same roughly 1.3x reduction, with no post-bucket overread. |
| Last 751-1,000, for example Last 983 | Existing 1,000-run table, slice to configured window | Yes | No bucket-level I/O win for a full baseline query; it still needs the 1,000-entry state. The steady-state win comes from scoped evaluation. |

For established alerts, query cost should scale with affected test cases instead of all project test cases:

| Example | Approximate state read vs old full-project scan |
| --- | --- |
| Last 75, 5k affected cases out of 300k | About `1.47 GiB * 10% * 5k / 300k`, roughly 2.5 MiB of aggregate state plus fixed overhead. |
| Last 75, 50k affected cases out of 300k | Roughly 25 MiB of aggregate state plus fixed overhead. |
| Last 983, 5k affected cases out of 300k | About `1.47 GiB * 5k / 300k`, roughly 25 MiB plus fixed overhead. |
| Last 983, 50k affected cases out of 300k | Roughly 250 MiB plus fixed overhead. |
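
The estimates in both tables follow one back-of-the-envelope formula: old full-scan size, times the bucket fraction, times the share of affected test cases. The sketch below reproduces that arithmetic. It assumes state size scales linearly with bucket size and with the number of test cases read, and it ignores fixed ClickHouse overhead; the function name is hypothetical.

```python
OLD_FULL_SCAN_GIB = 1.47      # read size of the production Last 75 query
TOTAL_CASES = 300_000         # approximate project test case count
BUCKETS = (100, 250, 500, 750, 1000)


def estimated_state_mib(window: int, affected_cases: int,
                        total_cases: int = TOTAL_CASES) -> float:
    """Approximate aggregate state read, in MiB, under linear scaling."""
    bucket = next(size for size in BUCKETS if window <= size)
    full_scan_mib = OLD_FULL_SCAN_GIB * 1024
    bucket_fraction = bucket / 1000           # e.g. 0.10 for the 100-run bucket
    affected_fraction = affected_cases / total_cases
    return full_scan_mib * bucket_fraction * affected_fraction


if __name__ == "__main__":
    for window, affected in [(75, 5_000), (75, 50_000), (983, 5_000), (983, 50_000)]:
        print(f"Last {window}, {affected} affected: "
              f"{estimated_state_mib(window, affected):.1f} MiB")
```

Passing `affected_cases=TOTAL_CASES` gives the full-baseline figures from the first table, for example about 150 MiB for Last 75.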

Worst case, if a burst touches most of the project, the scoped job can approach the old project-wide cost for high windows like Last 983. The important difference is that this is tied to actual ingestion volume and debounced per alert, instead of running the project-wide scan continuously on cadence when nothing changed.

Accuracy

The configured rolling window remains exact. Bucket selection only chooses the minimum stored aggregate that can cover the requested window; the SQL still slices to the configured integer before computing flakiness or flaky-run count.
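
A small model shows why the bucket choice cannot change the result. It assumes, for illustration only, that each aggregate holds a test case's most recent N runs ordered newest first; the actual ClickHouse state layout is not described here, and the function names are hypothetical.

```python
import random


def recent_runs_state(runs_newest_first: list, bucket_size: int) -> list:
    """Model of a bucketed aggregate: the newest `bucket_size` runs."""
    return runs_newest_first[:bucket_size]


def flaky_run_count(state: list, window: int) -> int:
    """Slice the stored state to the configured window, then count flaky runs."""
    return sum(state[:window])


if __name__ == "__main__":
    random.seed(0)
    runs = [random.random() < 0.1 for _ in range(1200)]  # True means a flaky run
    small = flaky_run_count(recent_runs_state(runs, 250), 247)
    large = flaky_run_count(recent_runs_state(runs, 1000), 247)
    print(small == large)
```

Any bucket at least as large as the window contains the same newest 247 runs, so slicing either state to 247 yields the same count.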

The cursor query does not decide alert state. It only identifies test cases with new runs since the last scoped pass. The existing monitor query still computes the exact alert result for those IDs, and transitions are diffed against active alert events for the same scoped IDs so unrelated active alerts are not recovered accidentally.
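
The scoped diff can be expressed with set operations. The sketch below is a hypothetical Python model of that rule, not the worker's code: `scoped_ids` are the IDs found by the cursor query, `firing_ids` are the IDs the monitor query reports as breaching, and `active_ids` are the IDs with an active alert event.

```python
def diff_transitions(scoped_ids, firing_ids, active_ids):
    """Return (to_trigger, to_recover), restricted to the scoped IDs."""
    scoped = set(scoped_ids)
    firing = set(firing_ids) & scoped
    # Only active alerts inside the scope are eligible for recovery.
    active_in_scope = set(active_ids) & scoped
    to_trigger = firing - active_in_scope
    to_recover = active_in_scope - firing
    return to_trigger, to_recover


if __name__ == "__main__":
    trigger, recover = diff_transitions(
        scoped_ids={1, 2, 3},
        firing_ids={1, 2},
        active_ids={2, 3, 99},  # 99 had no new runs, so it is out of scope
    )
    print(trigger, recover)
```

Because recovery candidates are intersected with the scope first, an active alert for an untouched test case (ID 99 in the example) is never recovered by a scoped pass.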

Validation

- `MIX_ENV=test mix ecto.reset`
- `MIX_ENV=test mix test test/tuist/automations_test.exs:210` -> direct local ClickHouse coverage for the `test_case_runs_by_inserted_at` cursor query
- `MIX_ENV=test mix test test/tuist/automations_test.exs test/tuist/automations/workers/alert_evaluation_worker_test.exs test/tuist/automations/workers/automation_scheduler_test.exs test/tuist/automations/monitors/flaky_tests_monitor_test.exs test/tuist/automations/alerts/alert_test.exs` -> 104 tests, 0 failures
- `MIX_ENV=test mix credo` -> no issues
- `git diff --check`
- Earlier validation on this PR: local ClickHouse schema check for the 100, 250, 500, 750, and 1,000 recent-run aggregate tables

Comments

tuist-atlas[bot] Jun 10, 2026

The flaky alert rolling evaluations optimization is now available in version xcresult-processor-image@0.14.1. To use these improvements, update to this version.