fix(server): optimize flaky alert rolling evaluations

GitHub issue · Closed

Source: tuist/tuist #10816
Updated: Jun 24, 2026

Summary

This keeps the max rolling window at 1,000, but removes the expensive full-project ClickHouse scan from the steady-state path for flaky-test alerts.

  • Add bucketed recent-run aggregate tables for 100, 250, 500, and 750 run windows, while keeping the existing 1,000-run aggregate as the fallback for large windows.
  • Route arbitrary rolling-window integers to the smallest bucket that can satisfy them, then still slice to the exact configured value. For example, 247 reads the 250 bucket and slices to 247; 983 reads the existing 1,000 bucket and slices to 983.
  • Replace the transient pending-test-case table with a compact cursor on automation_alerts.last_scoped_evaluation_inserted_at.
  • After test case run ingestion, enqueue one unique delayed Oban job per enabled flaky alert, debounced until after the ClickHouse flush interval. The job does not store test case IDs in Postgres.
  • Once an alert has a baseline, the worker queries test_case_runs_by_inserted_at for distinct test case IDs inserted since the alert cursor, with a small overlap window, then evaluates those IDs in 1,000-ID chunks through the existing scoped monitor path.
  • Initial baselines still run as full evaluations. Monitor definition changes reset the baseline, so changed definitions get one fresh full pass before returning to scoped incremental evaluation.
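
The routing rule in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the names `BUCKETS`, `select_bucket`, and `slice_to_window` are hypothetical, and the slice assumes runs are ordered newest first.

```python
from bisect import bisect_left

# Bucket sizes named in the summary; the 1,000-run aggregate is the fallback.
BUCKETS = [100, 250, 500, 750, 1000]
MAX_WINDOW = 1000


def select_bucket(window: int) -> int:
    """Return the smallest bucket that can cover `window` runs."""
    if not 1 <= window <= MAX_WINDOW:
        raise ValueError(f"window must be in 1..{MAX_WINDOW}, got {window}")
    # bisect_left finds the first bucket that is >= window.
    return BUCKETS[bisect_left(BUCKETS, window)]


def slice_to_window(recent_runs: list, window: int) -> list:
    """Keep only the `window` most recent runs (assumes newest first)."""
    return recent_runs[:window]


if __name__ == "__main__":
    for window in (75, 247, 251, 983):
        print(window, "->", select_bucket(window))
```

With these assumptions, 247 routes to the 250 bucket and 983 to the 1,000 bucket, matching the examples in the summary.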

Why

The slow production query was paying two costs repeatedly:

  • It read oversized 1,000-entry aggregate states even for small windows like Last 75.
  • It scanned every test case in the project on each scheduled rolling evaluation, even though only test cases with newly ingested runs can change after a baseline exists.

The bucketed aggregates reduce state size for small and medium rolling windows. The cursor-based scoped evaluator handles the larger steady-state issue by turning “evaluate this whole project again” into “discover which test cases changed since the last alert evaluation, then evaluate only those cases.”

We explored a Postgres pending-ID table, but dropped it in favor of the cursor approach. The cursor avoids writing one pending row per affected test case for every ingestion burst, relies on Oban for per-alert debounce, and uses the existing ClickHouse test_case_runs_by_inserted_at materialized view to discover changed IDs cheaply.
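
The cursor-based discovery step can be modelled in memory as below. This is a sketch under stated assumptions: the real lookup runs against the ClickHouse test_case_runs_by_inserted_at view, the 30-second `OVERLAP` is a made-up value (the PR only says the overlap is small), and `changed_test_case_ids` and `chunked` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator, List, Tuple

CHUNK_SIZE = 1000                # chunk size stated in the summary
OVERLAP = timedelta(seconds=30)  # hypothetical; the PR only says "small"


def changed_test_case_ids(
    runs: Iterable[Tuple[str, datetime]],
    cursor: datetime,
    overlap: timedelta = OVERLAP,
) -> List[str]:
    """Distinct test case IDs with runs inserted since `cursor - overlap`.

    `runs` is an iterable of (test_case_id, inserted_at) pairs standing in
    for rows of the inserted-at view.
    """
    since = cursor - overlap
    return sorted({tc_id for tc_id, inserted_at in runs if inserted_at >= since})


def chunked(ids: List[str], size: int = CHUNK_SIZE) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

Re-reading a few IDs because of the overlap is harmless in this model, since the cursor query only selects candidates and the monitor query recomputes the exact result for each of them.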

Expected Query Performance

The production example was a Last 75 rolling query that read and merged 1,000-entry states for about 300k project test cases: about 1.47 GiB read, about 4 GiB memory, and about 1.4-1.65s runtime.

For full baseline evaluations, the bucketed aggregate state should scale roughly with the bucket size:

| Configured window | Query route | Exact result? | Expected full-project query impact |
| --- | --- | --- | --- |
| Last 50 | 100-run bucket, slice to 50 | Yes | Reads about 10% of the old aggregate state, about 150 MiB instead of 1.47 GiB. |
| Last 75 | 100-run bucket, slice to 75 | Yes | Same 10x state-size reduction. This is the production alert case and should move from the 1.4-1.65s range to low hundreds of ms if fixed ClickHouse overhead does not dominate. |
| Last 100 | 100-run bucket | Yes | Same 10x state-size reduction, with no post-bucket overread. |
| Last 247 | 250-run bucket, slice to 247 | Yes | Reads about 25% of the old aggregate state, about 375 MiB instead of 1.47 GiB. |
| Last 250 | 250-run bucket | Yes | Same roughly 4x reduction, with no post-bucket overread. |
| Last 251 | 500-run bucket, slice to 251 | Yes | Reads about 50% of the old aggregate state, about 750 MiB instead of 1.47 GiB. |
| Last 500 | 500-run bucket | Yes | Same roughly 2x reduction, with no post-bucket overread. |
| Last 501 | 750-run bucket, slice to 501 | Yes | Reads about 75% of the old aggregate state, about 1.1 GiB instead of 1.47 GiB. |
| Last 750 | 750-run bucket | Yes | Same roughly 1.3x reduction, with no post-bucket overread. |
| Last 751-1,000, for example Last 983 | Existing 1,000-run table, slice to configured window | Yes | No bucket-level I/O win for a full baseline query; it still needs the 1,000-entry state. The steady-state win comes from scoped evaluation. |

For established alerts, query cost should scale with affected test cases instead of all project test cases:

| Example | Approximate state read vs old full-project scan |
| --- | --- |
| Last 75, 5k affected cases out of 300k | About 1.47 GiB * 10% * 5k / 300k, roughly 2.5 MiB of aggregate state plus fixed overhead. |
| Last 75, 50k affected cases out of 300k | Roughly 25 MiB of aggregate state plus fixed overhead. |
| Last 983, 5k affected cases out of 300k | About 1.47 GiB * 5k / 300k, roughly 25 MiB plus fixed overhead. |
| Last 983, 50k affected cases out of 300k | Roughly 250 MiB plus fixed overhead. |
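
The figures in both tables follow one proportional estimate, sketched below. It assumes, as the text does, that state read scales linearly with bucket size and with the number of affected test cases, and it ignores fixed ClickHouse overhead. The function name and parameters are hypothetical.

```python
OLD_STATE_MIB = 1.47 * 1024   # about 1.47 GiB read by the production query
OLD_BUCKET = 1000             # entries per aggregate state before this change
PROJECT_TEST_CASES = 300_000  # about 300k test cases in the production example


def estimated_read_mib(bucket: int, affected: int,
                       total: int = PROJECT_TEST_CASES) -> float:
    """Estimate aggregate state read in MiB, excluding fixed overhead.

    `bucket` is the routed bucket size (100, 250, 500, 750, or 1000).
    Passing affected == total gives the full-baseline estimate.
    """
    return OLD_STATE_MIB * (bucket / OLD_BUCKET) * (affected / total)


if __name__ == "__main__":
    # Full baseline for Last 75 (100-run bucket).
    print(round(estimated_read_mib(100, PROJECT_TEST_CASES), 1))
    # Scoped pass for Last 75 with 5k affected cases.
    print(round(estimated_read_mib(100, 5_000), 1))
    # Scoped pass for Last 983 (1,000-run table) with 50k affected cases.
    print(round(estimated_read_mib(1000, 50_000), 1))
```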

In the worst case, if a burst touches most of the project, the scoped job can approach the old project-wide cost for large windows like Last 983. The important difference is that this cost is tied to actual ingestion volume and debounced per alert, instead of being paid by a project-wide scan on every scheduled evaluation even when nothing changed.

Accuracy

The configured rolling window remains exact. Bucket selection only chooses the minimum stored aggregate that can cover the requested window; the SQL still slices to the configured integer before computing flakiness or flaky-run count.

The cursor query does not decide alert state. It only identifies test cases with new runs since the last scoped pass. The existing monitor query still computes the exact alert result for those IDs, and transitions are diffed against active alert events for the same scoped IDs so unrelated active alerts are not recovered accidentally.
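
The scoped diff described above can be sketched with plain sets. This is an illustrative model only: `diff_scoped_transitions` and its arguments are hypothetical names, and the real worker diffs against stored alert events rather than in-memory sets.

```python
from typing import Iterable, Set, Tuple


def diff_scoped_transitions(
    scoped_ids: Iterable[str],
    matching_ids: Iterable[str],
    active_alert_ids: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (triggered, recovered) restricted to the scoped IDs.

    scoped_ids: test cases evaluated in this pass.
    matching_ids: scoped test cases the monitor query currently flags.
    active_alert_ids: test cases with an active alert event, project-wide.
    """
    scoped = set(scoped_ids)
    matching = set(matching_ids) & scoped
    # Only active alerts inside the scope are eligible for recovery.
    active_in_scope = set(active_alert_ids) & scoped
    triggered = matching - active_in_scope
    recovered = active_in_scope - matching
    return triggered, recovered
```

Because the active set is intersected with the scope first, an active alert for a test case outside the scope can never appear in `recovered`.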

Validation

  • MIX_ENV=test mix ecto.reset
  • MIX_ENV=test mix test test/tuist/automations_test.exs:210 -> direct local ClickHouse coverage for the test_case_runs_by_inserted_at cursor query
  • MIX_ENV=test mix test test/tuist/automations_test.exs test/tuist/automations/workers/alert_evaluation_worker_test.exs test/tuist/automations/workers/automation_scheduler_test.exs test/tuist/automations/monitors/flaky_tests_monitor_test.exs test/tuist/automations/alerts/alert_test.exs -> 104 tests, 0 failures
  • MIX_ENV=test mix credo -> no issues
  • git diff --check
  • Earlier validation on this PR: local ClickHouse schema check for the 100, 250, 500, 750, and 1,000 recent-run aggregate tables
Comments
tuist-atlas[bot] Jun 10, 2026

The flaky alert rolling evaluations optimization is now available in version xcresult-processor-image@0.14.1. To use these improvements, update to this version.