fix(server): flag only the failing run as flaky in cross-run detection

GitHub issue · Closed

Metadata

Source

tuist/tuist #11449

Updated

Jun 24, 2026

Domains

Atlas

Details

What changed

Cross-run flaky detection now flags only the contradicted failure as is_flaky, never its passing siblings. A run is flaky iff it is a failure on a commit/scheme the same test case also passed on (a failure that did not reproduce).

The flaky-group displays used to filter on is_flaky == true, so marking only failures would have zeroed their “passed” counts. To keep them coherent, each view now identifies the flaky group by its is_flaky runs but fetches the full run history of the commit, so the pass/fail breakdown stays accurate:

list_flaky_runs_for_test_case (test case Flaky Runs tab)
get_flaky_run_group_for_test_case_run (single run view)
fetch_cross_run_flaky_runs (PR comment / test-run view)
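
The "identify by is_flaky, then refetch the whole commit" pattern can be sketched as follows. This is a minimal Python illustration, not the Elixir code in the PR: the Run record, the flaky_groups helper, and the "success" status value are all assumptions made for the example (only 'failure' appears in the text).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    # Hypothetical record shape; the real schema is not shown in the PR.
    run_id: int
    scheme: str
    commit: str
    status: str  # "success" (assumed name) or "failure"
    is_flaky: bool


def flaky_groups(runs):
    """Summarize every (scheme, commit) group that contains a flaky run.

    The group is *identified* by its is_flaky runs, but the summary is
    computed over *all* runs of that group, so passes are still counted.
    """
    flaky_keys = {(r.scheme, r.commit) for r in runs if r.is_flaky}
    groups = defaultdict(list)
    for r in runs:
        key = (r.scheme, r.commit)
        if key in flaky_keys:
            groups[key].append(r)
    return {
        key: {
            "passed": sum(r.status == "success" for r in members),
            "failed": sum(r.status == "failure" for r in members),
            "flaky": sum(r.is_flaky for r in members),
        }
        for key, members in groups.items()
    }


runs = [
    Run(1, "App", "abc", "success", False),
    Run(2, "App", "abc", "success", False),
    Run(3, "App", "abc", "failure", True),   # the only flagged run
    Run(4, "App", "def", "success", False),  # clean commit, not a flaky group
]
summary = flaky_groups(runs)
```

Filtering the same data on is_flaky alone would report zero passes for commit abc, which is the display regression this change avoids.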

Why

A customer hit a 700% spike in auto-quarantined tests. Root cause: detection marked every run in a contradicting (test case, commit, scheme) group as flaky, passes included. So a test that ran 20 times on a commit and failed once was recorded as 20 flaky runs, not 1.

That inflation flows into everything that reads is_flaky:

flaky_run_count automation (the customer’s auto-quarantine rule)
flakiness_rate automation + the test-case overview flakiness card
the global flaky dashboards and the flaky_runs_count column

Because the rule counts absolute flaky runs, a single flaky commit that CI retried enough times could trip a “N flaky runs in 7 days” threshold on its own, and a retry / flake-check burst could quarantine a whole suite in one evaluation. It also confused the UI: the Flaky Runs tab groups by commit, so users counted “2 recent flaky commits” while the automation counted ~23 individual flagged runs.
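
The inflation is easy to reproduce with arithmetic. The sketch below is an illustrative Python model, not the server code: flaky_counts and its mark_passes switch are hypothetical names, and "success" is an assumed status value. It compares the old marking (every run in a contradicting group) with the new one (failures only) for a single commit.

```python
def flaky_counts(statuses, mark_passes):
    """Return (flaky_run_count, flakiness_rate) for the runs of one commit.

    mark_passes=True models the old behavior (passes flagged too);
    mark_passes=False models the new behavior (only failures flagged).
    """
    contradicting = "success" in statuses and "failure" in statuses
    flags = [contradicting and (mark_passes or s == "failure") for s in statuses]
    count = sum(flags)
    return count, count / len(statuses)


# The 19-pass / 1-fail commit mentioned in the PR.
statuses = ["success"] * 19 + ["failure"]
old = flaky_counts(statuses, mark_passes=True)
new = flaky_counts(statuses, mark_passes=False)
```

Under the old marking this commit alone contributes 20 flaky runs and a 100% flakiness rate; under the new marking it contributes 1 flaky run and a 5% rate.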

Root cause

check_cross_run_flakiness / filter_cross_run_flaky marked the current run and back-marked the opposite-status historical runs as flaky, regardless of which side was the pass. The new resolve_cross_run_flaky_failures flags only the failure side:

this run failed and the test already passed on the commit → flag this failure
this run passed and the test already failed on the commit → back-mark those earlier failures
otherwise → nothing
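
The three cases above can be sketched as one small decision function. This is an illustrative Python model under stated assumptions, not the actual resolve_cross_run_flaky_failures: the function name, the (run_id, status) tuple shape, the return value, and the "success" status string are all hypothetical.

```python
def resolve_flaky_failures_sketch(current_status, prior_runs):
    """Decide which runs to flag for one (test case, commit, scheme) group.

    prior_runs: list of (run_id, status) already recorded for the group.
    Returns (flag_current_run, prior_run_ids_to_back_mark).
    """
    prior_statuses = {status for _, status in prior_runs}

    # This run failed and the test already passed on the commit.
    if current_status == "failure" and "success" in prior_statuses:
        return True, []

    # This run passed and the test already failed on the commit:
    # back-mark the earlier failures, never the pass itself.
    if current_status == "success" and "failure" in prior_statuses:
        failed_ids = [run_id for run_id, status in prior_runs if status == "failure"]
        return False, failed_ids

    # No contradiction: nothing to flag.
    return False, []
```

In every branch the passing side is left untouched, which is the difference from the previous behavior of marking both sides of a contradiction.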

Design trade-off: what should run-level `is_flaky` mean?

Two models were on the table:

(A) is_flaky = the flaky failure (this PR). A run is flaky only if it is a non-reproducing failure. The passing runs that prove the test can succeed stay clean.
(B) is_flaky = part of a flaky episode (mark both passing and failing runs of a contradicting group), and have the automation count only the failures.

We went with (A). Reasoning:

Industry alignment. Trunk Flaky Tests, Datadog Test Optimization, and BuildPulse treat flakiness as a property of the test, not of individual runs. The runs are evidence: a “flaky run” is a non-reproducing failure; a passing run is just the passing evidence and is never labeled flaky. Tuist already has the test-level status (TestCase.is_flaky, set by the automation); model (A) makes the run-level boolean mean the matching thing (the flaky failure).
Single source of truth. With (A), is_flaky means one thing everywhere, so flaky_run_count, flakiness_rate, the overview card, and the global dashboards are all correct with no per-query status filter. Model (B) keeps passing runs flagged, so every current and future consumer has to remember to add AND status = 'failure'; miss one and that metric is inflated again. That is exactly the bug behind this incident: flakiness_rate = countIf(is_flaky)/count read ~100% for a 19-pass/1-fail commit.
(B) is also more migration, not less. The automation reads the sumState(toUInt8(is_flaky)) aggregate of the per-case daily materialized view (MV), which has no status dimension. To “count only failures” there, you would have to add a flaky-failure aggregate to that MV, to the four test_case_runs_recent_N_per_case bucket MVs, to the full recent-runs MV (whose rolling-window tuples carry (ran_at, is_flaky) and would need the failure bit too), and to the dashboard MV, each with a backfill. Model (A) fixes the existing aggregate for free by changing one ingestion function, with zero schema change.
The one upside of (B), handled. (B)’s appeal is that the flaky episode becomes a one-line is_flaky == true query. We get the same view by grouping flaky failures by (scheme, commit) and refetching the commit’s runs (the three display changes above). If a first-class episode is ever wanted, it should be a separate marker rather than overloading is_flaky to mean two things, which is what bit us here.

This also sets up an occurrence-based threshold (count distinct flaky commits) as a clean follow-up, since is_flaky now cleanly means “a run that flaked.”
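
As a rough illustration of that follow-up, an occurrence-based count could look like the sketch below. It is hypothetical: the PR does not define this threshold, and the function name, the (commit, is_flaky) tuple shape, and the sample data are assumptions. Whether scheme should be part of the key is also left open here.

```python
def flaky_commit_count(runs):
    """Count distinct commits that have at least one flaky run.

    runs: iterable of (commit, is_flaky) pairs inside the evaluation window.
    """
    return len({commit for commit, is_flaky in runs if is_flaky})


# Made-up window: two commits flaked, one of them three times under retries.
window = [
    ("abc", True), ("abc", True), ("abc", True), ("abc", False),
    ("def", True), ("def", False),
    ("123", False),
]
flaky_runs = sum(is_flaky for _, is_flaky in window)
flaky_commits = flaky_commit_count(window)
```

Here the run-based count is 4 while the occurrence-based count is 2, so a retry burst on one commit no longer moves the number that a threshold would compare against.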

Impact

flaky_run_count, flakiness_rate, and the overview card now reflect spurious failures rather than every execution of a flaky commit, so auto-quarantine thresholds behave as users expect.
Repetition-based flakiness (a single run with mixed retries) is unchanged.
A passing run’s detail page no longer shows a “flaky” banner; the flaky group is still fully visible from the failing run and the test case Flaky Runs tab.
Historical data is not backfilled: already-flaky commits keep their inflated counts until they age out of the evaluation window (7-30 days). New ingestion is correct immediately.

Validation

Run in this worktree:

mix test test/tuist/tests_test.exs -> 225/225
flaky monitor, analytics flakiness rate, vcs PR-comment, and the test-case / test-run / flaky-tests / test-case-run LiveView suites -> 79/79
One assertion legitimately flipped (flaky_runs_count 2 -> 1) and was updated; that assertion is the behavior being fixed
mix format and mix credo clean

How to test locally

In CI, run the same test case on one commit several times so that it both passes and fails (a flaky commit).
Open the test case’s Flaky Runs tab: the commit group still shows the full passed/failed split, but only the failing runs are counted as flaky.
Check the flakiness rate / flaky_run_count: the passing runs on that commit no longer inflate them.

🤖 Generated with Claude Code
