fix(server): flag only the failing run as flaky in cross-run detection

GitHub issue · Closed

Metadata
Source: tuist/tuist #11449
Updated: Jun 24, 2026
Domains: Atlas
Details

What changed

Cross-run flaky detection now flags only the contradicted failure as is_flaky, never its passing siblings. A run is flaky iff it is a failure on a commit/scheme the same test case also passed on (a failure that did not reproduce).

To keep the flaky-group displays coherent (they filtered on is_flaky == true, so failure-only marking would have zeroed their “passed” counts), each view now identifies the flaky group by its is_flaky runs but fetches the full run history of the commit so the pass/fail breakdown stays accurate:

  • list_flaky_runs_for_test_case (test case Flaky Runs tab)
  • get_flaky_run_group_for_test_case_run (single run view)
  • fetch_cross_run_flaky_runs (PR comment / test-run view)
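
A minimal sketch of that "identify by is_flaky, then count the whole commit" idea, written in Python for illustration only. The dict keys (scheme, commit, status, is_flaky), the "success" status value, and the helper name flaky_groups are assumptions, not the server's actual schema or functions:

```python
from collections import defaultdict


def flaky_groups(runs):
    """Summarize flaky groups from a list of run dicts.

    Each run is assumed to look like:
    {"scheme": str, "commit": str, "status": "success" | "failure", "is_flaky": bool}
    """
    # Step 1: a (scheme, commit) group is flaky if any of its runs is flagged.
    flaky_keys = {(r["scheme"], r["commit"]) for r in runs if r["is_flaky"]}

    # Step 2: count over the full run history of those groups, not just the
    # flagged runs, so the passed/failed breakdown stays accurate.
    groups = defaultdict(lambda: {"passed": 0, "failed": 0, "flaky": 0})
    for r in runs:
        key = (r["scheme"], r["commit"])
        if key not in flaky_keys:
            continue
        if r["status"] == "success":
            groups[key]["passed"] += 1
        elif r["status"] == "failure":
            groups[key]["failed"] += 1
        if r["is_flaky"]:
            groups[key]["flaky"] += 1
    return dict(groups)
```

Filtering the counted rows on is_flaky alone would report zero passes for every group, which is the incoherence the display changes avoid.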

Why

A customer hit a 700% spike in auto-quarantined tests. Root cause: detection marked every run in a contradicting (test case, commit, scheme) group as flaky, passes included. So a test that ran 20 times on a commit and failed once was recorded as 20 flaky runs, not 1.

That inflation flows into everything that reads is_flaky:

  • flaky_run_count automation (the customer’s auto-quarantine rule)
  • flakiness_rate automation + the test-case overview flakiness card
  • the global flaky dashboards and the flaky_runs_count column

Because the rule counts absolute flaky runs, a single flaky commit that CI retried enough times could trip a “N flaky runs in 7 days” threshold on its own, and a retry / flake-check burst could quarantine a whole suite in one evaluation. It also confused the UI: the Flaky Runs tab groups by commit, so users counted “2 recent flaky commits” while the automation counted ~23 individual flagged runs.
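
To make the inflation concrete, here is a small Python sketch that compares the two marking rules on a commit with 19 passes and 1 failure. The function names, the "success" status value, and the threshold of 5 are hypothetical; only the marking behavior is taken from the description above:

```python
def mark_all_in_group(statuses):
    """Old behavior: if the group has both a pass and a failure, flag every run."""
    contradicting = "success" in statuses and "failure" in statuses
    return [contradicting for _ in statuses]


def mark_failures_only(statuses):
    """New behavior: in a contradicting group, flag only the failures."""
    contradicting = "success" in statuses and "failure" in statuses
    return [contradicting and s == "failure" for s in statuses]


# One commit, retried 20 times by CI: 19 passes and 1 failure.
statuses = ["success"] * 19 + ["failure"]

old_flaky_run_count = sum(mark_all_in_group(statuses))
new_flaky_run_count = sum(mark_failures_only(statuses))

# Hypothetical auto-quarantine rule: quarantine at 5 or more flaky runs.
THRESHOLD = 5
old_trips_rule = old_flaky_run_count >= THRESHOLD
new_trips_rule = new_flaky_run_count >= THRESHOLD
```

Under the old rule this single commit contributes 20 flaky runs and trips the hypothetical threshold on its own; under the new rule it contributes 1 and does not.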

Root cause

check_cross_run_flakiness / filter_cross_run_flaky marked the current run and back-marked the opposite-status historical runs as flaky, regardless of which side was the pass. The new resolve_cross_run_flaky_failures flags only the failure side:

  • this run failed and the test already passed on the commit → flag this failure
  • this run passed and the test already failed on the commit → back-mark those earlier failures
  • otherwise → nothing
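
The three cases can be sketched as follows. This is an illustrative Python version under stated assumptions, not the actual resolve_cross_run_flaky_failures: the Run type, the "success" status value, and the shape of the history argument (earlier runs of the same test case on the same commit and scheme) are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Run:
    id: str
    status: str  # assumed values: "success" or "failure"


def resolve_flaky_failures(current: Run, history: list[Run]) -> list[Run]:
    """Return the runs to flag as flaky. Only failures are ever returned."""
    if current.status == "failure":
        # This run failed and the test already passed on the commit:
        # flag this failure.
        if any(r.status == "success" for r in history):
            return [current]
        return []
    if current.status == "success":
        # This run passed and the test already failed on the commit:
        # back-mark those earlier failures.
        return [r for r in history if r.status == "failure"]
    # Any other status: nothing to flag.
    return []
```

A passing run can trigger back-marking, but it never appears in the returned list itself.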

Design trade-off: what should run-level is_flaky mean?

Two models were on the table:

  • (A) is_flaky = the flaky failure (this PR). A run is flaky only if it is a non-reproducing failure. The passing runs that prove the test can succeed stay clean.
  • (B) is_flaky = part of a flaky episode (mark both passing and failing runs of a contradicting group), and have the automation count only the failures.

We went with (A). Reasoning:

  • Industry alignment. Trunk Flaky Tests, Datadog Test Optimization, and BuildPulse treat flakiness as a property of the test, not of individual runs. The runs are evidence: a “flaky run” is a non-reproducing failure; a passing run is just the passing evidence and is never labeled flaky. Tuist already has the test-level status (TestCase.is_flaky, set by the automation); model (A) makes the run-level boolean mean the matching thing (the flaky failure).

  • Single source of truth. With (A), is_flaky means one thing everywhere, so flaky_run_count, flakiness_rate, the overview card, and the global dashboards are all correct with no per-query status filter. Model (B) keeps passing runs flagged, so every current and future consumer has to remember to add AND status = 'failure'; any consumer that misses the filter reports inflated numbers again. That is exactly the bug that caused this incident (flakiness_rate = countIf(is_flaky)/count read ~100% for a 19-pass/1-fail commit).

  • (B) is also more migration, not less. The automation reads the per-case daily MV’s sumState(toUInt8(is_flaky)) aggregate, which has no status dimension. To “count only failures” there, you would have to add a flaky-failure aggregate to that MV, to the four test_case_runs_recent_N_per_case bucket MVs, to the full recent-runs MV (whose rolling-window tuples carry (ran_at, is_flaky) and would need the failure bit too), and to the dashboard MV, each with a backfill. Model (A) fixes the existing aggregate for free by changing one ingestion function, with zero schema change.

  • The one upside of (B), handled. (B)’s appeal is that the flaky episode becomes a one-line is_flaky == true query. We get the same view by grouping flaky failures by (scheme, commit) and refetching the commit’s runs (the three display changes above). If a first-class episode is ever wanted, it should be a separate marker rather than overloading is_flaky to mean two things, which is what bit us here.
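
The single-source-of-truth argument can be checked with a few lines of arithmetic. The Python sketch below models runs as (status, is_flaky) tuples, which is an assumed shape for illustration, and computes the rate the same way as countIf(is_flaky)/count for a 19-pass/1-fail commit:

```python
def flakiness_rate(runs, failures_only=False):
    """Share of runs counted as flaky. runs is a list of (status, is_flaky) tuples."""
    if failures_only:
        flaky = sum(1 for status, is_flaky in runs if is_flaky and status == "failure")
    else:
        flaky = sum(1 for _, is_flaky in runs if is_flaky)
    return flaky / len(runs)


# Model (A): only the non-reproducing failure is flagged.
model_a = [("success", False)] * 19 + [("failure", True)]
# Model (B): every run of the contradicting group is flagged.
model_b = [("success", True)] * 19 + [("failure", True)]

rate_a = flakiness_rate(model_a)                                # no filter needed
rate_b_unfiltered = flakiness_rate(model_b)                     # consumer forgot the filter
rate_b_filtered = flakiness_rate(model_b, failures_only=True)   # consumer remembered it
```

Model (A) yields 5% with the plain query. Model (B) yields 100% unless the consumer adds the status filter, in which case it also yields 5%.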

This also sets up an occurrence-based threshold (count distinct flaky commits) as a clean follow-up, since is_flaky now cleanly means “a run that flaked.”
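
As a rough idea of what such a follow-up could look like (this PR does not implement it), the Python sketch below counts distinct commits that have at least one flaky run, next to the existing per-run count. The dict keys and function names are hypothetical:

```python
def flaky_run_count(runs):
    """Per-run metric: number of runs flagged is_flaky."""
    return sum(1 for r in runs if r["is_flaky"])


def flaky_commit_count(runs):
    """Occurrence-based metric: distinct commits with at least one flaky run."""
    return len({r["commit"] for r in runs if r["is_flaky"]})


runs = [
    # Commit "abc" was retried heavily and failed three times without reproducing.
    {"commit": "abc", "is_flaky": True},
    {"commit": "abc", "is_flaky": True},
    {"commit": "abc", "is_flaky": True},
    {"commit": "abc", "is_flaky": False},
    # Commit "def" flaked once.
    {"commit": "def", "is_flaky": True},
    # Commit "ghi" never flaked.
    {"commit": "ghi", "is_flaky": False},
]

per_run = flaky_run_count(runs)
per_commit = flaky_commit_count(runs)
```

Here the per-run count is 4 while the per-commit count is 2, so a threshold on commits would be insensitive to how many times CI retried a single commit.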

Impact

  • flaky_run_count, flakiness_rate, and the overview card now reflect spurious failures rather than every execution of a flaky commit, so auto-quarantine thresholds behave as users expect.
  • Repetition-based flakiness (a single run with mixed retries) is unchanged.
  • A passing run’s detail page no longer shows a “flaky” banner; the flaky group is still fully visible from the failing run and the test case Flaky Runs tab.
  • Historical data is not backfilled: already-flaky commits keep their inflated counts until they age out of the evaluation window (7-30 days). New ingestion is correct immediately.

Validation

Run in this worktree:

  • mix test test/tuist/tests_test.exs -> 225/225
  • flaky monitor, analytics flakiness rate, vcs PR-comment, and the test-case / test-run / flaky-tests / test-case-run LiveView suites -> 79/79
  • One assertion legitimately flipped (flaky_runs_count 2 -> 1) and was updated; that assertion is the behavior being fixed
  • mix format and mix credo clean

How to test locally

  1. In CI, run the same test case several times on one commit so that it both passes and fails (a flaky commit).
  2. Open the test case’s Flaky Runs tab: the commit group still shows the full passed/failed split, but only the failing runs are counted as flaky.
  3. Check the flakiness rate / flaky_run_count: the passing runs on that commit no longer inflate them.

🤖 Generated with Claude Code
