fix(server): flag only the failing run as flaky in cross-run detection
GitHub issue · Closed
What changed
Cross-run flaky detection now flags only the contradicted failure as is_flaky, never its passing siblings. A run is flaky iff it is a failure on a commit/scheme the same test case also passed on (a failure that did not reproduce).
To keep the flaky-group displays coherent (they filtered on is_flaky == true, so failure-only marking would have zeroed their “passed” counts), each view now identifies the flaky group by its is_flaky runs but fetches the full run history of the commit so the pass/fail breakdown stays accurate:
- list_flaky_runs_for_test_case (test case Flaky Runs tab)
- get_flaky_run_group_for_test_case_run (single run view)
- fetch_cross_run_flaky_runs (PR comment / test-run view)
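A rough Python sketch of the display approach (the server code is Elixir; the function name, the dict-based run shape, and the "success" status string are assumptions made for illustration): the group is located through its is_flaky runs, but the breakdown is computed over every run on that (scheme, commit).

```python
from collections import Counter

def flaky_groups(runs):
    """Summarize flaky groups from a list of run dicts.

    Each run is assumed to have: scheme, commit, status, is_flaky.
    """
    # Step 1: identify flaky groups by their is_flaky (failing) runs only.
    keys = {(r["scheme"], r["commit"]) for r in runs if r["is_flaky"]}

    groups = {}
    for key in keys:
        # Step 2: refetch the full run history of that (scheme, commit),
        # so passing runs still contribute to the breakdown.
        members = [r for r in runs if (r["scheme"], r["commit"]) == key]
        counts = Counter(r["status"] for r in members)
        groups[key] = {
            "passed": counts["success"],
            "failed": counts["failure"],
            "flaky": sum(1 for r in members if r["is_flaky"]),
        }
    return groups

runs = [
    {"scheme": "App", "commit": "abc", "status": "success", "is_flaky": False},
    {"scheme": "App", "commit": "abc", "status": "success", "is_flaky": False},
    {"scheme": "App", "commit": "abc", "status": "failure", "is_flaky": True},
    # A consistent failure on another commit is not a flaky group.
    {"scheme": "App", "commit": "def", "status": "failure", "is_flaky": False},
]
summary = flaky_groups(runs)
```

Here the group for commit abc keeps its 2 passed / 1 failed split while only one run counts as flaky, and commit def does not appear at all.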
Why
A customer hit a 700% spike in auto-quarantined tests. Root cause: detection marked every run in a contradicting (test case, commit, scheme) group as flaky, passes included. So a test that ran 20 times on a commit and failed once was recorded as 20 flaky runs, not 1.
That inflation flows into everything that reads is_flaky:
- flaky_run_count automation (the customer’s auto-quarantine rule)
- flakiness_rate automation + the test-case overview flakiness card
- the global flaky dashboards and the flaky_runs_count column
Because the rule counts absolute flaky runs, a single flaky commit that CI retried enough times could trip a “N flaky runs in 7 days” threshold on its own, and a retry / flake-check burst could quarantine a whole suite in one evaluation. It also confused the UI: the Flaky Runs tab groups by commit, so users counted “2 recent flaky commits” while the automation counted ~23 individual flagged runs.
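To make the inflation concrete, here is a small Python sketch of the 20-run example (19 passes, 1 failure on one commit). The helper names, the "success" status string, and the threshold of 10 are hypothetical and are not the customer's actual rule; the rate is the countIf(is_flaky)/count ratio over that commit's runs.

```python
def old_flags(statuses):
    # Old behavior: every run in a contradicting group was marked flaky.
    contradicting = "success" in statuses and "failure" in statuses
    return [contradicting] * len(statuses)

def new_flags(statuses):
    # New behavior: only failures on a commit the test also passed on.
    has_pass = "success" in statuses
    return [has_pass and s == "failure" for s in statuses]

statuses = ["success"] * 19 + ["failure"]
THRESHOLD = 10  # hypothetical "N flaky runs in 7 days" rule

for name, flags in (("old", old_flags(statuses)), ("new", new_flags(statuses))):
    flaky_run_count = sum(flags)
    flakiness_rate = flaky_run_count / len(flags)
    trips = flaky_run_count >= THRESHOLD
    print(name, flaky_run_count, f"{flakiness_rate:.0%}", trips)
# old 20 100% True
# new 1 5% False
```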
Root cause
check_cross_run_flakiness / filter_cross_run_flaky marked the current run and back-marked the opposite-status historical runs as flaky, regardless of which side was the pass. The new resolve_cross_run_flaky_failures flags only the failure side:
- this run failed and the test already passed on the commit → flag this failure
- this run passed and the test already failed on the commit → back-mark those earlier failures
- otherwise → nothing
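The three cases above can be sketched in Python as follows. This is an illustration of the rule, not the actual resolve_cross_run_flaky_failures implementation (which lives in the Elixir server); the Run type, the function name, and the "success" status string are assumptions.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass(frozen=True)
class Run:
    id: int
    status: str  # "success" or "failure" (assumed status values)

def resolve_flaky_failures(current: Run, prior: List[Run]) -> Set[int]:
    """Return the ids of runs to flag as flaky.

    `prior` holds the earlier runs of the same test case on the same
    commit and scheme. Only failures are ever returned.
    """
    prior_passed = any(r.status == "success" for r in prior)
    prior_failures = {r.id for r in prior if r.status == "failure"}

    if current.status == "failure" and prior_passed:
        # This failure did not reproduce: flag it, leave the passes clean.
        return {current.id}
    if current.status == "success" and prior_failures:
        # The pass contradicts earlier failures: back-mark those failures.
        return prior_failures
    # Consistent pass or consistent failure: nothing is flaky.
    return set()
```

Note that the passing run is never in the returned set, whichever side arrives first.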
Design trade-off: what should run-level is_flaky mean?
Two models were on the table:
- (A) is_flaky = the flaky failure (this PR). A run is flaky only if it is a non-reproducing failure. The passing runs that prove the test can succeed stay clean.
- (B) is_flaky = part of a flaky episode (mark both passing and failing runs of a contradicting group), and have the automation count only the failures.
We went with (A). Reasoning:
- Industry alignment. Trunk Flaky Tests, Datadog Test Optimization, and BuildPulse treat flakiness as a property of the test, not of individual runs. The runs are evidence: a “flaky run” is a non-reproducing failure; a passing run is just the passing evidence and is never labeled flaky. Tuist already has the test-level status (TestCase.is_flaky, set by the automation); model (A) makes the run-level boolean mean the matching thing (the flaky failure).
- Single source of truth. With (A), is_flaky means one thing everywhere, so flaky_run_count, flakiness_rate, the overview card, and the global dashboards are all correct with no per-query status filter. Model (B) keeps passing runs flagged, so every current and future consumer has to remember AND status = 'failure'; miss one and it is inflated again. That is exactly the bug that caused this incident (flakiness_rate = countIf(is_flaky)/count read ~100% for a 19-pass/1-fail commit).
- (B) is also more migration, not less. The automation reads the per-case daily MV’s sumState(toUInt8(is_flaky)) aggregate, which has no status dimension. To “count only failures” there you would have to add a flaky-failure aggregate to that MV plus the four test_case_runs_recent_N_per_case bucket MVs and the full recent-runs MV (whose rolling-window tuples carry (ran_at, is_flaky) and would need the failure bit too) plus the dashboard MV, each with a backfill. Model (A) fixes the existing aggregate for free by changing one ingestion function, with zero schema change.
- The one upside of (B), handled. (B)’s appeal is that the flaky episode becomes a one-line is_flaky == true query. We get the same view by grouping flaky failures by (scheme, commit) and refetching the commit’s runs (the three display changes above). If a first-class episode is ever wanted, it should be a separate marker rather than overloading is_flaky to mean two things, which is what bit us here.
This also sets up an occurrence-based threshold (count distinct flaky commits) as a clean follow-up, since is_flaky now cleanly means “a run that flaked.”
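As a hedged illustration of that follow-up (it is not part of this PR; the function names and run shape are hypothetical), an occurrence-based count would collapse retries of the same flaky commit into one occurrence, whereas the run-based count does not:

```python
def flaky_run_count(runs):
    # Run-based: every flaky failure counts, including CI retries.
    return sum(1 for r in runs if r["is_flaky"])

def flaky_commit_count(runs):
    # Occurrence-based: each commit counts once, however many times
    # its flaky failure was retried. Keying on (scheme, commit) instead
    # would be an equally plausible choice.
    return len({r["commit"] for r in runs if r["is_flaky"]})

runs = [
    {"commit": "abc", "status": "failure", "is_flaky": True},
    {"commit": "abc", "status": "failure", "is_flaky": True},
    {"commit": "abc", "status": "failure", "is_flaky": True},
    {"commit": "abc", "status": "success", "is_flaky": False},
    {"commit": "def", "status": "failure", "is_flaky": True},
    {"commit": "def", "status": "success", "is_flaky": False},
]

print(flaky_run_count(runs), flaky_commit_count(runs))
# 4 2
```

This only works because passing runs are no longer flagged; under the old marking both numbers would also have absorbed the passes.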
Impact
- flaky_run_count, flakiness_rate, and the overview card now reflect spurious failures rather than every execution of a flaky commit, so auto-quarantine thresholds behave as users expect.
- Repetition-based flakiness (a single run with mixed retries) is unchanged.
- A passing run’s detail page no longer shows a “flaky” banner; the flaky group is still fully visible from the failing run and the test case Flaky Runs tab.
- Historical data is not backfilled: already-flaky commits keep their inflated counts until they age out of the evaluation window (7-30 days). New ingestion is correct immediately.
Validation
Run in this worktree:
- mix test test/tuist/tests_test.exs -> 225/225
- flaky monitor, analytics flakiness rate, vcs PR-comment, and the test-case / test-run / flaky-tests / test-case-run LiveView suites -> 79/79
- One assertion legitimately flipped (flaky_runs_count 2 -> 1) and was updated; that assertion is the behavior being fixed
- mix format and mix credo clean
How to test locally
- On a CI run, run the same test case on one commit several times so it both passes and fails (a flaky commit).
- Open the test case’s Flaky Runs tab: the commit group still shows the full passed/failed split, but only the failing runs are counted as flaky.
- Check the flakiness rate / flaky_run_count: the passing runs on that commit no longer inflate them.
🤖 Generated with Claude Code