Hive
fix(server): re-arm flaky-test alerts when the condition clears, regardless of recovery
GitHub issue · Closed
Describe here the purpose of your PR.
What this fixes
Automation alerts fire their trigger actions only on a transition into the triggered set:
newly_triggered = triggered_ids \ already_triggered # already_triggered = latest event is "triggered"
A test case leaves already_triggered only when a recovered event is appended — and today that append is gated entirely behind recovery_enabled. So a recovery-disabled alert (for example the default per-project “Flaky test detection”) latches in triggered the first time a test case matches and never acts on it again, even after:
- the condition clears and later returns, or
- a different alert undoes the side effect (e.g. “Auto-mute flaky tests” recovery clearing
is_flaky).
This surfaced on a real test case that kept flaking but was never re-marked: the alert responsible for marking it flaky was permanently latched, so it silently stopped acting.
What changed
Re-arming (the bookkeeping that lets the next rising edge fire) is now decoupled from the opt-in undo actions:
handle_recoveryalways runs. When a triggered test case clears its condition and the recovery dwell elapses, the alert appends arecoveredevent so it can fire again on the next rising edge.- Undo side effects stay opt-in: only
recovery_enabledalerts pass theirrecovery_actions(unmute, remove label). A recovery-disabled alert re-arms but leaves its effect in place until a human clears it — a legitimate, intentional config, not a bug.
Why the recovery scope was reworked (avoiding a ClickHouse regression)
The recovery candidate set previously used the monitor’s all field. For recovery-disabled alerts that field was [] (cheap); populating it would have meant a per-tick SELECT DISTINCT id FROM test_cases for every project’s default flaky alert, because AutomationScheduler enqueues full evaluations for last_days alerts every cadence. Given the canary pool / alert-contention history, that’s not acceptable.
Instead, the recovery scope is now derived from the worker’s scoped_test_case_ids:
- Scoped tick → only re-arm test cases in the evaluated chunk.
- Full tick → every active event was measured, so all are fair game.
The all field was a no-op filter anyway (active events are always real test cases), so it and load_all_test_case_ids are removed entirely. Net result: cheaper than before, including for recovery-enabled alerts, with no new per-tick query.
Notes for reviewers
- Re-arming still respects the recovery dwell (hysteresis) to avoid flapping re-fires on notification alerts. For recovery-disabled alerts with no
recovery_configit falls back to the existing 14-day default — moot in practice for the affected alerts (Flaky-detection’s condition only clears after 30d clean; Auto-disable’s re-fire is an idempotent no-op). - This fixes the latching class of bug (alerts that silently stop acting). It does not, by itself, re-mark a test whose condition never went false — that case is two alerts contending over the same
is_flakyfield, which is a separate single-writer / desired-state change.
How to test locally
```bash
cd server
mix test test/tuist/automations/workers/alert_evaluation_worker_test.exs
test/tuist/automations/monitors/flaky_tests_monitor_test.exs
```
Validation run in this branch:
alert_evaluation_worker_test.exs— 23 pass (incl. two new tests: re-arms-but-withholds-actions when recovery is disabled, and respects-the-dwell)flaky_tests_monitor_test.exs— 18 pass- Full
test/tuist/automations/— 113 pass mix format --check-formattedclean,mix credono issues
Behavior change / migration note
Before this change, recovery_enabled: false effectively meant “fire the trigger actions once, then stay latched forever.” With re-arming decoupled from recovery, a recovery-disabled alert now re-arms as soon as its condition no longer holds (no dwell — recovery means exactly that). Practical impact:
- For idempotent trigger actions (mute, mark-flaky, disable) the re-fire is a no-op.
- For a
send_slacktrigger on a flaky condition, this introduces a re-ping on the natural flap cycle (each time the test goes clean and then flaky again). Anyone relying on the old fire-once behavior can move torecovery_enabled: truewith emptyrecovery_actionsto keep the alert latched until they act.