Hive Hive
Sign in

fix(server): re-arm flaky-test alerts when the condition clears, regardless of recovery

GitHub issue · Closed

Metadata
Source
tuist/tuist #11213
Updated
Jun 24, 2026
Details

Describe here the purpose of your PR.

What this fixes

Automation alerts fire their trigger actions only on a transition into the triggered set:

newly_triggered = triggered_ids \ already_triggered # already_triggered = latest event is "triggered"

A test case leaves already_triggered only when a recovered event is appended — and today that append is gated entirely behind recovery_enabled. So a recovery-disabled alert (for example the default per-project “Flaky test detection”) latches in triggered the first time a test case matches and never acts on it again, even after:

  • the condition clears and later returns, or
  • a different alert undoes the side effect (e.g. “Auto-mute flaky tests” recovery clearing is_flaky).

This surfaced on a real test case that kept flaking but was never re-marked: the alert responsible for marking it flaky was permanently latched, so it silently stopped acting.

What changed

Re-arming (the bookkeeping that lets the next rising edge fire) is now decoupled from the opt-in undo actions:

  • handle_recovery always runs. When a triggered test case clears its condition and the recovery dwell elapses, the alert appends a recovered event so it can fire again on the next rising edge.
  • Undo side effects stay opt-in: only recovery_enabled alerts pass their recovery_actions (unmute, remove label). A recovery-disabled alert re-arms but leaves its effect in place until a human clears it — a legitimate, intentional config, not a bug.

Why the recovery scope was reworked (avoiding a ClickHouse regression)

The recovery candidate set previously used the monitor’s all field. For recovery-disabled alerts that field was [] (cheap); populating it would have meant a per-tick SELECT DISTINCT id FROM test_cases for every project’s default flaky alert, because AutomationScheduler enqueues full evaluations for last_days alerts every cadence. Given the canary pool / alert-contention history, that’s not acceptable.

Instead, the recovery scope is now derived from the worker’s scoped_test_case_ids:

  • Scoped tick → only re-arm test cases in the evaluated chunk.
  • Full tick → every active event was measured, so all are fair game.

The all field was a no-op filter anyway (active events are always real test cases), so it and load_all_test_case_ids are removed entirely. Net result: cheaper than before, including for recovery-enabled alerts, with no new per-tick query.

Notes for reviewers

  • Re-arming still respects the recovery dwell (hysteresis) to avoid flapping re-fires on notification alerts. For recovery-disabled alerts with no recovery_config it falls back to the existing 14-day default — moot in practice for the affected alerts (Flaky-detection’s condition only clears after 30d clean; Auto-disable’s re-fire is an idempotent no-op).
  • This fixes the latching class of bug (alerts that silently stop acting). It does not, by itself, re-mark a test whose condition never went false — that case is two alerts contending over the same is_flaky field, which is a separate single-writer / desired-state change.

How to test locally

```bash cd server mix test test/tuist/automations/workers/alert_evaluation_worker_test.exs
test/tuist/automations/monitors/flaky_tests_monitor_test.exs ```

Validation run in this branch:

  • alert_evaluation_worker_test.exs — 23 pass (incl. two new tests: re-arms-but-withholds-actions when recovery is disabled, and respects-the-dwell)
  • flaky_tests_monitor_test.exs — 18 pass
  • Full test/tuist/automations/ — 113 pass
  • mix format --check-formatted clean, mix credo no issues

Behavior change / migration note

Before this change, recovery_enabled: false effectively meant “fire the trigger actions once, then stay latched forever.” With re-arming decoupled from recovery, a recovery-disabled alert now re-arms as soon as its condition no longer holds (no dwell — recovery means exactly that). Practical impact:

  • For idempotent trigger actions (mute, mark-flaky, disable) the re-fire is a no-op.
  • For a send_slack trigger on a flaky condition, this introduces a re-ping on the natural flap cycle (each time the test goes clean and then flaky again). Anyone relying on the old fire-once behavior can move to recovery_enabled: true with empty recovery_actions to keep the alert latched until they act.
Comments
TA
tuist-atlas[bot] Jun 12, 2026

The fix for re-arming flaky-test alerts when the condition clears, regardless of recovery, is now available in version xcresult-processor-image@0.17.1. Please update to this version.

TA
tuist-atlas[bot] Jun 12, 2026

This fix is now available in server@1.210.1. Update to this version to enable re-arming of flaky-test alerts when the condition clears, regardless of recovery configuration.