Flaky-run detection now counts only failing runs

Cross-run flaky detection used to mark every run in a contradicting group as flaky, including the passing runs. That meant a single test failure on a commit could inflate flaky_run_count, flakiness_rate, and the overview flakiness card by the total number of runs on that commit. With this fix, only the non-reproducing failure is flagged as flaky, while the passing runs that prove the test can succeed stay clean. This prevents auto-quarantine thresholds from tripping early because of repeated retries on a flaky commit, and makes the Flaky Runs tab, single-run view, and PR comment views show the accurate pass/fail split. Historical data is not backfilled, so existing inflated counts will age out of their evaluation windows.