feat(server): base shard timing on p90 of all CI runs, not default-branch avg
GitHub issue · Closed
What changed
Shard timing used for bin-packing was computed from the average test module/suite duration over the last 30 days, filtered to the project’s default_branch. This PR drops the branch filter and switches the estimator from avg to quantile(0.90).
Why
- Branch filter discarded most of the sample for no real signal. Sharding happens overwhelmingly on PR/feature branches, but timing was sourced only from default_branch. Any unit not recently run on the default branch (a brand-new module, or one skipped by selective testing on main) wasn’t found in the timing map and fell back to the global median of all units, not its own history. A heavy module assigned the median badly unbalances shards, which is exactly what bin-packing exists to prevent. Since a test’s duration is essentially branch-independent, filtering to one branch threw away the large majority of usable samples for marginal noise reduction.
- avg is outlier-sensitive. A single slow or flaky run inflates the estimate. quantile(0.90) is robust to that while still being conservative (slightly high), which is the safer bias for shard balancing.
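To make the outlier point concrete, here is a small language-neutral sketch in Python (not the server's Elixir/ClickHouse code). The sample durations are invented, and the quantile helper uses plain linear interpolation, which is an assumption and may differ slightly from ClickHouse's approximate quantile.

```python
from statistics import mean

def quantile(samples, q):
    """Linear-interpolation quantile of a non-empty sample, 0 <= q <= 1."""
    ordered = sorted(samples)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Hypothetical durations in ms: 19 normal runs plus one flaky 2-second run.
durations_ms = [100] * 19 + [2000]

avg_estimate = mean(durations_ms)            # pulled up by the single outlier
p90_estimate = quantile(durations_ms, 0.90)  # stays at the typical duration
```

With this sample the average is 195 ms while the p90 stays at 100 ms. The robustness only holds while outliers make up less than roughly 10% of the sample.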
Design / precedent
This aligns the timing model with how mature test-splitters work, in particular Buildkite Test Engine, which uses a bounded time window + a percentile estimator + a median fallback for units without history. We keep all three; the existing 30-day window is retained rather than widening it.
A count-based window (last N runs) was considered and rejected in favor of keeping the date window. quantile(0.90)(?) matches the established ClickHouse pattern already used in builds.ex and tests/analytics.ex. The is_ci, lookback, and median/hardcoded fallback behavior are unchanged.
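The three-part model (bounded window, percentile estimator, median fallback) could be sketched roughly as follows. This is an illustration in Python, not the code in lib/tuist/shards.ex: the run record shape, the function names, the DEFAULT_MS value, and the reading of "global median" as the median of per-unit estimates are all assumptions.

```python
from datetime import datetime, timedelta
from statistics import median

LOOKBACK = timedelta(days=30)   # the 30-day window the PR keeps
DEFAULT_MS = 1_000              # hypothetical; the real hardcoded fallback is not stated

def p90(samples):
    """90th percentile by linear interpolation over a non-empty sample."""
    ordered = sorted(samples)
    pos = 0.90 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def build_timing_map(runs, now):
    """Per-unit p90 over CI runs inside the window, on any branch."""
    samples = {}
    for run in runs:
        # run["branch"] is deliberately not checked.
        if run["is_ci"] and now - run["ran_at"] <= LOOKBACK:
            samples.setdefault(run["unit"], []).append(run["duration_ms"])
    return {unit: p90(values) for unit, values in samples.items()}

def estimate_ms(unit, timing_map):
    """Own history first, then the median across known units, then the default."""
    if unit in timing_map:
        return timing_map[unit]
    if timing_map:
        return median(timing_map.values())
    return DEFAULT_MS

now = datetime(2025, 1, 31)
runs = [
    {"unit": "AppTests", "branch": "feature/x", "is_ci": True,
     "ran_at": now - timedelta(days=2), "duration_ms": 100_000},
    {"unit": "AppTests", "branch": "feature/x", "is_ci": False,
     "ran_at": now - timedelta(days=1), "duration_ms": 5_000},   # local run, ignored
    {"unit": "OldTests", "branch": "main", "is_ci": True,
     "ran_at": now - timedelta(days=45), "duration_ms": 9_000},  # outside the window
]
timing_map = build_timing_map(runs, now)
```

Here a single CI run on a feature branch is enough to give AppTests its own estimate of 100_000 ms. A unit with no samples in the window gets the median of the known estimates, or DEFAULT_MS when the map is empty.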
Impact
Shard plans get a real per-unit p90 estimate from the full CI sample instead of falling back to the median for anything the default branch didn’t cover, so shard balancing should improve, especially for projects whose main runs are sparse relative to PR runs.
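The effect on balance can be illustrated with a toy example. The sketch below is in Python and assumes a greedy longest-first packer; the PR does not say which bin-packing algorithm the server uses, and the unit names and durations are made up.

```python
from statistics import median

def plan_shards(estimates, shard_count):
    """Greedy longest-first packing: each unit goes to the least-loaded shard."""
    shards = [[] for _ in range(shard_count)]
    loads = [0] * shard_count
    for unit in sorted(estimates, key=lambda u: -estimates[u]):
        target = loads.index(min(loads))
        shards[target].append(unit)
        loads[target] += estimates[unit]
    return shards

def makespan(shards, actual):
    """Wall-clock time of the slowest shard, using real durations."""
    return max(sum(actual[u] for u in shard) for shard in shards)

# Hypothetical real durations in ms; "Heavy" has no default-branch history.
actual_ms = {"Heavy": 600, "A": 100, "B": 100, "C": 100, "D": 100, "E": 100}

# Old behavior: "Heavy" is missing from the timing map and gets the median.
known = {u: d for u, d in actual_ms.items() if u != "Heavy"}
with_fallback = {"Heavy": median(known.values()), **known}

balanced = makespan(plan_shards(actual_ms, 2), actual_ms)
skewed = makespan(plan_shards(with_fallback, 2), actual_ms)
```

With accurate estimates the slowest of the two shards takes 600 ms. When Heavy is planned at the 100 ms median, it is packed alongside two other units and that shard takes 800 ms.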
How to test locally
mix test test/tuist/shards_test.exs
Includes a new test asserting that a run on a non-default branch now feeds the timing estimate (estimated_duration_ms == 100_000). Full suite: 19 tests, 0 failures; mix credo lib/tuist/shards.ex clean.
This change is now available in server@1.213.0. Update to that version to use shard timing based on p90 of all CI runs instead of the default-branch average.