feat(server): base shard timing on p90 of all CI runs, not default-branch avg

Metadata

Source

tuist/tuist #11310

Updated

Jun 24, 2026

Domains

Compute

Details

What changed

Previously, the shard timing used for bin-packing was the average test module/suite duration over the last 30 days, filtered to the project’s default_branch. This PR drops the branch filter and switches the estimator from avg to quantile(0.90).
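The before/after behavior can be sketched in Python for illustration. This is not the server's Elixir/ClickHouse code: the `Run` record, the function names, and the linear-interpolation `p90` helper are assumptions made for this example (ClickHouse's `quantile` is an approximate estimator and may return slightly different values).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """Hypothetical record of one test unit execution (not the real schema)."""
    unit: str          # test module or suite name
    branch: str
    is_ci: bool
    duration_ms: float


def p90(values):
    """90th percentile using linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = 0.9 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def group_durations(runs):
    """Collect durations per unit."""
    by_unit = {}
    for run in runs:
        by_unit.setdefault(run.unit, []).append(run.duration_ms)
    return by_unit


def old_estimates(runs, default_branch):
    """Before: CI runs on the default branch only, averaged."""
    kept = [r for r in runs if r.is_ci and r.branch == default_branch]
    return {u: mean(d) for u, d in group_durations(kept).items()}


def new_estimates(runs):
    """After: all CI runs regardless of branch, 90th percentile."""
    kept = [r for r in runs if r.is_ci]
    return {u: p90(d) for u, d in group_durations(kept).items()}
```

In this sketch, a unit that only ever ran on a feature branch is absent from `old_estimates` but gets its own entry in `new_estimates`.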

Why

Branch filter discarded most of the sample for no real signal. Sharding happens overwhelmingly on PR/feature branches, but timing was sourced only from default_branch. Any unit not recently run on the default branch (a brand-new module, or one skipped by selective testing on main) was missing from the timing map and fell back to the global median of all units instead of its own history. A heavy module assigned the median badly unbalances shards, which is exactly what bin-packing exists to prevent. Since a test’s duration is essentially branch-independent, filtering to one branch threw away the large majority of usable samples in exchange for marginal noise reduction.
avg is outlier-sensitive. A single slow or flaky run inflates the estimate. quantile(0.90) barely moves when one run in a larger sample is slow, while still being conservative (slightly high), which is the safer bias for shard balancing.
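A small numeric illustration of the outlier argument, using made-up durations (19 normal runs and one flaky run). The percentile here comes from Python's `statistics.quantiles` with the inclusive method, which is an assumption for this sketch and not the exact ClickHouse estimator.

```python
from statistics import mean, quantiles


def p90(durations):
    # quantiles(n=10) returns nine cut points; the last one is the 90th percentile.
    return quantiles(durations, n=10, method="inclusive")[-1]


# Hypothetical samples: 19 runs of 10 s and a single flaky 200 s run.
durations_ms = [10_000] * 19 + [200_000]

avg_estimate = mean(durations_ms)   # pulled upward by the single slow run
p90_estimate = p90(durations_ms)    # stays at the typical duration

print(avg_estimate, p90_estimate)
```

With these samples the average nearly doubles to 19,500 ms, while the p90 stays at 10,000 ms.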

Design / precedent

This aligns the timing model with how mature test-splitters work, in particular Buildkite Test Engine, which uses a bounded time window + a percentile estimator + a median fallback for units without history. We keep all three; the existing 30-day window is retained rather than widening it.
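The three-part model (bounded window, percentile estimator, median fallback) could look roughly like the following Python sketch. The sample tuple shape, the function names, and the `HARDCODED_DEFAULT_MS` value are hypothetical, and taking the median over per-unit estimates is an assumption about how the global median is formed.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical last-resort value; the real constant is not stated in the source.
HARDCODED_DEFAULT_MS = 1_000


def p90(values):
    """90th percentile using linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = 0.9 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def build_timing_map(samples, now, window_days=30):
    """samples: iterable of (unit, ran_at, duration_ms) tuples from CI runs."""
    cutoff = now - timedelta(days=window_days)
    by_unit = {}
    for unit, ran_at, duration_ms in samples:
        if ran_at >= cutoff:  # bounded time window
            by_unit.setdefault(unit, []).append(duration_ms)
    return {unit: p90(d) for unit, d in by_unit.items()}  # percentile estimator


def estimate(unit, timing_map):
    if unit in timing_map:
        return timing_map[unit]
    if timing_map:
        # Unit has no history in the window: fall back to the global median.
        return median(timing_map.values())
    # No history at all: fall back to the hardcoded default.
    return HARDCODED_DEFAULT_MS
```

A unit whose only samples are older than the window is treated like a unit with no history and receives the median of the other units' estimates.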

A count-based window (last N runs) was considered and rejected in favor of keeping the date window. quantile(0.90)(?) matches the established ClickHouse pattern already used in builds.ex and tests/analytics.ex. The is_ci, lookback, and median/hardcoded fallback behavior are unchanged.

Impact

Shard plans get a real per-unit p90 estimate from the full CI sample instead of falling back to the median for anything the default branch didn’t cover, so shard balancing should improve, especially for projects whose main runs are sparse relative to PR runs.
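To make the balancing effect concrete, here is a toy Python example. The packing strategy shown (greedy longest-first onto the lightest shard) is an assumption for illustration, since the source does not say which bin-packing algorithm the server uses; the unit names and durations are made up.

```python
import heapq


def pack(estimates_ms, shard_count):
    """Greedy packing: largest estimate first, onto the currently lightest shard."""
    heap = [(0, i) for i in range(shard_count)]  # (estimated load, shard index)
    shards = [[] for _ in range(shard_count)]
    ordered = sorted(estimates_ms.items(), key=lambda kv: (-kv[1], kv[0]))
    for unit, est in ordered:
        load, i = heapq.heappop(heap)
        shards[i].append(unit)
        heapq.heappush(heap, (load + est, i))
    return shards


def actual_makespan(shards, true_ms):
    """Wall-clock time of the slowest shard, using the real durations."""
    return max(sum(true_ms[u] for u in shard) for shard in shards)


# Hypothetical project: one heavy module and six light ones.
true_ms = {"Heavy": 600, "A": 100, "B": 100, "C": 100, "D": 100, "E": 100, "F": 100}

# Heavy has no default-branch history, so it receives the median (100).
median_fallback = dict(true_ms, Heavy=100)

with_fallback = actual_makespan(pack(median_fallback, 2), true_ms)
with_own_history = actual_makespan(pack(true_ms, 2), true_ms)

print(with_fallback, with_own_history)
```

In this toy case the slowest shard takes 900 ms when the heavy module is estimated at the median, versus 600 ms when it is estimated from its own history.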

How to test locally

mix test test/tuist/shards_test.exs

Includes a new test asserting that a run on a non-default branch now feeds the timing estimate (estimated_duration_ms == 100_000). Full suite: 19 tests, 0 failures; mix credo lib/tuist/shards.ex clean.

Comments

tuist-atlas[bot] Jun 17, 2026

This change is now available in xcresult-processor-image@0.24.0. Update to this version to use the new p90-based shard timing estimator that draws from all CI runs rather than filtering to the default branch.

tuist-atlas[bot] Jun 17, 2026

This change is now available in server@1.213.0. Update to that version to use shard timing based on p90 of all CI runs instead of the default-branch average.