fix(server): match suite-granularity shard timing by module-qualified name
GitHub issue · Closed
What changed
Tuist.Shards.fetch_timing_data/2 for suite granularity now joins test_suite_runs to test_module_runs and keys the historical timing map on concat(module.name, '/', suite.name) instead of the bare suite name.
Why
Suite-granularity shard plans collapsed to a single shard regardless of the configured per-shard duration target, so a job that should fan out across many shards ran as one long shard.
Root cause
The two sides of the timing lookup built keys differently:
- The CLI sends each suite as Module/Suite (blueprintName/className), e.g. AppTests/LoginTests (ShardPlanService.swift).
- The server grouped test_suite_runs on the bare name column, producing a map keyed by LoginTests, with no module prefix.
So every Map.get(timing_data, "AppTests/LoginTests", default) in assign_durations/3 missed and fell back to the median duration. With every suite assigned the same tiny median, the bin packer’s total estimate was ~100x too low and BinPacker.determine_shard_count/3 computed ceil(total / target) = 1.
Module granularity was never affected because module names are both stored and sent in the same bare form, so those lookups match (which is exactly why module plans fan out correctly while every suite plan collapsed to 1).
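The lookup miss can be modelled in a few lines. This is a minimal Python sketch, not the server's Elixir code: assign_durations here only imitates the "look up, else fall back to the median" behaviour described above, and the SignupTests suite with its 1000 ms duration is an assumption chosen so that the median equals the 45500 reported under Validation.

```python
from statistics import median


def assign_durations(suites, timing_data):
    """Look up each requested suite; on a miss, use the median of known timings."""
    fallback = median(timing_data.values())
    return {suite: timing_data.get(suite, fallback) for suite in suites}


# Keys as the CLI sends them: Module/Suite.
requested = ["AppTests/LoginTests", "AppTests/SignupTests"]

# Old server behaviour: historical timings (ms) keyed by bare suite name.
bare_keyed = {"LoginTests": 90_000, "SignupTests": 1_000}

# Fixed behaviour: the same timings keyed by module-qualified name.
qualified_keyed = {"AppTests/LoginTests": 90_000, "AppTests/SignupTests": 1_000}

broken = assign_durations(requested, bare_keyed)      # every lookup misses
fixed = assign_durations(requested, qualified_keyed)  # every lookup hits
```

With the bare-name map, both suites receive the 45500 median; with the qualified map, LoginTests gets 90000 and SignupTests gets 1000.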
Why this fix
Keying the timing query on the module-qualified name makes the server’s map match what the CLI sends. A join (rather than stripping the module prefix off the CLI input) is also more correct: the previous bare-name grouping conflated same-named suites that live in different modules.
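The conflation of same-named suites can be illustrated with a small Python sketch. The rows, the KitTests module, and the use of an average as the aggregate are all assumptions for illustration; the source does not say which aggregate the real query applies.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical historical rows: (module name, suite name, duration in ms).
rows = [
    ("AppTests", "LoginTests", 90_000),
    ("KitTests", "LoginTests", 2_000),
]


def timing_map(rows, key):
    """Group durations by the given key function and average each group."""
    grouped = defaultdict(list)
    for module, suite, duration in rows:
        grouped[key(module, suite)].append(duration)
    return {name: mean(durations) for name, durations in grouped.items()}


# Old grouping: two unrelated suites share the key "LoginTests".
by_bare_name = timing_map(rows, lambda module, suite: suite)

# New grouping: each suite keeps its own entry.
by_qualified_name = timing_map(rows, lambda module, suite: f"{module}/{suite}")
```

Under the bare-name grouping, the two LoginTests suites merge into one entry of 46000 ms that describes neither of them. The qualified grouping keeps 90000 and 2000 separate.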
Validation
- Added a regression test (matches suite timing data by module-qualified name) that seeds two suites in the same module with very different durations and asserts each is assigned its real per-suite timing. Verified it fails against the old code (returns the median, 45500, instead of 90000) and passes with the fix.
- Full test/tuist/shards_test.exs suite green (24/24). mix credo and mix format clean on the changed file.
- Empirically confirmed against production data for an affected project: the broken lookup estimated a 783-suite plan at ~101s total (→ 1 shard), while the module-qualified join matches all 783 suites against real CI timing and estimates ~2.8h total (→ ~17 shards needed at a 10-minute target).
Note for affected users
The needed shard count is still clamped to --shard-max (default 10) in BinPacker.determine_shard_count/3. To hit a tight per-shard target on a large suite set, --shard-max must be raised accordingly; otherwise shards stay larger than the target.
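The effect of the clamp can be checked with the approximate production figures quoted above. This Python sketch assumes the shard count is ceil(total / target) capped at --shard-max; needed_shards and clamped_shards are hypothetical helper names, not the actual BinPacker implementation.

```python
import math

TARGET_SECONDS = 10 * 60       # 10-minute per-shard target
BROKEN_TOTAL_SECONDS = 101     # estimate from the broken lookup (~101s)
FIXED_TOTAL_SECONDS = 10_080   # estimate from the fixed lookup (~2.8h)


def needed_shards(total_seconds, target_seconds):
    """Shards required to keep the average shard at or under the target."""
    return max(1, math.ceil(total_seconds / target_seconds))


def clamped_shards(total_seconds, target_seconds, shard_max=10):
    """Needed shard count, capped at shard_max (default 10, as for --shard-max)."""
    return min(needed_shards(total_seconds, target_seconds), shard_max)


broken_needed = needed_shards(BROKEN_TOTAL_SECONDS, TARGET_SECONDS)  # 1
fixed_needed = needed_shards(FIXED_TOTAL_SECONDS, TARGET_SECONDS)    # 17
fixed_clamped = clamped_shards(FIXED_TOTAL_SECONDS, TARGET_SECONDS)  # 10

# Average shard duration when the plan is capped at the default maximum.
avg_seconds_at_default_max = FIXED_TOTAL_SECONDS / fixed_clamped     # 1008.0
```

At the default cap, the average shard runs about 1008 seconds (roughly 16.8 minutes), well over the 10-minute target. Under the same assumptions, raising --shard-max to at least 17 brings the average under 600 seconds.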