fix(server): improve storage artifact retention

GitHub issue · Closed

Metadata

Source

tuist/tuist #11398

Updated

Jun 24, 2026

Domains

Storage

Details

What changed

Added hosted retention coverage for legacy CAS cache artifacts and scheduled it alongside the existing cache retention workers.
Kept Xcode, CAS, module, and Gradle cache artifacts on the shared plan-based cache retention policy, and made managed account handle resolution for object prefixes case-insensitive.
Increased bulk object-delete timeouts and made async delete timeouts return tagged errors so Oban can retry and surface them instead of crashing the worker process.
Reported Oban retry failures to Sentry with worker, queue, and state tags, instead of reporting only final discards.
Added an account-scoped run session archive retention worker that deletes expired session archives from primary storage while preserving command event metadata.
Updated the server JS lockfile resolution for ts-deepmerge to 8.0.0 after CI exposed a Trivy vulnerability in the previous transitive version.
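
The timeout change above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the server's Elixir code; `delete_with_timeout` and the `("ok", …)` / `("error", "timeout")` result shape are hypothetical names chosen here. The idea is that a slow bulk delete yields a tagged error the job runner can retry and report, rather than an exception that takes down the worker.

```python
import concurrent.futures


def delete_with_timeout(delete_fn, keys, timeout_s):
    """Run delete_fn(keys) in a worker thread and return a tagged result.

    Hypothetical helper: on timeout it returns ("error", "timeout")
    instead of raising, so the caller can decide to retry.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(delete_fn, keys)
    try:
        deleted = future.result(timeout=timeout_s)
        return ("ok", deleted)
    except concurrent.futures.TimeoutError:
        # The delete may still finish in the background; the caller only
        # learns that it did not complete within the allowed window.
        return ("error", "timeout")
    finally:
        # Do not block on the still-running delete.
        pool.shutdown(wait=False)
```

A caller would match on the first element of the tuple and treat `"error"` as a retryable failure rather than letting the timeout propagate.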

Why

The production storage investigation showed that Tigris usage was still higher than expected even though old-object deletion workers were running. The first gap was cache-related: Xcode cache objects were covered, but legacy CAS cache objects did not have equivalent hosted retention coverage, and some managed account prefixes could be skipped when the handle casing in an object key differed from the database row.

The bucket review also found run session archives in primary storage that were not covered by the existing daily artifact-retention scheduler. Those archives are user-generated blobs with the same retention expectations as other run artifacts, so leaving them out meant storage could keep growing even when build archives, previews, test attachments, shard bundles, and cache artifacts were cleaned up.

Operationally, storage-retention failures also needed earlier visibility. A retrying Oban job could fail repeatedly without a Sentry event until final discard, which made it too easy to miss transient storage provider, timeout, or batch-size problems.

Root cause

Legacy CAS cache objects were not scheduled through their own hosted retention worker, so the cache cleanup surface did not match the actual object families in the cache bucket. Managed account resolution also depended on exact handle casing from object keys, which could skip valid hosted accounts when the stored prefix differed in case.
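
The casing problem can be sketched as follows. This is an illustrative Python sketch, not the project's code: the `<handle>/...` key layout, the function names, and the account mapping are all assumptions, since production key formats are deliberately not described here. It shows how comparing case-folded handles avoids skipping an account whose stored prefix differs only in case.

```python
def resolve_account(prefix_handle, accounts_by_handle):
    """Look up an account id by handle, ignoring case.

    accounts_by_handle is a hypothetical mapping of handle -> account id.
    """
    index = {
        handle.casefold(): account_id
        for handle, account_id in accounts_by_handle.items()
    }
    return index.get(prefix_handle.casefold())


def account_for_key(object_key, accounts_by_handle):
    """Resolve the account for an object key of the assumed form '<handle>/...'."""
    handle = object_key.split("/", 1)[0]
    return resolve_account(handle, accounts_by_handle)
```

With an exact-match lookup, a key prefixed `acme/` would miss a database row stored as `Acme`; with the case-folded index both resolve to the same account.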

Run session archives are keyed from command events and stored under the run artifact prefix, but the retention scheduler only covered the DB-backed artifact families that already had workers. Since command event metadata lives separately from the object blob, the object needed its own retention worker keyed by command event run time.

The CI failure had a separate root cause: the server lockfile still resolved transitive ts-deepmerge@7.0.3, and Trivy flagged CVE-2026-12644, which is fixed in 8.0.0.

Approach

The cleanup changes stay inside the existing retention architecture rather than adding a one-off bucket scanner. Plan windows remain centralized in Tuist.Storage.RetentionPolicy, account-scoped progress continues to use persisted artifact retention cursors, and deletion still goes through the existing storage abstraction.

For run sessions, the worker queries command events by ran_at, derives the existing session object key, deletes only the archive blob, and keeps the command event row intact for dashboards and analytics. That matches the behavior of the other artifact-retention workers, where old blobs are removed without deleting historical metadata.
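
The run-session flow described above can be sketched roughly as below. This is a Python illustration under stated assumptions, not the actual worker: `CommandEvent`, `session_key`, the key layout, and the dict standing in for the bucket are all hypothetical, and the retention window is passed in rather than read from `Tuist.Storage.RetentionPolicy`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CommandEvent:
    id: str
    account_handle: str
    ran_at: datetime


def session_key(event):
    # Hypothetical key layout; the real layout is not described in the source.
    return f"{event.account_handle}/runs/{event.id}/session.zip"


def delete_expired_session_archives(events, storage, retention_days, now):
    """Delete archive blobs for events older than the window.

    `storage` is a dict standing in for the bucket. The event records
    themselves are never modified or removed.
    """
    cutoff = now - timedelta(days=retention_days)
    deleted = []
    for event in events:
        if event.ran_at < cutoff:
            key = session_key(event)
            # pop() removes only the blob; a missing blob is not an error.
            if storage.pop(key, None) is not None:
                deleted.append(key)
    return deleted
```

Only the blob is removed in this sketch; the list of events is left as it was, which mirrors the stated goal of keeping command event metadata available for dashboards and analytics.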

For the dependency vulnerability, the fix uses the existing pnpm.overrides mechanism and updates only the targeted ts-deepmerge lockfile entries instead of re-resolving the broader JavaScript dependency graph.

Impact

Hosted storage retention now covers the cache and run-session object families found during the bucket review, which should reduce retained Tigris data according to the existing plan windows. Retryable retention failures should show up in Sentry earlier, making timeout or provider issues easier to diagnose before a job reaches final discard.
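
The earlier-visibility behavior can be pictured with a small sketch. It is a Python illustration only: the job dictionary shape, the function names, and the state strings `"retryable"` and `"discarded"` are assumptions made for this example and may not match what the telemetry handler actually receives. It shows the decision to report on retries as well as discards, and the three tags attached to each report.

```python
def sentry_tags(job):
    """Build the tag set attached to a reported failure (hypothetical shape)."""
    return {
        "worker": job["worker"],
        "queue": job["queue"],
        "state": job["state"],
    }


def should_report(job, report_retries=True):
    """Decide whether a failed job attempt is reported.

    Final discards are always reported; retryable failures are reported
    only when report_retries is enabled.
    """
    if job["state"] == "discarded":
        return True
    return report_retries and job["state"] == "retryable"
```

Setting `report_retries=False` in this sketch reproduces the previous behavior, where a job could fail several times before anything was reported.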

The metadata users see in dashboards remains intact; this changes deletion of expired blobs, not historical run, build, test, preview, or command event records. Production object keys, account handles, credentials, and cluster-specific continuation tokens are intentionally omitted from this description.

Validation

cd server && mix test test/tuist/storage/cache_artifact_retention_test.exs test/tuist/storage/workers/delete_expired_cache_artifact_workers_test.exs test/tuist/oban/runtime_config_test.exs test/tuist/storage_test.exs
cd tuist_common && mix test test/tuist_common/oban_telemetry_test.exs
cd server && mix test test/tuist/storage/workers/delete_expired_artifact_workers_test.exs test/tuist/storage/workers/schedule_expired_artifacts_worker_test.exs test/tuist/storage/artifact_retention_cursor_test.exs test/tuist/storage/cache_artifact_retention_test.exs test/tuist/storage/workers/delete_expired_cache_artifact_workers_test.exs
cd server && aube install --frozen-lockfile
cd server && ./mise/tasks/security.sh
git diff --check
gh pr checks 11398 shows the current head’s visible checks passing, including Lint PR and CodeQL. The earlier Server / Security failure was tied to the pre-rebase SHA and is covered locally by the security task above.
