feat(infra): enable PSI + add CPU-busy% for Linux runner vitals

GitHub issue · Closed

Source: tuist/tuist #11108
Updated: Jun 24, 2026
Domains: Compute

What and why

The first production RUNNER_VITALS data (from #11094, now live) revealed two gaps:

  1. Every PSI (pressure stall information) field was empty. /proc/pressure/* doesn’t exist in the kata guest: the kata kernel is expected to ship CONFIG_PSI=y but boots with PSI disabled. PSI was meant to be the dedicated CPU/memory-starvation signal (the half of “lost communication” that isn’t OOM), so we were blind to it.
  2. No PSI-independent CPU signal beyond a coarse loadavg.
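
A minimal sketch of how the first gap could be diagnosed from inside a guest. The function name psi_state, its three output labels, and the optional proc-root argument are hypothetical and are not taken from vitals.sh; the argument exists only so the check can be pointed at a fake /proc tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the PSI state of the running kernel.
# Usage: psi_state [PROC_ROOT]   (PROC_ROOT defaults to /proc)
psi_state() {
  local proc_root="${1:-/proc}"
  if cat "$proc_root/pressure/cpu" >/dev/null 2>&1; then
    # The pressure file exists and can be read: PSI is active.
    echo "enabled"
  elif grep -qw 'psi=1' "$proc_root/cmdline" 2>/dev/null; then
    # psi=1 is on the kernel cmdline but no PSI data is exposed,
    # for example because the kernel was built without CONFIG_PSI.
    echo "requested-but-absent"
  else
    # No PSI data and psi=1 was never passed at boot.
    echo "disabled"
  fi
}
```

Under these assumptions, the state described in item 1 would print "disabled", and a guest where the psi=1 annotation reached the kernel but had no effect would print "requested-but-absent".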

The change (two scopes)

  • infra (runners-controller): set psi=1 on the kata guest kernel cmdline for Linux kata runner Pods, via the io.katacontainers.config.hypervisor.kernel_params pod annotation. It’s honored because the containerd kata runtime whitelists io.katacontainers.* annotations. macOS pods aren’t kata, so they don’t get it. Covered by tests (TestBuild_LinuxPodEnablesPSIViaKataAnnotation, TestBuild_MacOSPodHasNoKataKernelParamsAnnotation).
  • linux-runner-image (vitals.sh):
    • Add cpu.busy.pct from /proc/stat deltas, a guest-wide CPU-utilization signal that works regardless of PSI, so the CPU-starvation dimension is covered even if psi=1 doesn’t take effect on some kernel.
    • Omit PSI fields when /proc/pressure is absent, so we never log the empty cpu.psi.some.avg10= noise seen in the first data.
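
The /proc/stat delta computation behind cpu.busy.pct can be sketched as follows. This is an illustration under stated assumptions, not the code in vitals.sh: the function name cpu_busy_pct is hypothetical, idle and iowait time are both treated as not busy, and the sampling interval in the usage comment is arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: busy percentage between two aggregate "cpu" lines
# sampled from /proc/stat.
# Usage: cpu_busy_pct PREV_LINE CUR_LINE
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal
# (guest and guest_nice are already counted in user and nice, so they are skipped).
cpu_busy_pct() {
  awk -v prev="$1" -v cur="$2" 'BEGIN {
    split(prev, p); split(cur, c)
    total = 0
    for (i = 2; i <= 9; i++) total += c[i] - p[i]   # user .. steal
    idle = (c[5] - p[5]) + (c[6] - p[6])            # idle + iowait (assumption)
    if (total <= 0) exit 1                          # no elapsed ticks: emit nothing
    printf "%.1f\n", 100 * (total - idle) / total
  }'
}

# Example use (interval chosen arbitrarily for illustration):
#   prev="$(head -n 1 /proc/stat)"; sleep 5; cur="$(head -n 1 /proc/stat)"
#   pct="$(cpu_busy_pct "$prev" "$cur")" && echo "cpu.busy.pct=$pct"
```

Because the function prints nothing and returns non-zero when no ticks have elapsed, a caller can skip the field rather than log an empty value, in the same spirit as the PSI omission above.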

Why both, not just psi=1

I can’t verify the kata kernel’s CONFIG_PSI offline, so shipping psi=1 alone risks a blind deploy-and-hope round-trip. The /proc/stat CPU% gives a reliable starvation signal independently; if psi=1 does take effect, PSI’s stall-time fields are a bonus on top.
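
The "bonus on top" behavior could look roughly like the sketch below: PSI fields are emitted only when the pressure file can be read, and nothing is printed otherwise. The function name emit_psi, the proc-root argument, and every field name other than cpu.psi.some.avg10 are assumptions made for illustration; the actual vitals.sh may emit a different subset.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print PSI values as key=value lines, or nothing at all
# when PSI is unavailable.
# Usage: emit_psi RESOURCE [PROC_ROOT]   (RESOURCE is cpu, memory, or io)
emit_psi() {
  local res="$1" file="${2:-/proc}/pressure/$1"
  [ -r "$file" ] || return 0            # PSI absent: stay silent
  # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
  awk -v res="$res" '{
    kind = $1                           # "some" or "full"
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[2] != "") printf "%s.psi.%s.%s=%s\n", res, kind, kv[1], kv[2]
    }
  }' "$file" 2>/dev/null || true        # unreadable contents: also stay silent
}
```

With this shape, a guest without /proc/pressure produces no PSI lines, while a guest where psi=1 took effect produces lines such as cpu.psi.some.avg10=0.00 alongside the CPU-busy value.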

Validation

  • go build / go vet / go test ./internal/podtemplate/... clean; new annotation tests pass.
  • bash -n + shellcheck -S warning clean on vitals.sh; CPU% math spot-checked.

Deploy note

Both halves apply via the production server deployment (the runners-controller image pin + the runner image pin), same path that just landed #11094. Confirm in Grafana after rollout that cpu.busy.pct appears and the PSI fields populate (or are cleanly absent).

Comments
tuist-atlas[bot] Jun 6, 2026

The changes from this pull request are now available in Runners Controller 0.9.0.

You can update to the new Docker image:

ghcr.io/tuist/tuist-runners-controller:0.9.0
tuist-atlas[bot] Jun 6, 2026

The changes in this pull request are now available in linux-runner-image@0.4.0 (Docker image: ghcr.io/tuist/tuist-linux-runner:0.4.0). Update to this version to enable PSI and the CPU-busy% metric for Linux runner vitals.