feat(infra): Linux runners on Hetzner Robot bare metal with Kata Containers QEMU microVMs and queue-driven autoscaling

GitHub issue · Closed

Metadata

Source

tuist/tuist #10794

Updated

Jun 24, 2026

Domains

Compute

Details

Summary

Adds the substrate for Linux runners as a sibling to the existing macOS Mac mini fleet. Hetzner Robot bare-metal hosts are adopted by caph (Cluster API Provider Hetzner); runner Pods are QEMU microVMs via Kata Containers, with multi-tenant density per host. Autoscaling is queue-depth-driven from v1.

This PR lands the end-state architecture in one go: no cloud-first staircase, no “containers now, microVMs later” intermediate step. Substrate, runtime, and tenancy story are all what we’d ship if we were starting from scratch with full hindsight.

Why these choices

Bare metal, not cloud. Hetzner Cloud doesn’t expose KVM to guests, which kills the microVM path. Bare metal lifts that ceiling AND drops the cost-per-CPU-minute substantially once a host crosses ~40% sustained utilization.
caph-managed adoption. HetznerBareMetalHost CRs are created/owned by a new hetzner-robot-controller that reflects Hetzner Robot inventory (servers named tuist-bm-*) into the cluster and auto-fills WWNs once caph populates hardware details. Operator workflow: order an AX-class server in Robot → bump the MachineDeployment (MD) replicas → done. No hand-authored host CRs.
Kata Containers + QEMU for the runtime. Each runner Pod is a microVM with its own kernel: real per-tenant kernel isolation against arbitrary CI workloads. This is what AKS Confidential Containers, IBM Cloud sandboxed containers, and OpenShift sandboxed containers all ship; the Kata + Firecracker combination, while AWS-friendly, is rarely run in production k8s because Firecracker has no native 9pfs/virtio-fs and requires the nydus snapshotter (plus a userspace nydusd daemon) to deliver rootfs to the microVM. Kata-qemu reads the host’s overlayfs directory rootfs directly via virtio-fs — no extra snapshotter packaging.
Kata pre-baked into the Node OS, not installed at runtime. kata-static (the official Kata distribution) is unpacked into /opt/kata during the chrooted postInstallScript phase of installimage, before containerd ever starts. The containerd kata-qemu runtime drop-in lands in /etc/containerd/conf.d/ at the same time. When kubelet boots containerd, the kata-qemu handler is already registered. This is the model AKS / Bottlerocket Kata / Talos Kata extensions use; the runtime kata-deploy DaemonSet is upstream’s demo path and proved unworkable on a 64 GB bare-metal Node where every install-induced containerd restart bounced Cilium for ~5-7 min.
Hetzner Robot AX-class as the substrate. AX42-U (8c/16t Zen 4, 64 GB DDR5, 2x512 GB NVMe) for staging at ~€57/mo + €234 setup, AX162-R (48c/96t EPYC Genoa, 256 GB ECC, 2x1.92 TB NVMe DC Edition) for production. Same Zen 4 family across both so kernel tuning transfers. 3-5× cheaper than AWS .metal at the SKU range we care about.
“Lead the demand” warm pool from v1. Server returns rolling p95 of concurrent claims; controller floors the warm pool at it. Kata-qemu cold-start is ~600 ms vs Firecracker’s ~125 ms — irrelevant against GitHub’s webhook → dispatch latency that already dominates the end-to-end path by an order of magnitude.
Server is the signal source, controller is the policy engine. desired_replicas returns raw counts (claimed, queued, p95_concurrent_last_hour); the controller composes them with minWarmPoolFloor + maxReplicas. Tuning lives in chart values, not server config.

Why Kata Containers (and which hypervisor)

The constraint: Tuist runs arbitrary customer code in a multi-tenant SaaS environment, so tenant-A and tenant-B kernels must not be the same kernel. That rules out vanilla runc (shared kernel; one container-escape CVE = full cross-tenant breach). Six categories considered:

Approach	Isolation	k8s-native	Status
`runc` + namespaces	Shared kernel	✓	Insufficient for arbitrary multi-tenant code.
gVisor (user-space sandbox)	Sentry intercepts syscalls; still shared host kernel	✓	Real improvement over runc, but 5-20% performance penalty on syscall-heavy workloads — CI builds (Bazel, Swift, npm install) hit this hard. Still a shared kernel: if the Sentry has a bug, you’re back to runc-class isolation.
Kata Containers (kata-qemu / kata-fc)	Per-tenant kernel via microVM	✓ (via CRI RuntimeClass)	What we picked. Real kernel boundary per Pod, k8s-native.
Direct Firecracker orchestration (Lambda / Fly model)	Per-tenant kernel via microVM	✗ (bypasses k8s, custom orchestrator)	Best snapshot-thaw story (~5-10 ms cold-start), proven at AWS Lambda + Fly.io scale, and the path some self-hosted-runner tooling takes (Actuated, Fireactions). But it means building a parallel orchestration plane next to our k8s clusters: months of work, losing CNI / Services / scheduler, and no reuse of runners-controller. Not justified by our scale.
Dedicated VMs per job	Per-tenant kernel via traditional VM	Sometimes	~30 s cold start, hundreds of MiB overhead per VM, pay for full VM minute on 30-second jobs. Density bad, cost bad.
Hardware enclaves (SEV-SNP, TDX)	Per-tenant kernel + encrypted memory + attestation	✓ via Kata’s confidential-containers path	Reached through kata-qemu, not parallel to it. v2 product feature for regulated customers (defense, fintech, healthcare). Picking kata-qemu over kata-fc keeps this path open; kata-fc has no Kata-integrated CC story.

Why kata-qemu specifically (and not kata-fc)

Within Kata, two production-supported shims: kata-qemu and kata-fc. We started on kata-fc (better hypervisor minimalism, cooler AWS-aligned story) and switched to kata-qemu after hitting a hard wall.

Dimension	kata-qemu (what we ship)	kata-fc
Per-Pod VMM overhead	~30-50 MiB	~5 MiB
Cold-start	~500-1000 ms	~125 ms
Rootfs delivery to microVM	virtio-fs / 9pfs against host’s default overlayfs snapshotter — works out of the box	Needs the nydus snapshotter + `nydusd` userspace daemon — FC has no 9pfs/virtio-fs. This is the wall: stock containerd overlayfs produces a directory rootfs FC literally can’t mount.
Confidential Computing (AMD SEV-SNP, Intel TDX)	✓ production-supported	✗ no Kata integration
GPU passthrough (ML workloads)	✓ via PCIe passthrough	✗ no PCI bus
Nested virtualization (Docker-in-Docker without `--privileged`, KinD)	✓	✗
Wider guest support (BSDs, alternative Linux distros)	✓ at the hypervisor level — QEMU emulates a full PC platform that any kernel can boot	✗ — minimal device model excludes most non-stock-Linux guests. (Note: neither shim gives us Windows CI for free. Kata’s runtime layer assumes a Linux guest with `kata-agent` as PID 1 — there’s no Windows kata-agent. Windows runners would require KubeVirt, Hyper-V Containers on Windows hosts, or direct QEMU orchestration outside Kata.)
Production deployment density in k8s	AKS Confidential Containers, IBM Cloud sandboxed containers, OpenShift sandboxed containers	Niche / academic for the Kata-in-k8s shape specifically. Production Firecracker users (AWS Lambda, AWS Fargate, Fly.io) orchestrate it directly outside Kata, not via the Kata CRI shim.
Snapshot-thaw warm pool (~5-10 ms cold-start)	Not productionized in Kata	Possible but requires orchestration we don’t have

The cost we’re paying for kata-qemu vs kata-fc:

~45 MiB extra VMM overhead per Pod. Against a 4 GiB Pod allocation that’s ~1% memory tax. On a 64 GiB AX42-U with ~16 Pods: ~800 MiB total VMM overhead instead of ~80 MiB, i.e. ~720 MiB extra. Negligible against the workload memory budget.
~500 ms slower cold-start. Pod cold-start total is ~3 s either way (image extract + runner init dominates); the FC win is invisible against GitHub’s webhook-to-dispatch latency that’s already 1-5 s.
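
The memory side of this trade-off can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, assuming ~50 MiB per kata-qemu Pod (the upper end of the range in the table), ~5 MiB per kata-fc Pod, and 16 Pods on an AX42-U; the constant names are illustrative only.

```python
# Back-of-the-envelope VMM overhead check. All figures are the approximate
# values quoted in the comparison above, not measurements.
QEMU_OVERHEAD_MIB = 50          # per-Pod VMM overhead, kata-qemu (upper end of ~30-50)
FC_OVERHEAD_MIB = 5             # per-Pod VMM overhead, kata-fc
POD_ALLOCATION_MIB = 4 * 1024   # 4 GiB Pod allocation
PODS_PER_HOST = 16              # ~16 Pods on a 64 GiB AX42-U

extra_per_pod = QEMU_OVERHEAD_MIB - FC_OVERHEAD_MIB
tax_percent = 100 * extra_per_pod / POD_ALLOCATION_MIB

qemu_total = QEMU_OVERHEAD_MIB * PODS_PER_HOST
fc_total = FC_OVERHEAD_MIB * PODS_PER_HOST

print(f"extra per Pod: {extra_per_pod} MiB ({tax_percent:.1f}% of the Pod allocation)")
print(f"per-host total: {qemu_total} MiB (kata-qemu) vs {fc_total} MiB (kata-fc)")
print(f"per-host extra: {qemu_total - fc_total} MiB")
```

With these inputs the per-host difference is a few hundred MiB on a 64 GiB host, which is why the memory tax is treated as negligible.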

The benefits we get from kata-qemu vs kata-fc:

No nydus / nydusd packaging. Stock overlayfs works. The kata-fc path requires installing nydusd from dragonflyoss/nydus, writing a working nydusd-config.json, tuning fs_driver modes — meaningful integration work we’d have to land before shipping.
Production-mature pattern in k8s. Every commercial Kata-in-k8s deployment uses QEMU. The audit pedigree is there, the operational tooling is mature, hires who’ve operated Kata at scale have done QEMU.
Confidential Computing path stays open. AMD SEV-SNP / Intel TDX for v2 “attestable CI for regulated customers” — kata-qemu only.
GPU passthrough, nested virt — kata-qemu features that have customer use cases we don’t serve today but would want to (ML training in CI, Docker-in-Docker / KinD without --privileged). Windows CI is not in this list — Kata’s Linux-only guest assumption applies to both shims and would need a separate substrate (KubeVirt or Hyper-V Containers).

Switching back to kata-fc if/when needed

The runtime swap itself is a focused change of ~5 lines across the chart and cluster config; most of the remaining work is the nydus pre-bake. If we ever need kata-fc’s ~5 MiB-per-Pod density or its snapshot-thaw advantage at a scale we can’t reach otherwise, the path is:

infra/helm/tuist/templates/kata-qemu.yaml — rename RuntimeClass name / handler to kata-fc. (1 file, 2 lines.)
infra/k8s/clusters/bare-metal-staging.yaml postInstallScript — flip the containerd drop-in: runtime_type to io.containerd.kata-fc.v2, ConfigPath to configuration-fc.toml. (1 file, 2 lines.)
Add the nydus pre-bake: install containerd-nydus-grpc + nydusd binaries, write /etc/nydus/config.toml + /etc/nydus/nydusd-config.json, register the [proxy_plugins.nydus] block in the containerd drop-in, set snapshotter = "nydus" on the kata-fc runtime block. (Same file, ~40 lines of postInstallScript — most of what we wrote during this PR’s debugging arc, deliberately not committed.)
values-managed-*.yaml — runtimeClass: kata-fc.
Cycle the topology so a new HBM picks up the new template snapshot, validate via linux-runners-staging-smoke.yml.

Worst case ~1 PR + 1-2 staging Node cycles. Nothing in our architecture forecloses on FC; we just don’t need it today, and kata-qemu unlocks confidential-containers + GPU + nested-virt paths that FC forecloses on.

Architecture

GitHub workflow_job (queued) ──▶ Tuist server webhook
                                      │
                                      ▼
                                 ClickHouse runner_jobs
                                 (status=queued for fleet)
                                      │
                                      │  warm microVM polls dispatch with SA token
                                      ▼
                                 Tuist.Runners.dispatch_for_sa
                                      • TokenReview validates SA
                                      • Claims.attempt (PG atomic claim)
                                      • Mint JIT with runner_labels ++ [dispatch_label]
                                      • Stamp tuist.dev/runner-pool-owner on Pod
                                      ▼
                                 microVM execs ./run.sh --jitconfig, runs job, exits
                                      ▼
                                 RunnerPoolReconciler reaps Pod + SA, boots replacement

Autoscaling loop (every 5s, per autoscaling-enabled RunnerPool):
                                 AutoscalerReconciler
                                      │ GET /api/internal/runners/desired_replicas?fleet=<name>
                                      ▼
                                 Tuist server: claimed + queued + p95_concurrent_last_hour
                                      ▼
                                 desired = max(claimed+queued, max(min, p95)) + min
                                      │ patch RunnerPool.spec.replicas
                                      │ also patch bound MD in mgmt cluster
                                      ▼
                                 RunnerPoolReconciler:
                                      • scale up → create microVM-shaped Pod + SA
                                      • scale down → delete idle microVMs only

Runtime layer (per node):
                                 Pod with runtimeClassName: kata-qemu
                                      │ containerd routes to kata-qemu shim
                                      ▼
                                 Kata Containers + QEMU
                                      │ Linux microVM with its own kernel
                                      │ virtio-fs shares host rootfs into VM
                                      ▼
                                 actions/runner inside the microVM

Substrate (caph + hetzner-robot-controller, mgmt cluster):
                                 Operator orders AX-class server via Robot panel
                                      │ names it tuist-bm-*
                                      ▼
                                 hetzner-robot-controller reflects Robot inventory
                                      • creates HetznerBareMetalHost CR
                                      • fills disk WWNs from caph hardware-details
                                      ▼
                                 caph claims the host (via MD topology)
                                      • rescue-boot via Robot API
                                      • SSH in, installimage Ubuntu 24.04 to RAID 1
                                      • postInstallScript runs CHROOTed:
                                          - apt: cloud-init + zstd
                                          - unpack kata-static 3.30.0 → /opt/kata/
                                          - write /etc/containerd/conf.d/kata-qemu.toml
                                          - seed /etc/containerd/config.toml
                                      ▼
                                 Box reboots into fresh OS
                                      │ cloud-init preKubeadmCommands install
                                      │   containerd + kubeadm + kubelet, kubeadm join
                                      │ postKubeadmCommands label node:
                                      │   tuist.dev/kata-runtime=true
                                      │   katacontainers.io/kata-runtime=true
                                      ▼
                                 Node ready — containerd starts with kata-qemu
                                 handler already registered. No runtime DaemonSet,
                                 no post-join containerd restart, no Cilium churn.

How scaling works

Two layers, both reactive:

Pod layer (fast, 5s tick). AutoscalerReconciler calls /api/internal/runners/desired_replicas every 5s. The server returns {claimed, queued, p95_concurrent_last_hour}; the controller composes desired = max(claimed+queued, max(minWarmPoolFloor, p95)) + minWarmPoolFloor and patches RunnerPool.spec.replicas. The RunnerPoolReconciler then creates or deletes Pods + SAs to converge. Scale-down only deletes idle Pods (those without the tuist.dev/runner-pool-owner label) so it never kills a runner mid-job, and is debounced by scaleDownCooldownSeconds.
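
The Pod-layer policy can be sketched in a few lines. This is a minimal illustration rather than the controller's Go code: `FleetSignals`, `desired_replicas`, and `scale_down_candidates` are hypothetical names, the Pod dictionaries are a made-up shape, and capping at `max_replicas` is an assumption (the formula does not show the cap, but maxReplicas is listed as a controller input). The scale-down cooldown is not modelled.

```python
from dataclasses import dataclass

OWNER_LABEL = "tuist.dev/runner-pool-owner"  # stamped on a Pod once it claims a job


@dataclass(frozen=True)
class FleetSignals:
    """Raw counts returned by the desired_replicas endpoint."""
    claimed: int
    queued: int
    p95_concurrent_last_hour: int


def desired_replicas(signals: FleetSignals, min_warm_pool_floor: int, max_replicas: int) -> int:
    """desired = max(claimed+queued, max(floor, p95)) + floor, capped at max_replicas.

    The cap is an assumption; see the note above the block.
    """
    demand = signals.claimed + signals.queued
    warm_target = max(min_warm_pool_floor, signals.p95_concurrent_last_hour)
    desired = max(demand, warm_target) + min_warm_pool_floor
    return min(desired, max_replicas)


def scale_down_candidates(pods: list[dict], surplus: int) -> list[str]:
    """Return up to `surplus` idle Pod names; Pods carrying the owner label are never chosen."""
    idle = [p["name"] for p in pods if OWNER_LABEL not in p.get("labels", {})]
    return idle[:max(surplus, 0)]


if __name__ == "__main__":
    signals = FleetSignals(claimed=3, queued=2, p95_concurrent_last_hour=8)
    print(desired_replicas(signals, min_warm_pool_floor=2, max_replicas=14))
```

Keeping the policy a pure function of the three server-side counts plus chart values is what lets tuning live in the chart rather than in server config.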

Node layer (operator-paced). Host count stays operator-managed via the CAPI cluster topology — Hetzner Robot hosts are monthly-billed, so auto-scaling Host count from Pod demand would silently provision unused capacity. To grow capacity, the operator orders another tuist-bm-* box in Robot (the hetzner-robot-controller reflects it into a HetznerBareMetalHost CR) and bumps the bare-metal-worker MachineDeployment’s replicas in cluster-<env>.yaml. caph drives the rescue + installimage + kubeadm-join. Pending Pods are the signal to do this.

Per-host microVM density: ~16 slots on AX42-U, ~64 slots on AX162-R (256 GB RAM cap at the standard 1 vCPU / 4 GB slot). Each Pod is a QEMU microVM (~50 MiB VMM overhead, ~600 ms cold start).
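
The slot counts follow from whichever resource runs out first. The sketch below assumes the standard 1 vCPU / 4 GB slot and ignores memory reserved for the host OS, kubelet, and VMM overhead, which is why the figures above are approximate; `microvm_slots` is an illustrative helper, not part of the PR.

```python
def microvm_slots(host_threads: int, host_ram_gb: int,
                  slot_vcpus: int = 1, slot_ram_gb: int = 4) -> int:
    """Slots per host are bounded by whichever is exhausted first: threads or RAM."""
    return min(host_threads // slot_vcpus, host_ram_gb // slot_ram_gb)


ax42_u = microvm_slots(host_threads=16, host_ram_gb=64)     # 16: threads and RAM bind together
ax162_r = microvm_slots(host_threads=96, host_ram_gb=256)   # 64: RAM is the cap, not threads

print(f"AX42-U: {ax42_u} slots, AX162-R: {ax162_r} slots")
```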

Performance parity with managed-runner vendors

The places we’ll feel slower than competitors today, ranked by how much it matters:

Gap	Why	Fix
Image content	Our `tuist-linux-runner` image carries `actions/runner` + a build/dev tooling layer (~60 apt packages: build-essential, autotools/cmake/ninja, common dev headers, archive tooling, git-lfs, postgres/mariadb/redis/sqlite clients, common CLI utilities, python3 + pip + venv). This covers the bulk of “actions/setup-* compiles native deps” cases (bcrypt_elixir, native Python wheels, native gems, NIFs). What’s missing vs. GitHub-hosted Ubuntu: pre-installed language toolchains (Node, Go, Rust, Java/.NET, multi-version Python/Ruby), cloud CLIs (aws, gcloud, az), Docker-in-Docker, browser test runtimes, ffmpeg/ImageMagick. (Worth noting: commercial runner vendors like Namespace.so don’t actually ship a GitHub-hosted-parity image by default either — their model is “minimal base + customer apt-package customization via Runner Profile” which lands closer to what we have today than to GitHub’s full tool-cache image. The remaining gap is mostly against the GitHub-hosted reference, not against the typical commercial vendor.)	A `tuist-linux-runner-fat` variant tracking GitHub’s `actions/runner-images` repo. Opt-in via `pools[].runnerImage`. Per-pool selection so the lean image stays available for workflows that don’t need the fat one.
Cache locality	Customers reach the Tuist cache over public internet (Tigris, Frankfurt PoP). Competitors run a local cache per VM.	Already on Tigris in Frankfurt; tightening this means a per-pool Kura mesh shard co-located on runner nodes. Follow-up.

The pure-CPU and NVMe sides are now equivalent to (or better than) competitors — dedicated Zen 4 cores, local NVMe DC-class drives, no neighbor noise.

What’s in the PR

feat(server): parametric RunnerPool labels — runnerLabels on the CRD (now required by the schema). JIT mint stops hardcoding macOS/ARM64; Linux pool stamps ["self-hosted", "Linux", "X64"]. The chart renders this field for every pool (per-OS defaults applied in the helm template), so the server treats absent/empty as a chart bug rather than substituting a default.
feat(infra): RunnerPool autoscaling spec — spec.autoscaling.{enabled, minWarmPoolFloor, maxReplicas, scaleDownCooldownSeconds} + status.lastScaleDownAt. Pod-level only — see the Node layer note above.
feat(server): fleet load signals endpoint — Claims.counts_per_fleet/0, Jobs.queued_count_by_fleet/1, Jobs.p95_concurrent_last_hour/1. New GET /api/internal/runners/desired_replicas?fleet=<name>.
feat(infra): autoscaler reconciler — 5s cadence, fail-open on transient server errors, cooldown-gated scale-down, idle-Pod-aware Pod convergence. Pod-only — Host count is operator-managed.
feat(infra): Linux runner image — infra/linux-runner-image/ (Ubuntu 22.04 + actions/runner 2.334.0 + dispatch-poll script).
feat(infra): runnersFleetLinux + parametric scheduling — spec.os (darwin | linux); the controller’s podtemplate switches nodeSelector + tolerations per substrate. Linux pool Pods stamp the tuist.dev/runner-tier=bare-metal:NoSchedule toleration.
feat(infra): hetzner-robot-controller — Go controller in infra/hetzner-robot-controller/. Watches Robot inventory via syself/hrobot-go, creates/deletes HetznerBareMetalHost CRs matching tuist-bm-* server names, auto-fills WWNs from caph’s hardwareDetails. Asymmetric reconciliation: create-by-Robot-name, delete-by-Robot-server-ID. Leader-elected, deployed on the mgmt cluster.
feat(infra): caph bare-metal adoption + Kata + QEMU pre-bake — the production-shape architecture: bare-metal-worker ClusterClass class, HetznerBareMetalMachineTemplate (claims app.kubernetes.io/managed-by: hetzner-robot-controller-labeled hosts) with a postInstallScript that bakes Kata 3.30 binaries + containerd kata-qemu drop-in into the OS at install time. ExternalSecrets for Robot creds + SSH key. kata-qemu RuntimeClass + pools[].runtimeClass chart knob.
fix(infra): CiliumNodeConfig override for bare-metal Node — bare-metal hosts have 64 GB RAM (and AX162-R will have 256 GB). Cilium’s default bpf-map-dynamic-size-ratio: 0.0025 sizes BPF maps as 0.25% of node memory, which on these hosts pushes map allocation past the hive start-hook deadline and fatals the agent before it can bind to its API socket. A per-Node override drops the ratio to 0.0005 (and caps the connection-tracking maps explicitly) — selector tuist.dev/kata-runtime=true confines the override to bare-metal Nodes, leaving the cloud workers untouched.
fix(infra): mgmt-cluster-apply handles CAPI template immutability — HetznerBareMetalMachineTemplate is immutable by CAPI convention. The CI step falls back to delete + apply when kubectl apply reports the immutability error so PR-driven template changes don’t permanently break the workflow.
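
The create-by-name / delete-by-ID asymmetry in the hetzner-robot-controller item above can be illustrated with a small planning function. This is a sketch of one possible reading, not the controller's Go code: `RobotServer`, `HostCR`, and `plan` are hypothetical names, and the real reconciler works against the Robot API and Kubernetes objects rather than in-memory lists.

```python
from dataclasses import dataclass

NAME_PREFIX = "tuist-bm-"  # only servers with this name prefix are reflected


@dataclass(frozen=True)
class RobotServer:
    server_id: int
    name: str


@dataclass(frozen=True)
class HostCR:
    name: str
    server_id: int


def plan(servers: list[RobotServer], hosts: list[HostCR]) -> tuple[list[RobotServer], list[HostCR]]:
    """Return (servers that need a host CR, host CRs that should be deleted)."""
    existing_names = {h.name for h in hosts}
    live_ids = {s.server_id for s in servers}
    # Create is keyed on the Robot server name: a tuist-bm-* server
    # without a CR of the same name gets one.
    to_create = [s for s in servers
                 if s.name.startswith(NAME_PREFIX) and s.name not in existing_names]
    # Delete is keyed on the Robot server ID: a CR is removed only when
    # the server ID it points at has left the inventory.
    to_delete = [h for h in hosts if h.server_id not in live_ids]
    return to_create, to_delete


servers = [RobotServer(1, "tuist-bm-staging-1"), RobotServer(2, "db-1"),
           RobotServer(3, "tuist-bm-staging-2")]
hosts = [HostCR("tuist-bm-staging-1", 1), HostCR("tuist-bm-old", 9)]
to_create, to_delete = plan(servers, hosts)
print(to_create, to_delete)
```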

Test coverage

Elixir: 51 tests across Tuist.Runners.Claims, Tuist.Runners.Jobs, Tuist.Runners.Dispatch, TuistWeb.RunnersController.
Go: 25 tests in the runners-controller; new tests in hetzner-robot-controller covering inventory diff, create-by-name / delete-by-id asymmetry, and the WWN-fill reconciler.
Helm: helm lint clean against values-managed-common.yaml + values-managed-staging.yaml.
End-to-end: A linux-runners-staging-smoke.yml workflow exercises the dispatch + reachability + egress path against the staging Linux fleet. Earlier in the PR’s life a heavier variant of the smoke ran a real server-workflow job (credo) end-to-end on a kata-qemu microVM Pod (gh run view 26019131108) and reported success. That run predates the containerd-version fix in this PR; the smoke needs to be re-validated on a freshly-provisioned bare-metal Node after merge (containerd 2.x tarball install replaces apt’s 1.7.x) before promoting beyond staging.

Operator bring-up procedure (staging)

Documented in infra/runners-controller/AGENTS.md. Summary:

Order an AX42-U from Hetzner Robot panel (FSN1). Name it tuist-bm-staging-<n>. Paste the public key from the hetzner-bare-metal-ssh-key Secret into the order form.
Within ~30s the hetzner-robot-controller notices the new server and creates a matching HetznerBareMetalHost CR.
Bump replicas on the runners-linux MachineDeployment in cluster-staging.yaml.
Apply via mgmt-cluster-apply.yml. caph claims the host (rescue → installimage with kata pre-baked → reboot → cloud-init kubeadm join). The whole cycle takes 8-15 minutes.
Trigger the smoke workflow. The dispatch chain claims a microVM, registers it with GitHub, runs the workflow_job.

What didn’t make it

Snapshot-thaw warm pool, as in AWS Lambda’s ~5-10 ms cold-start. The original Firecracker pitch leaned on this. Kata’s integration of FC snapshot/restore isn’t production-mature, and even with kata-qemu the equivalent integration isn’t there. The “live warm pool” (Pods continuously running, polling dispatch) already gives us single-digit-ms claim latency without needing snapshot orchestration. If we ever need Lambda-class cold-start, the path is direct Firecracker orchestration outside Kata, which is the larger architectural lift discussed in the Why-Kata section.

Pre-merge: bare-metal hardware to order

The PR was validated end-to-end on staging earlier in its life (single AX42-U at FSN1; smoke run 26019131108 ran credo on a kata-qemu microVM Pod and reported success in 5m 12s). The containerd install path has since been corrected (apt 1.7.x → tarball 2.x), so the smoke needs to be re-run on a freshly-provisioned Node post-merge before promoting beyond staging. Before merging, two more orders need to be placed in the Hetzner Robot panel (three hosts in total):

Canary — 1× AX42-U in FSN1, name tuist-bm-canary-1. Mirrors staging exactly. Single host is fine: canary is a deploy-pipeline gate, not an HA tier. ~€57/mo + €234 setup. values-managed-canary.yaml already has runnersFleetLinux wired with minWarmPoolFloor: 2 / maxReplicas: 14.

Production — 2× AX162-R in FSN1, names tuist-bm-production-1 and tuist-bm-production-2. 64 microVM slots per host × 2 = 128 steady-state ceiling. Single-host failure leaves 64 slots online — matches the current Namespace.so Linux concurrency ceiling (64 vCPU / 128 GiB) the fleet is replacing. Sized against the observed P95 of ~60 concurrent on Namespace with the existing pipeline already exceeding the 128 GiB memory limit (144 GiB peak). ~€790/mo + €590 setup. values-managed-production.yaml already has runnersFleetLinux wired with minWarmPoolFloor: 30 / maxReplicas: 120.

Both blocks expect the tuist-bm-* server name prefix in Robot panel so the hetzner-robot-controller reflects each new server into a HetznerBareMetalHost CR automatically. SSH key on the order form: the public key from tuist-k8s-mgmt vault’s HETZNER_BARE_METAL_SSH_KEY item — same key staging uses, fleet-wide.

When to add a 3rd production host: the autoscaler reports rolling P95 via /api/internal/runners/desired_replicas. If sustained P95 climbs over ~50 concurrent for a two-week window, order tuist-bm-production-3 — ~€395/mo + €295 setup. No PR needed; the controller picks it up automatically once it’s named tuist-bm-production-*.
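
The trigger above could be expressed as a simple check. This sketch assumes one P95 sample per day and reads "sustained" as every sample in the window exceeding the threshold; both are assumptions, and `should_order_host` is a hypothetical helper rather than anything the autoscaler exposes.

```python
def should_order_host(daily_p95: list[int], threshold: int = 50, window_days: int = 14) -> bool:
    """True when each of the last `window_days` daily P95 samples exceeds `threshold`."""
    if len(daily_p95) < window_days:
        return False  # not enough history to call the load sustained
    return all(sample > threshold for sample in daily_p95[-window_days:])


print(should_order_host([55] * 14))          # two full weeks above the threshold
print(should_order_host([55] * 13 + [40]))   # one quiet day breaks the streak
```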

Total Hetzner spend introduced by this PR: ~€847/mo + €824 one-time.
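
The total can be reproduced from the per-host prices quoted above. The sketch takes the AX162-R per-host price from the third-host figure (~€395/mo + €295 setup); the `orders` structure is illustrative only.

```python
# (monthly EUR per host, one-time setup EUR per host, host count), from the text above.
orders = {
    "canary (1x AX42-U)": (57, 234, 1),
    "production (2x AX162-R)": (395, 295, 2),
}

monthly = sum(per_month * count for per_month, _, count in orders.values())
setup = sum(per_setup * count for _, per_setup, count in orders.values())

print(f"~EUR {monthly}/mo + EUR {setup} one-time")
```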

Deferred (acceptable follow-ups)

kata-fc + nydus — if we ever want Firecracker specifically, this is the focused PR: install nydusd from dragonflyoss/nydus, write a working nydusd-config.json, configure containerd-nydus-grpc with fs_driver = blockdev (or kata’s preferred mode), wire [proxy_plugins.nydus] in the containerd drop-in. All independently testable, doesn’t touch anything else in this PR.
Custom Node image via Packer — current pre-bake runs the postInstallScript chrooted in stock Ubuntu and downloads kata-static + dependencies per Node provision. A Packer-built image hosted on Hetzner Snapshots would skip the per-Node download (~30-60s saved per provision) and make Node spec fully immutable. Mirrors the pattern in infra/xcresult-processor-image/.
tuist-linux-runner-fat variant for GitHub-hosted-parity — current image carries build/dev tooling + common utilities (closes the bcrypt_elixir / NIF-compile class of failures), but workflows that expect pre-installed language toolchains (Node, Go, Rust, Java/.NET, multi-version Python/Ruby), cloud CLIs (aws/gcloud/az), Docker-in-Docker, browser test runtimes, or media tooling (ffmpeg, ImageMagick) will need either actions/setup-* steps that bootstrap on-the-fly or a separate fat-image variant. The fat variant tracks GitHub’s actions/runner-images repo and is opt-in via pools[].runnerImage.
Linux runner image release pipeline. The image gets SHA-tagged; release-linux-runner-image (semver + Renovate digest rewrite + GitHub Release) isn’t wired into release.yml yet. Operators bump runnersFleetLinux.pools[].runnerImage manually until that lands.
Auto-stamp node.cluster.x-k8s.io/pool label + remove uninitialized taints in postKubeadmCommands. Today these are operator-applied for bare-metal Nodes (CAPI’s label-sync hits a race, and --cloud-provider=external sets uninitialized taints with no CCM removing them for bare-metal). Cheap to automate.

Comments

github-actions[bot] May 19, 2026

🚨 TruffleHog Secret Scan Failed

Verified secrets were detected in this pull request.

Please take the following actions:

Rotate the exposed credential(s) immediately - assume they are compromised
Remove the secret from your code - use environment variables or a secrets manager instead
If the secret was committed previously, you may need to rewrite git history using git filter-repo or similar tools

For more information, check the workflow run logs.

fortmarek May 19, 2026

False positive — docker pull from ghcr.io/trufflesecurity/trufflehog:3.92.4 timed out before the scan executed (Process completed with exit code 125, log line 12:04:49). The workflow posts this comment unconditionally on non-zero exit, but no TruffleHog scan ran. Re-running the workflow.