feat(kura): Private Network + NodePort data plane for the Scaleway macOS runner cache

GitHub issue · Closed

Source: tuist/tuist #11256
Updated: Jun 24, 2026
Domains: Kura

What

macOS Tart runner VMs on the Scaleway Mac mini fleet now consume their per-account kura cache nodes over a Scaleway Private Network, dialing a per-account NodePort published by the node hosting their cache pod. Dispatch hands the VM http://<node PN address>:<account NodePort> instead of a cluster-DNS URL, so the runner’s read/write hot path stays inside Scaleway fr-par.

Those per-account Scaleway nodes are full members of the account’s Kura mesh: cache content replicates between the fr-par runner-cache node and the account’s Hetzner nodes in the background, under a per-account CA, so the cache stays coherent across regions while the hot path stays node-local.

Builds on #10982 (Linux runner cache), which left macOS runners reaching a Hetzner kura node across the WAN via the tailnet.

Topology (what we landed on)

Two planes, deliberately separated. The runner’s hot path (read/write on every cache operation) is PN-local in fr-par. Replication (keeping the account’s mesh coherent) runs in the background, bandwidth-limited, over the cluster overlay:

flowchart LR
  subgraph mini["Mac mini (runners-fleet · fr-par)"]
    vm["macOS Tart VM<br/>PN 172.16.0.3 · VLAN 2319"]
  end

  subgraph scw["Scaleway fr-par"]
    pod["per-account runner-cache pod<br/>node 172.16.0.2 · SBS · egress cap 750M"]
  end

  subgraph htz["Hetzner umbrella cluster"]
    eu["account's eu-central pod"]
    stg["account's staging pod"]
  end

  vm ==>|"HOT PATH: PN NodePort http://172.16.0.2:&lt;port&gt;<br/>fr-par-local · 5.11x cold / 2.27x warm vs public"| pod
  pod <-.->|"REPLICATION: account peer mesh<br/>per-account-CA mTLS · bandwidth-limited · cluster overlay (WAN)"| eu
  pod <-.-> stg

Hot path (PN, fr-par-local). Dispatch hands the VM Server.url (http://<node PN IP>:<account NodePort>), which the reconciler re-stamps every converged tick so the endpoint follows the pod if it reschedules. The NetworkPolicy admits the PN subnet via an ipBlock (NodePort clients arrive with their real source IP under externalTrafficPolicy: Local, matching no namespaceSelector), and the Cilium bandwidth manager caps each tenant’s egress on the shared node NIC.
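
To illustrate why the `ipBlock` rule is what admits the runners, here is a minimal sketch using Python's standard `ipaddress` module. It assumes the PN CIDR is 172.16.0.0/22 (the value listed under the follow-ups); `admitted_by_ip_block` and the outside address are hypothetical and are not code from this PR.

```python
import ipaddress

# Assumed PN CIDR (taken from the per-environment follow-up notes).
PN_CIDR = ipaddress.ip_network("172.16.0.0/22")


def admitted_by_ip_block(source_ip, cidr=PN_CIDR):
    """Return True if a client source IP falls inside the ipBlock CIDR."""
    return ipaddress.ip_address(source_ip) in cidr


# The Tart VM from the diagram keeps its real source IP under
# externalTrafficPolicy: Local, so the ipBlock matches it directly.
vm_admitted = admitted_by_ip_block("172.16.0.3")

# Hypothetical client outside the PN: this rule does not admit it.
outsider_admitted = admitted_by_ip_block("192.168.1.10")
```

Because the client is identified by address rather than by namespace, the check is a plain CIDR containment test, which is the part a `namespaceSelector` cannot express for traffic arriving from outside the cluster.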

Replication plane (mesh, over the WAN). Each account’s pods (the fr-par runner-cache node plus its Hetzner nodes) discover each other through the per-account headless Service (KURA_GLOBAL_DISCOVERY_DNS_NAME = kura-<account>-peers...) and replicate over the peer port with mutual TLS. It rides the Cilium node-to-node overlay across providers, is bandwidth-limited, and is fully off the hot path: a build’s cache round-trips never wait on it. So cache content does cross the WAN (correcting the earlier “never leaves fr-par” framing), but asynchronously and rate-limited, never on the latency-sensitive read/write.

This is the macOS counterpart to the Linux runner cache from #10982. It follows the same “co-locate the per-account kura pod next to the fleet” idea, with a different hot-path data plane because the runtimes sit on different networks:

| | Linux fleet (#10982) | macOS fleet (this PR) |
| --- | --- | --- |
| Runner runtime | kata Pod on the cluster pod network | Tart VM on the Scaleway PN |
| Cache node pool | co-located Linux pool (Hetzner) | co-located `kura-scw-fr-par` (Scaleway fr-par) |
| Hot-path data plane | ClusterIP DNS `<instance>.kura.svc.cluster.local:4000` | NodePort over the PN |
| Endpoint dispatch hands out | stable Service URL | `http://<node PN IP>:<NodePort>`, re-stamped as the pod moves |
| Source-IP / NetworkPolicy | in-cluster `namespaceSelector` | `externalTrafficPolicy: Local` + `ipBlock` (PN CIDR) |
| Replication | in-cluster peer mesh | same peer mesh, now spanning fr-par↔Hetzner per account |

Both fleets converge through the same runner_cache.ex reconciler and the same runner_cache_endpoint_url/2 handoff; the region spec (data_plane, client_cidrs, pod_annotations, runner_platforms, mesh) is what makes one region cluster-DNS and another NodePort, so adding a locality is a values change, not a code change.
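
As a rough sketch of how a region spec can select the data plane without a code change: the actual handoff lives in the server's `runner_cache_endpoint_url/2`, so everything below (the `RegionSpec` and `InstanceStatus` shapes, the `http` scheme on the cluster-DNS URL, the region name `linux-example`, and the instance name `kura-acme`) is an assumption for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegionSpec:
    # Field names mirror a subset of the region spec keys named above.
    name: str
    data_plane: str            # "cluster_dns" or "node_port"
    client_cidrs: tuple = ()
    mesh: bool = True


@dataclass(frozen=True)
class InstanceStatus:
    # Hypothetical shape of what the controller reports per instance.
    instance: str
    node_address: Optional[str] = None
    node_port: Optional[int] = None


def endpoint_url(region, status):
    """Return the URL dispatch would hand out, or None if not ready yet."""
    if region.data_plane == "cluster_dns":
        return f"http://{status.instance}.kura.svc.cluster.local:4000"
    if region.data_plane == "node_port":
        if status.node_address is None or status.node_port is None:
            return None  # node-port chain not observed yet
        return f"http://{status.node_address}:{status.node_port}"
    raise ValueError(f"unknown data_plane: {region.data_plane}")


linux = RegionSpec("linux-example", "cluster_dns")
macos = RegionSpec("scw-fr-par-runners", "node_port", ("172.16.0.0/22",))

before = endpoint_url(macos, InstanceStatus("kura-acme", "172.16.0.2", 30985))
# If the pod reschedules to another node, re-deriving the URL follows it.
after = endpoint_url(macos, InstanceStatus("kura-acme", "172.16.0.4", 30985))
```

Re-deriving the URL from observed status on every tick is what lets the endpoint follow the pod; the cluster-DNS branch never changes because the Service name is stable.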

Per-account CA isolation

Cross-region peering is gated on the new mesh flag, set on every managed and private region (single cluster, so in-cluster discovery spans all of an account’s nodes with no cross-cluster gateway needed). The controller maintains a per-account CA (kura-<account>-peer-ca) and signs each instance’s peer leaf from it, with SANs covering the account peer Service (the SNI the replication client verifies). This is both the isolation boundary and a functional requirement: a leaf signed by one account’s CA is rejected at the TLS handshake by another account’s pods, and the previous auto-issued per-instance self-signed CA could not have authenticated an account’s own instances to each other. Control-plane auth/usage still goes to KURA_CONTROL_PLANE_URL over the public path; that is unchanged.

Why this design

The Mac VMs are not on the cluster’s pod network, so the Linux fleet’s ClusterIP path can’t serve them. Three candidate data planes:

Tailnet subnet router (the interim this branch started with): works, but adds WireGuard crypto on the host, an MTU-1280/MSS-1200 clamp, rides the minis’ metered public bandwidth, and funnels through one router pod.
Node-as-router over the PN: needs bpf-lb-external-clusterip=true flipped cluster-wide (Cilium KPR doesn’t translate ClusterIPs for externally-arriving traffic), and a single routed gateway caps the entire fleet at one node’s NIC.
NodePort over the PN (chosen): no cluster-wide Cilium change, client source IPs survive (externalTrafficPolicy: Local) so NetworkPolicies stay meaningful, and it scales out, since each account’s endpoint follows its pod’s node so adding nodes adds aggregate bandwidth with no gateway bottleneck.

The M4 fleet move decides it: Scaleway includes 10 Gb/s on the Private Network for M4-M/M4-Pro minis, vs 1 Gb/s public by default (M1/M2 minis get 1 Gb/s on both). Instance-side PN bandwidth is a separate budget from public, so VXLAN/control traffic never competes with cache reads. The PN is simultaneously the isolated path and the fast path.

How

kura-controller: KuraInstance gains exposeNodePort (renders an <instance>-external NodePort Service pinned to the primary pod, allocated ports preserved across reconciles), clientCIDRs (NetworkPolicy ipBlock), podAnnotations (per-region kubernetes.io/egress-bandwidth caps), and mesh (controller-managed per-account peer CA + account-scoped leaf SANs; flips crossRegionRuntimeEnabled so KURA_GLOBAL_DISCOVERY_DNS_NAME and the peer mesh turn on). Status reports nodeAddress (the node’s tuist.dev/pn-ipv4 label) + allocated NodePorts.
server: the scw-fr-par-runners region declares data_plane: :node_port, the PN client CIDR, and a 750M egress cap; every managed and private region sets mesh: true. Activation waits for the observed node-port chain the way it waits for public DNS, and a refresh on every converged reconciler tick re-stamps Server.url when the primary pod moves nodes. Dispatch itself is untouched (it already serves Server.url).
infra: scaleway-csi chart values + ExternalSecret (per-env SCALEWAY_API 1P item) providing the scw-bssd StorageClass; hcloud-csi’s node plugin skips the kura-scw-fr-par pool. The deploy now builds + pins the capi and runners-controller images per deploy-SHA (see Bugs).

Bugs found during the staging rollout (fixed here)

CRD pruning + stale manifest revision (34c5f3ccf7): helm never upgrades crds/ on upgrade, so instances created against the old CRD had the new spec fields silently pruned, and a matching manifest-revision annotation meant they were never re-applied. Revision bumped so every instance converges declaratively once the CRD lands.
Nodes RBAC in a namespaced Role (ea5e2a7788): Nodes are cluster-scoped; controller-runtime’s Node informer got “nodes is forbidden” and every KuraInstance reconcile stalled. Now a dedicated ClusterRole.
Server SA was a helm hook (10d07f737d): before-hook-creation deleted and recreated the tuist-server ServiceAccount on every deploy, invalidating every running pod’s projected token for up to a token-refresh period (kura-reconciler 401s, and the runner-dispatch TokenReview path broke after every deploy). The migration Job now has its own hook-managed SA; the server SA is a plain release resource with resource-policy: keep. (Superseded by main’s evolution of the same fix during the rebase.)
capi / runners-controller flag-vs-image skew (bbb861eedc, 3bb053f80bd): the deploy resolved these two component images from the latest <component>@ semver tag while deploying the chart at the branch SHA. This branch adds new manager flags to both (--vm-cluster-dns-ip / --vm-kura-egress-cidr on capi; -cluster-dns-ip / -cluster-domain on runners-controller), so the stale released binaries hit flag provided but not defined, os.Exit(2), CrashLoopBackOff. helm --wait then failed and --atomic rolled the whole release back (it ate two staging deploys before this was understood). Fix: build and pin both images per deploy-SHA, exactly as the server and kura-controller images already are, so the binary and the flags the chart passes it always come from one commit. This eliminates the skew class for good; the @-semver release path stays for tracking only.

Declarative provisioning (found by driving it live on staging)

The runner-cache node is now ordered declaratively by the in-house Scaleway CAPI provider (a new ScalewayInstanceMachine kind + kura-fleet MachineDeployment), instead of being hand-joined. Bringing that path up end-to-end surfaced a chain of bugs, each fixed here:

Provision/delete not idempotent (346e0219e1f): CreateInstance bundled create + PN-attach + power-on, so a partial failure stranded a half-configured paid instance, and DeleteInstance unconditionally powered off (Scaleway rejects that on an already-stopped server, wedging finalizer removal). Split into find-or-create + idempotent EnsurePrivateNIC/EnsurePoweredOn (ServerID recorded before them, providerID only after both succeed) and a stop-then-delete state machine tolerant of already-stopped / already-gone.
No public IP (87761e20a35): the instance came up with only its PN NIC, so cloud-init couldn’t pull kubelet/containerd or reach the externally-managed control plane and the node sat in Bootstrapping forever. Request a dynamic public IPv4; cache traffic still rides the PN.
cloud-init aborted under dash (7b365e10711): runcmd runs under /bin/sh (dash), which rejects set -o pipefail, so the whole bootstrap aborted on its first line and kubelet was never installed. Moved the bootstrap into a bash script invoked via runcmd: [bash, ...].
Kubelet RBAC incomplete (6ea62c378d1, eacf38286b9, 770df4c5eb0): the per-node identity bound to the chart’s tart-kubelet ClusterRole, which only carries what the macOS shim exercises. A real Linux kubelet was forbidden its heartbeat Lease and Services (every system pod failed CreateContainerConfigError "services have not yet been read"), then PVC/PV/volumeattachments and csidrivers once cache pods landed. Added the missing perms, then bound the Linux identity to system:node for good (5c3d7ed66ba, see follow-ups).
pn-ipv4 never resolved (d9f8cb379ef, 6674f866b37, 868c67a90ba): the IPAM ListIPs call set both Zonal and PrivateNetworkID (the API wants exactly one), filtered by the lowercase Instance MAC while IPAM stores it uppercase, returned the NIC’s first address (the IPv6), depended on GetServer populating PrivateNics, and omitted the regional Region field, all under a swallowed error so the label silently never landed. Now lists by PN + Region, matches the row by resource name or MAC (case-insensitive), returns the IPv4, requeues until stamped, and logs the outcome.
scaleway-csi didn’t tolerate the runner-cache taint (8356026f832): its node DaemonSet never scheduled on the tainted fleet node, so the node’s CSINode registered no scaleway driver and the scw-bssd cache PVCs couldn’t attach there. Added the toleration.
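
The idempotency fix in the first item above can be sketched against an in-memory stand-in. This is not the provider's Go code: `FakeCloud`, `provision`, `delete`, and the `fake://` provider ID are hypothetical names, and the sketch only models the ordering (server ID recorded first, providerID last) and the tolerant delete.

```python
class FakeCloud:
    """In-memory stand-in for a cloud API; hypothetical, for illustration."""

    def __init__(self):
        self.servers = {}          # server id -> {"state", "pn_attached"}
        self.creates = 0           # how many servers were ever created
        self.fail_next_attach = False

    def find_or_create(self, name):
        if name not in self.servers:
            self.servers[name] = {"state": "stopped", "pn_attached": False}
            self.creates += 1
        return name  # the name doubles as the server ID in this sketch

    def ensure_private_nic(self, server_id):
        if self.fail_next_attach:
            self.fail_next_attach = False
            raise RuntimeError("transient PN attach failure")
        self.servers[server_id]["pn_attached"] = True

    def ensure_powered_on(self, server_id):
        self.servers[server_id]["state"] = "running"


def provision(cloud, machine):
    """Safe to re-run after a partial failure."""
    # Record the server ID first so a retry finds the same server.
    machine["server_id"] = cloud.find_or_create(machine["name"])
    cloud.ensure_private_nic(machine["server_id"])
    cloud.ensure_powered_on(machine["server_id"])
    # The provider ID is only set once both ensure steps have succeeded.
    machine["provider_id"] = f"fake://{machine['server_id']}"


def delete(cloud, machine):
    """Stop-then-delete that tolerates already-stopped and already-gone."""
    server = cloud.servers.get(machine.get("server_id"))
    if server is None:
        return "already-gone"
    if server["state"] == "running":
        server["state"] = "stopped"  # only power off when actually running
    del cloud.servers[machine["server_id"]]
    return "deleted"
```

A failed attach leaves `server_id` set and `provider_id` unset, so the next reconcile reuses the same server instead of ordering a second paid instance.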

Declarative node provisioning (replaces the hand-joined stand-in)

The kura-scw-fr-par node is provisioned declaratively by the cluster-api-provider-scaleway-applesilicon operator. A new ScalewayInstanceMachine kind orders a regular Scaleway Instance (PRO2-S) with a dynamic public IP, attaches it to the runner-cache PN, mints a per-machine kubelet identity, and renders a self-join cloud-init that installs containerd/kubelet and registers the node (this cluster’s control plane is externally managed, so there is no CAPI kubeadm join). A kura-fleet MachineDeployment drives it with the runner-cache taint and pool label; the controller stamps the foreign providerID and the tuist.dev/pn-ipv4 label. scaleway-csi tolerates the runner-cache taint and the node kubelet identity carries the volume-related RBAC, so the cache pods’ scw-bssd volumes attach on the node.

Validated end-to-end on staging: the declarative node joined and reached Ready with providerID + tuist.dev/pn-ipv4 + pool label + taint, all 11 per-account cache StatefulSets migrated onto it (volumes re-attached via scaleway-csi, every KuraInstance Ready and meshed), and the original hand-joined PRO2-S was drained and its instance terminated.

Two operational prerequisites live in Scaleway IAM (not the repo) and must be granted to each environment’s provider key: PrivateNetworksFullAccess (attach the instance to the PN) and IPAMReadOnly (read the PN address for the pn-ipv4 label).
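
For context on what the IPAM read is used for, here is a hedged sketch of the row selection described in the pn-ipv4 fix: match by resource name or case-insensitive MAC, and return the IPv4 rather than the first address. The row shape (`resource_name`, `mac`, `address`) and the sample values are assumptions, not the Scaleway IPAM response format.

```python
import ipaddress


def pick_pn_ipv4(rows, resource_name, mac):
    """Return the IPv4 of the matching IPAM row, or None to requeue."""
    for row in rows:
        name_match = row.get("resource_name") == resource_name
        mac_match = row.get("mac", "").lower() == mac.lower()
        if not (name_match or mac_match):
            continue
        # Rows may carry a prefix length; keep only the address part.
        addr = ipaddress.ip_interface(row["address"]).ip
        if addr.version == 4:
            return str(addr)
    return None  # caller requeues until the label can be stamped


# Hypothetical rows: the IPv6 comes first and the MAC is stored uppercase.
rows = [
    {"resource_name": "kura-node", "mac": "AA:BB:CC:DD:EE:FF",
     "address": "fd5f:519c:6d46::2/64"},
    {"resource_name": "kura-node", "mac": "AA:BB:CC:DD:EE:FF",
     "address": "172.16.0.4/22"},
]

label_value = pick_pn_ipv4(rows, "other-name", "aa:bb:cc:dd:ee:ff")
```

Returning `None` instead of swallowing the miss is what lets the caller requeue and log, rather than silently leaving the label unset.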

Cache node sharing: pod-per-customer on a shared node (for now)

We run one cache pod per customer, but co-located on a shared Scaleway node, not a dedicated node per customer. The cache pod is I/O-bound and light on CPU/RAM (the Hetzner kura nodes run on 2 vCPU ccx13), so one PRO2-S comfortably hosts many per-account pods. Sharing amortises the node: one ~EUR160/mo PRO2-S serves all accounts, and adding a customer adds only a ~EUR4/mo cache volume.

The trade-off is bandwidth. Scaleway gives each instance a dedicated, guaranteed per-connection allocation (unlike Hetzner Cloud, whose cloud VMs are best-effort shared at ~300-500 Mbps), but it’s coupled to instance size: PRO2-S is 1.5 Gbps to the PN, separate from its 1.5 Gbps internet. On a shared node that 1.5 Gbps PN is split across the co-located pods behind a 750M per-tenant egress cap, so roughly 2 tenants at full cache-read saturate the node; the fleet-spread controller distributes pods as the pool grows.
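
The "roughly 2 tenants" figure is plain division; a small sketch makes the arithmetic explicit. The even fair-share split in `per_tenant_share` is a simplifying assumption, not a description of how Cilium actually schedules competing flows.

```python
PN_LINK_GBPS = 1.5      # PRO2-S Private Network allocation
TENANT_CAP_GBPS = 0.75  # the 750M per-tenant egress cap


def tenants_to_saturate(link_gbps, cap_gbps):
    """How many tenants at their full cap fill the link."""
    return link_gbps / cap_gbps


def per_tenant_share(link_gbps, cap_gbps, active_tenants):
    """Throughput each tenant gets if the link is split evenly (assumed)."""
    return min(cap_gbps, link_gbps / active_tenants)


saturating = tenants_to_saturate(PN_LINK_GBPS, TENANT_CAP_GBPS)
share_with_four = per_tenant_share(PN_LINK_GBPS, TENANT_CAP_GBPS, 4)
```

With four tenants reading at once, each would see about half its cap under this even-split assumption, which is the contention the Node Bandwidth panels are meant to surface.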

This is a values-level decision, reversible per environment. Moving to a dedicated node per (enterprise) customer is a config change, not a rewrite: the region spec and per-account scheduling already model pod-per-customer, so we would raise kuraFleet replicas / add a per-customer pool and pin each account’s pod. Scaleway economics are ~EUR100/Gbps/month either way: a dedicated PRO2-S gives a customer the full uncontended 1.5 Gbps (~EUR165/mo), or a POP2-HN (“High Network”) instance scales to 5 Gbps on lighter compute (~EUR495/mo). To know when that move is worth it, this PR adds host-NIC bandwidth analytics (the “Node Bandwidth” dashboard panels + the hostMetrics node_network_* allow-list) so we can watch the shared node’s PN throughput against the Hetzner kura nodes.

Update: measured bare metal vs PRO2-S (supersedes the ~EUR100/Gbps speculation above)

We benchmarked this directly on Scaleway (throwaway staging boxes, PRO2-S vs an EM-B220E-NVME Elastic Metal box, same client, all over the PN):

| Over the PN | PRO2-S (today) | Elastic Metal `EM-B220E` |
| --- | --- | --- |
| Sustained throughput | 1.52 Gbit/s (after a ~5 Gbit/s / ~10s burst, then a hard token-bucket clamp) | 6.7 Gbit/s flat, no clamp (capped by the 6.4G client; real ceiling ~10 Gbit/s) |
| Disk read (cold) | scw-bssd 523 MB/s (4.2 Gbit/s) | local NVMe 3442 MB/s (27.5 Gbit/s) |
| Realistic 6 GB parallel pull, cold / warm | 2.78 / 2.05 Gbit/s | 4.68 / 4.67 Gbit/s |
| Price | ~EUR163/mo, 32 GB RAM | ~EUR120/mo, 64 GB RAM |

Two things the spec sheet hides. First, the PRO2-S PN is token-bucketed: a single bursty cache fetch rides ~5 Gbit/s, but sustained / multi-tenant load clamps to 1.52 Gbit/s. Second, the disk is not the PRO2-S bottleneck (scw-bssd’s 4.2 Gbit/s sits above the PN), so NVMe only matters once the NIC can outrun it, which is exactly the 10G Elastic Metal regime. Net: Elastic Metal is ~1.7-2.3x on a single bursty pull and ~4.4-6.5x under sustained shared load, with no burst-credit dependency, and it is cheaper.
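
The ratios quoted above follow from the benchmark table; this short sketch reproduces them so the ranges are checkable. All inputs are the measured figures from the table, and the "ceiling" case uses the estimated ~10 Gbit/s rather than a measured value.

```python
def mb_s_to_gbit_s(mb_per_s):
    """Convert MB/s (decimal megabytes) to Gbit/s."""
    return mb_per_s * 8 / 1000


# Disk read throughput expressed in the same unit as the network figures.
bssd_gbit = mb_s_to_gbit_s(523)
nvme_gbit = mb_s_to_gbit_s(3442)

# Single bursty pull: Elastic Metal vs PRO2-S, cold and warm.
bursty_cold = 4.68 / 2.78
bursty_warm = 4.67 / 2.05

# Sustained load: measured 6.7 Gbit/s, and the estimated ~10 Gbit/s ceiling.
sustained_measured = 6.7 / 1.52
sustained_ceiling = 10 / 1.52

print(f"scw-bssd {bssd_gbit:.1f} Gbit/s, NVMe {nvme_gbit:.1f} Gbit/s")
print(f"bursty {bursty_cold:.1f}x to {bursty_warm:.1f}x")
print(f"sustained {sustained_measured:.1f}x to {sustained_ceiling:.1f}x")
```

The upper sustained figure depends on the estimated NIC ceiling, so it is the softest number in the comparison.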

So the direction this PR takes (now landed; see the closing “Elastic Metal pivot landed” update) is to put the cache node on Scaleway Elastic Metal rather than a bigger instance: a ScalewayElasticMetalMachine kind in the in-house CAPI provider plus a local-NVMe storage class (Elastic Metal can’t attach scw-bssd, but a regenerable cache wants fast local NVMe anyway). The shared pod-per-customer model is unchanged; the node underneath just gains a 10G NIC and local NVMe. A box like EM-B220E (64 GB RAM, 2x1 TB NVMe, 10G) realistically holds ~15-20 customer pods before RAM (warm-set page cache) or local disk binds, with the PN comfortably oversubscribing bursty CI load; a larger box (EM-B320E / EM-I220E) scales that toward ~40.

Sizing the bare-metal node. Optimize for RAM (cgroup-charged page cache = each tenant’s warm set and hit rate) and local NVMe capacity (per-account volumes), not PN bandwidth or CPU, which a 10G/25G box oversubscribes for a light, bursty cache. By EUR/pod (~EUR5.5-6.5 across the balanced boxes) the Iridium line is the sweet spot: RAM-rich, balanced NVMe, 25G PN headroom. So production runs EM-I220E (128 GB RAM / 1.92 TB NVMe / 25G PN, ~EUR230/mo, ~30-40 tenants; run 2+ for failure isolation, step up to EM-I320E when a node fills), and staging/canary run EM-B220E (~EUR120/mo, the cheapest in-stock box that still mirrors the production NVMe + 10G path for validation). Set per-env via kuraFleet.machine.offerType. The binding constraint is RAM + disk, not bandwidth: scale out (add nodes) when those fill.

Validation

All of the below ran against the branch deployed to staging (helm rev 292, with server / kura-controller / capi / runners-controller all on images built from the same deploy SHA).

kura-controller go test ./... (incl. new per-account-CA tests: leaf signed by account CA, CA shared across an account’s instances, cross-account leaf rejected); server kura + runners-controller suites.
Mesh-wide live: 11 per-account CA secrets + 11 account peer Services; the fr-par runner-cache node discovers all 4 of an account’s peers, completes bootstrap, applies replicated artifacts, and keeps its push outbox drained to 0. Sampled accounts have distinct CA fingerprints (cross-account isolation holds on the live cluster).
macOS-over-PN combined smoke 27554115810: real dispatch-claimed macOS runner VM, routed phase pinned to the PN NodePort http://172.16.0.2:30985, real Gradle build-cache traffic:

| Phase (full Gradle build wall-clock) | Routed (PN) | Baseline (public) | Speedup |
| --- | --- | --- | --- |
| Cold (local caches purged) | 4.52 s median | 23.11 s | 5.11x |
| Warm | 4.08 s median | 9.25 s | 2.27x |
Per-request latency from the mini: PN NodePort /up p50 2.3 ms vs public path p50 70 ms / p90 147 ms, roughly 30x per round-trip; cold-build savings approximate that delta times the build’s serialized cache round-trips.
PN throughput mini→node: 117 MB/s = M2 NIC line rate (M4 raises the PN ceiling to 10 Gb/s).
Declarative provisioning + migration (live, staging): the kura-fleet MachineDeployment ordered the PRO2-S, the controller attached the PN + public IP, the bash cloud-init self-joined, and the node reached Ready with providerID + tuist.dev/pn-ipv4: 172.16.0.4 + pool label + taint. All 11 per-account cache StatefulSets then drained off the hand-joined node onto the declarative node (volumes re-attached, all KuraInstances Ready and meshed), and the hand-joined instance was terminated.
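
The speedups and the latency ratio above can be re-derived from the reported medians. The last figure, an implied count of serialized round-trips, is a back-of-envelope inference from the "delta times round-trips" approximation and was not measured in this PR.

```python
def speedup(baseline_s, routed_s):
    """Wall-clock speedup of the routed (PN) phase over the baseline."""
    return baseline_s / routed_s


cold = speedup(23.11, 4.52)
warm = speedup(9.25, 4.08)

# Per-request p50 latency: public 70 ms vs PN NodePort 2.3 ms.
latency_ratio = 70 / 2.3

# Inference only: if the cold-build saving were entirely serialized
# round-trips, this is how many round-trips it would correspond to.
per_trip_saving_s = (70 - 2.3) / 1000
implied_round_trips = (23.11 - 4.52) / per_trip_saving_s

print(f"cold {cold:.2f}x, warm {warm:.2f}x, latency {latency_ratio:.0f}x")
print(f"implied serialized round-trips: {implied_round_trips:.0f}")
```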

Elastic Metal pivot landed + macOS runner-cache validated e2e on staging

The bare-metal pivot described above is no longer a follow-up — it landed in this PR and is serving on staging. The PRO2-S provisioning and validation sections above are the now-historical stand-in that preceded it (kept for the reasoning trail).

What landed. The in-house Scaleway CAPI provider gained a ScalewayElasticMetalMachine kind (orders an Elastic Metal box, enables then attaches the PN as a server option, self-joins via cloud-init), driven by the kura-fleet MachineDeployment. Elastic Metal can’t attach scw-bssd, so the cache’s scw-local-nvme StorageClass is backed by a local-path provisioner on the box’s NVMe. Box per env via kuraFleet.machine.{kind: elasticMetal, offerType}: staging/canary EM-B220E-NVME, production EM-I220E.

Validated e2e on staging (our own tuist account, region scw-fr-par-runners), after fixing the bugs below:

EM node provisions declaratively (order → PN-option enable → PN attach VLAN → self-join → Ready with clusterDNS + tuist.dev/pn-ipv4 + runner-cache taint) via the in-cluster operator.
Cache pod 1/1 on the EM node, backed by local NVMe, state:serving.
Meshes with all 4 of the account’s Hetzner peers (eu-central-1 ×3 + staging-0), writer_lock_owned — cross-cloud (Scaleway↔Hetzner) replication over the Cilium overlay.
Server marks the Kura.Server :active; runner_cache_endpoint_url(account, :macos) returns the node-port URL http://<node PN IP>:<NodePort>, and the cache answers 200 on that exact PN endpoint.
The staging runners smoke (runners-staging-smoke.yml) now asserts this from inside a real Tart VM (curl $TUIST_CACHE_ENDPOINT/up over the PN).

Bugs found driving the EM pivot live (fixed here):

local-path-provisioner missing pods: create/delete RBAC: the provisioner sets up each PV’s hostPath via a short-lived helper Pod; without create/delete on pods every provision failed with a 403 and the cache PVCs hung Pending, so the pod never scheduled onto the EM node.
apiserver kubelet-preferred-address-types had no InternalIP: it was ExternalIP,Hostname,InternalDNS,ExternalDNS. Cross-cloud nodes (Elastic Metal + the macOS PN fleet) report a reachable InternalIP but no ExternalIP and a Hostname the Hetzner apiserver can’t resolve, so it fell through to the Hostname and kubectl logs/exec to those nodes’ pods failed (no such host). Delivered as a ClusterClass kubeletPreferredAddressTypes variable + patch — CAPI rejects in-place KubeadmControlPlaneTemplate edits (immutable spec). The default equals the current value, so it applies as a verified no-op (every live control plane already carries it); inserting InternalIP is then a deliberate per-env variable flip that rolls only that env’s control plane.
KuraInstance storage-class migration isn’t declarative (finding): a StatefulSet’s volumeClaimTemplates are immutable and the controller updates-in-place, so flipping a live instance’s storageClassName (scw-bssd→scw-local-nvme) silently no-ops; it needs a StatefulSet delete+recreate (done manually here).
stuck-:failed runner-cache nodes aren’t auto-retried (finding): nodes_to_retry only self-heals servers with current_image_tag == nil, so a node that deployed then failed is never re-provisioned; cleared by an operator reset during bring-up.
mise lockfile gaps for the staging deploy (pomerium/cli, jq, kind): added linux-x64 lock entries so the deploy job’s mise install doesn’t hit GitHub API rate limits resolving them.
deploy-workflow component image overrides (extends the capi/runners flag-skew fix above): added capi_image_tag + runners_controller_image_tag inputs to server-deployment.yml so a deploy can pin a per-SHA build of either component instead of resolving the stale @-semver release.
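
The address-type fallthrough in the apiserver item above can be modelled as "first preferred type the node actually reports wins". The sketch below assumes that selection rule; the node's address values (`172.16.0.4`, `kura-em-0`, and the 203.0.113.7 documentation address) are hypothetical.

```python
OLD_ORDER = ["ExternalIP", "Hostname", "InternalDNS", "ExternalDNS"]
NEW_ORDER = ["ExternalIP", "InternalIP", "Hostname", "InternalDNS", "ExternalDNS"]


def pick_kubelet_address(preferred_types, node_addresses):
    """Return (type, address) for the first preferred type the node reports."""
    for address_type in preferred_types:
        if address_type in node_addresses:
            return address_type, node_addresses[address_type]
    return None


# Hypothetical cross-cloud node: reachable InternalIP, no ExternalIP,
# and a Hostname the apiserver cannot resolve.
em_node = {"InternalIP": "172.16.0.4", "Hostname": "kura-em-0"}

with_old_order = pick_kubelet_address(OLD_ORDER, em_node)
with_new_order = pick_kubelet_address(NEW_ORDER, em_node)

# A node that reports an ExternalIP is picked the same way under both orders.
external_node = {"ExternalIP": "203.0.113.7", "InternalIP": "10.0.0.5"}
unchanged = (pick_kubelet_address(OLD_ORDER, external_node)
             == pick_kubelet_address(NEW_ORDER, external_node))
```

Under the old order the sketch lands on the unresolvable Hostname; inserting InternalIP after ExternalIP changes the outcome only for nodes that lack an ExternalIP.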

Pre-merge decisions / follow-ups

Taint the Scaleway runner-cache node tuist.dev/runner-cache=true:NoSchedule: done declaratively (the reconciler registers the node with the taint via cloud-init --register-with-taints). The cache pods’, scaleway-csi’s, and kubelet’s matching tolerations all ship here.
Per-environment provider IAM + PN: PrivateNetworksFullAccess + IPAMReadOnly (+ BlockStorage) verified live on the staging, canary, and production provider keys; a runner-cache Private Network per env (172.16.0.0/22, each in that env’s own Scaleway project), resolved by name tuist-runner-cache (the operator find-or-creates it), so no UUID is pinned in values.
Canary / production macOS-cache cutover: scaleway-csi installed with the runner-cache toleration on both clusters; kuraFleet.enabled: true for both, so each env orders its node, self-joins, and binds to system:node. With no customers on the runners yet, the macOS hot path is flipped on for canary and production (not just staging): scw-fr-par-runners advertised in TUIST_KURA_AVAILABLE_REGIONS, TUIST_RUNNERS_CLUSTER_NETWORK_PLATFORMS=linux,macos, and the minis attached via macosFleet.vmCachePrivateNetwork. Each env provisions in-cluster onto its own EM node.
Per-cluster prerequisites for the canary / production cutover (not in the deploy pipeline; until present, dispatch returns no endpoint and builds fall back to the public cache, so the cutover degrades rather than breaks): (1) apply the scw-local-nvme StorageClass once the EM node joins, kubectl apply -f infra/k8s/mgmt/bootstrap/local-path-provisioner.yaml; (2) rename each env’s runner-cache PN to tuist-runner-cache. Runbook: infra/cluster-api-provider-scaleway-applesilicon/docs/scaleway-elastic-metal-support.md.
Scope the tart-kubelet ClusterRole back to the macOS shim once the one legacy staging node (still bound to it) is recycled; new Linux nodes use system:node.
Drop the now-obsolete tailnet pieces (Connector proxyClass pin, accept-routes/Service-CIDR advertisement for macs, MSS-clamp host rules) or keep them as fallback, since NodePort made the tailnet cache path redundant.
Keep or remove the smoke workflow file (the Linux PR removed its scaffolding pre-merge; durable tooling lives in tuist/cache-benchmark#2).
Cut kura-controller release tags post-merge and drop the staging image-tag pins.
Flip the kubeletPreferredAddressTypes ClusterClass variable per env (staging → canary → production) to insert InternalIP; each flip rolls only that env’s control plane. The mechanism is committed and applies as a no-op; after staging’s flip, confirm kubectl logs/exec work against EM + macOS PN node pods. To flip: set kubeletPreferredAddressTypes: ExternalIP,InternalIP,Hostname,InternalDNS,ExternalDNS under that Cluster CR’s spec.topology.variables.
Make the KuraInstance controller recreate the StatefulSet on a storageClassName change (the scw-bssd→scw-local-nvme migration was manual).
Auto-retry stuck-:failed runner-cache nodes (today only current_image_tag == nil self-heals).

🤖 Generated with Claude Code

Comments

tuist-atlas[bot] Jun 19, 2026

Private Network + NodePort data plane for the Scaleway macOS runner cache is now available in xcresult-processor-image@0.26.0. Update to this version to use it.

tuist-atlas[bot] Jun 19, 2026

The Private Network + NodePort data plane for the Scaleway macOS runner cache is now available in runners-controller@0.13.0. Update to ghcr.io/tuist/tuist-runners-controller:0.13.0 to use this feature.

tuist-atlas[bot] Jun 19, 2026

The Private Network + NodePort data plane for the Scaleway macOS runner cache is now available in runner-image@0.7.0. Update to the new runner image tags (ghcr.io/tuist/tuist-runner:macos-26-5-0.7.0, ghcr.io/tuist/tuist-runner:macos-26-4-1-0.7.0, or ghcr.io/tuist/tuist-runner:macos-26-3-0.7.0) to use this feature.

tuist-atlas[bot] Jun 19, 2026

Private Network + NodePort data plane for the Scaleway macOS runner cache is now available in capi-scaleway@0.9.0. Update to ghcr.io/tuist/capi-provider-scaleway-applesilicon:0.9.0 to use this feature.