feat(kura): Private Network + NodePort data plane for the Scaleway macOS runner cache
GitHub issue · Closed
What
macOS Tart runner VMs on the Scaleway Mac mini fleet now consume their per-account kura cache nodes over a Scaleway Private Network, dialing a per-account NodePort published by the node hosting their cache pod. Dispatch hands the VM http://<node PN address>:<account NodePort> instead of a cluster-DNS URL, so the runner’s read/write hot path stays inside Scaleway fr-par.
Those per-account Scaleway nodes are full members of the account’s Kura mesh: cache content replicates between the fr-par runner-cache node and the account’s Hetzner nodes in the background, under a per-account CA, so the cache stays coherent across regions while the hot path stays node-local.
Builds on #10982 (Linux runner cache), which left macOS runners reaching a Hetzner kura node across the WAN via the tailnet.
Topology (what we landed on)
Two planes, deliberately separated. The runner’s hot path (read/write on every cache operation) is PN-local in fr-par. Replication (keeping the account’s mesh coherent) runs in the background, bandwidth-limited, over the cluster overlay:
flowchart LR
subgraph mini["Mac mini (runners-fleet · fr-par)"]
vm["macOS Tart VM<br/>PN 172.16.0.3 · VLAN 2319"]
end
subgraph scw["Scaleway fr-par"]
pod["per-account runner-cache pod<br/>node 172.16.0.2 · SBS · egress cap 750M"]
end
subgraph htz["Hetzner umbrella cluster"]
eu["account's eu-central pod"]
stg["account's staging pod"]
end
vm ==>|"HOT PATH: PN NodePort http://172.16.0.2:<port><br/>fr-par-local · 5.11x cold / 2.27x warm vs public"| pod
pod <-.->|"REPLICATION: account peer mesh<br/>per-account-CA mTLS · bandwidth-limited · cluster overlay (WAN)"| eu
pod <-.-> stg
Hot path (PN, fr-par-local). Dispatch hands the VM Server.url (http://<node PN IP>:<account NodePort>), which the reconciler re-stamps every converged tick so the endpoint follows the pod if it reschedules. The NetworkPolicy admits the PN subnet via an ipBlock (NodePort clients arrive with their real source IP under externalTrafficPolicy: Local, matching no namespaceSelector), and the Cilium bandwidth manager caps each tenant’s egress on the shared node NIC.
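The ipBlock admission described above can be sketched as a plain CIDR membership test. This is a minimal illustration, not the NetworkPolicy implementation: the function name is hypothetical, the CIDR is the runner-cache PN range quoted in the follow-ups (172.16.0.0/22), and the second address is a made-up example of what a source-NATed client might look like.

```python
import ipaddress

# Assumption: the runner-cache PN CIDR quoted later in this description.
PN_CIDR = ipaddress.ip_network("172.16.0.0/22")


def admitted_by_ip_block(source_ip: str, cidr=PN_CIDR) -> bool:
    """Return True when source_ip falls inside the ipBlock CIDR."""
    return ipaddress.ip_address(source_ip) in cidr


# With externalTrafficPolicy: Local the pod sees the VM's real PN address.
print(admitted_by_ip_block("172.16.0.3"))
# Hypothetical SNAT'd source address, outside the PN range.
print(admitted_by_ip_block("10.0.1.7"))
```

The point of the sketch is that the rule only stays meaningful if the source address survives to the pod, which is why the policy is paired with externalTrafficPolicy: Local.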
Replication plane (mesh, over the WAN). Each account’s pods (the fr-par runner-cache node plus its Hetzner nodes) discover each other through the per-account headless Service (KURA_GLOBAL_DISCOVERY_DNS_NAME = kura-<account>-peers...) and replicate over the peer port with mutual TLS. It rides the Cilium node-to-node overlay across providers, is bandwidth-limited, and is fully off the hot path: a build’s cache round-trips never wait on it. So cache content does cross the WAN (correcting the earlier “never leaves fr-par” framing), but asynchronously and rate-limited, never on the latency-sensitive read/write.
This is the macOS counterpart to the Linux runner cache from #10982. It follows the same “co-locate the per-account kura pod next to the fleet” idea, with a different hot-path data plane because the runtimes sit on different networks:
| | Linux fleet (#10982) | macOS fleet (this PR) |
|---|---|---|
| Runner runtime | kata Pod on the cluster pod network | Tart VM on the Scaleway PN |
| Cache node pool | co-located Linux pool (Hetzner) | co-located kura-scw-fr-par (Scaleway fr-par) |
| Hot-path data plane | ClusterIP DNS <instance>.kura.svc.cluster.local:4000 | NodePort over the PN |
| Endpoint dispatch hands out | stable Service URL | http://<node PN IP>:<NodePort>, re-stamped as the pod moves |
| Source-IP / NetworkPolicy | in-cluster namespaceSelector | externalTrafficPolicy: Local + ipBlock (PN CIDR) |
| Replication | in-cluster peer mesh | same peer mesh, now spanning fr-par↔Hetzner per account |
Both fleets converge through the same runner_cache.ex reconciler and the same runner_cache_endpoint_url/2 handoff; the region spec (data_plane, client_cidrs, pod_annotations, runner_platforms, mesh) is what makes one region cluster-DNS and another NodePort, so adding a locality is a values change, not a code change.
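As a rough illustration of how a region spec can select the endpoint shape, here is a sketch in Python rather than the Elixir of runner_cache.ex. The class names, field names, and the http scheme on the cluster-DNS URL are assumptions for illustration; only the two URL shapes come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegionSpec:
    # Illustrative values: "cluster_dns" or "node_port".
    data_plane: str


@dataclass
class ObservedStatus:
    # What the controller reports once the NodePort chain is observed.
    node_address: Optional[str] = None
    node_port: Optional[int] = None


def endpoint_url(instance: str, spec: RegionSpec, status: ObservedStatus) -> Optional[str]:
    """Return the URL dispatch would hand out, or None if not yet resolvable."""
    if spec.data_plane == "node_port":
        if status.node_address is None or status.node_port is None:
            return None  # activation waits until both are observed
        return f"http://{status.node_address}:{status.node_port}"
    # Cluster-DNS regions get a stable Service URL (scheme assumed).
    return f"http://{instance}.kura.svc.cluster.local:4000"


print(endpoint_url("acme", RegionSpec("node_port"), ObservedStatus("172.16.0.2", 30985)))
print(endpoint_url("acme", RegionSpec("cluster_dns"), ObservedStatus()))
```

Because the branch is driven by data rather than by platform checks, a new locality in this model only needs a new spec value.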
Per-account CA isolation
Cross-region peering is gated on the new mesh flag, set on every managed and private region (single cluster, so in-cluster discovery spans all of an account’s nodes with no cross-cluster gateway needed). The controller maintains a per-account CA (kura-<account>-peer-ca) and signs each instance’s peer leaf from it, with SANs covering the account peer Service (the SNI the replication client verifies). This is both the isolation boundary and a functional requirement: a leaf signed by one account’s CA is rejected at the TLS handshake by another account’s pods, and the previous auto-issued per-instance self-signed CA could not have authenticated an account’s own instances to each other. Control-plane auth/usage still goes to KURA_CONTROL_PLANE_URL over the public path; that is unchanged.
Why this design
The Mac VMs are not on the cluster’s pod network, so the Linux fleet’s ClusterIP path can’t serve them. Three candidate data planes:
- Tailnet subnet router (the interim this branch started with): works, but adds WireGuard crypto on the host, an MTU-1280/MSS-1200 clamp, rides the minis’ metered public bandwidth, and funnels through one router pod.
- Node-as-router over the PN: needs bpf-lb-external-clusterip=true flipped cluster-wide (Cilium KPR doesn’t translate ClusterIPs for externally-arriving traffic), and a single routed gateway caps the entire fleet at one node’s NIC.
- NodePort over the PN (chosen): no cluster-wide Cilium change, client source IPs survive (externalTrafficPolicy: Local) so NetworkPolicies stay meaningful, and it scales out, since each account’s endpoint follows its pod’s node so adding nodes adds aggregate bandwidth with no gateway bottleneck.
The M4 fleet move decides it: Scaleway gives M4-M/M4-Pro minis 10 Gb/s on the Private Network included, vs 1 Gb/s public by default (M1/M2 are 1/1). Instance-side PN bandwidth is a separate budget from public, so vxlan/control traffic never competes with cache reads. The PN is simultaneously the isolated path and the fast path.
How
- kura-controller: KuraInstance gains exposeNodePort (renders an <instance>-external NodePort Service pinned to the primary pod, allocated ports preserved across reconciles), clientCIDRs (NetworkPolicy ipBlock), podAnnotations (per-region kubernetes.io/egress-bandwidth caps), and mesh (controller-managed per-account peer CA + account-scoped leaf SANs; flips crossRegionRuntimeEnabled so KURA_GLOBAL_DISCOVERY_DNS_NAME and the peer mesh turn on). Status reports nodeAddress (the node’s tuist.dev/pn-ipv4 label) + allocated NodePorts.
- server: the scw-fr-par-runners region declares data_plane: :node_port, the PN client CIDR, and a 750M egress cap; every managed and private region sets mesh: true. Activation waits for the observed node-port chain the way it waits for public DNS, and a refresh on every converged reconciler tick re-stamps Server.url when the primary pod moves nodes. Dispatch itself is untouched (it already serves Server.url).
- infra: scaleway-csi chart values + ExternalSecret (per-env SCALEWAY_API 1P item) providing the scw-bssd StorageClass; hcloud-csi’s node plugin skips the kura-scw-fr-par pool. The deploy now builds + pins the capi and runners-controller images per deploy-SHA (see Bugs).
Bugs found during the staging rollout (fixed here)
- CRD pruning + stale manifest revision (34c5f3ccf7): helm never upgrades crds/ on upgrade, so instances created against the old CRD had the new spec fields silently pruned, and a matching manifest-revision annotation meant they were never re-applied. Revision bumped so every instance converges declaratively once the CRD lands.
- Nodes RBAC in a namespaced Role (ea5e2a7788): Nodes are cluster-scoped; controller-runtime’s Node informer got “nodes is forbidden” and every KuraInstance reconcile stalled. Now a dedicated ClusterRole.
- Server SA was a helm hook (10d07f737d): before-hook-creation deleted and recreated the tuist-server ServiceAccount on every deploy, invalidating every running pod’s projected token for up to a token-refresh period (kura-reconciler 401s, and the runner-dispatch TokenReview path broke after every deploy). The migration Job now has its own hook-managed SA; the server SA is a plain release resource with resource-policy: keep. (Superseded by main’s evolution of the same fix during the rebase.)
- capi / runners-controller flag-vs-image skew (bbb861eedc, 3bb053f80bd): the deploy resolved these two component images from the latest <component>@semver tag while deploying the chart at the branch SHA. This branch adds new manager flags to both (--vm-cluster-dns-ip / --vm-kura-egress-cidr on capi; -cluster-dns-ip / -cluster-domain on runners-controller), so the stale released binaries hit flag provided but not defined, os.Exit(2), CrashLoopBackOff. helm --wait then failed and --atomic rolled the whole release back (it ate two staging deploys before this was understood). Fix: build and pin both images per deploy-SHA, exactly as the server and kura-controller images already are, so the binary and the flags the chart passes it always come from one commit. This eliminates the skew class for good; the @-semver release path stays for tracking only.
Declarative provisioning (found by driving it live on staging)
The runner-cache node is now ordered declaratively by the in-house Scaleway CAPI provider (a new ScalewayInstanceMachine kind + kura-fleet MachineDeployment), instead of being hand-joined. Bringing that path up end-to-end surfaced a chain of bugs, each fixed here:
- Provision/delete not idempotent (346e0219e1f): CreateInstance bundled create + PN-attach + power-on, so a partial failure stranded a half-configured paid instance, and DeleteInstance unconditionally powered off (Scaleway rejects that on an already-stopped server, wedging finalizer removal). Split into find-or-create + idempotent EnsurePrivateNIC / EnsurePoweredOn (ServerID recorded before them, providerID only after both succeed) and a stop-then-delete state machine tolerant of already-stopped / already-gone.
- No public IP (87761e20a35): the instance came up with only its PN NIC, so cloud-init couldn’t pull kubelet/containerd or reach the externally-managed control plane and the node sat in Bootstrapping forever. Request a dynamic public IPv4; cache traffic still rides the PN.
- cloud-init aborted under dash (7b365e10711): runcmd runs under /bin/sh (dash), which rejects set -o pipefail, so the whole bootstrap aborted on its first line and kubelet was never installed. Moved the bootstrap into a bash script invoked via runcmd: [bash, ...].
- Kubelet RBAC incomplete (6ea62c378d1, eacf38286b9, 770df4c5eb0): the per-node identity bound to the chart’s tart-kubelet ClusterRole, which only carries what the macOS shim exercises. A real Linux kubelet was forbidden its heartbeat Lease and Services (every system pod failed CreateContainerConfigError "services have not yet been read"), then PVC/PV/volumeattachments and csidrivers once cache pods landed. Added the missing perms, then bound the Linux identity to system:node for good (5c3d7ed66ba, see follow-ups).
- pn-ipv4 never resolved (d9f8cb379ef, 6674f866b37, 868c67a90ba): the IPAM ListIPs call set both Zonal and PrivateNetworkID (the API wants exactly one), filtered by the lowercase Instance MAC while IPAM stores it uppercase, returned the NIC’s first address (the IPv6), depended on GetServer populating PrivateNics, and omitted the regional Region field, all under a swallowed error so the label silently never landed. Now lists by PN + Region, matches the row by resource name or MAC (case-insensitive), returns the IPv4, requeues until stamped, and logs the outcome.
- scaleway-csi didn’t tolerate the runner-cache taint (8356026f832): its node DaemonSet never scheduled on the tainted fleet node, so the node’s CSINode registered no scaleway driver and the scw-bssd cache PVCs couldn’t attach there. Added the toleration.
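The pn-ipv4 fix above comes down to three matching rules: match by resource name or MAC, compare MACs case-insensitively, and return the IPv4 rather than the first address. A minimal sketch of that selection logic follows, in Python rather than the provider's Go. The row shape (resource_name, mac, address keys) and the function name are hypothetical and do not reflect the Scaleway IPAM SDK.

```python
import ipaddress
from typing import Optional


def resolve_pn_ipv4(rows, resource_name: str, mac: str) -> Optional[str]:
    """Pick the IPv4 address of the row matching by name or MAC.

    rows: list of dicts with hypothetical keys resource_name, mac, address.
    Returns None when nothing matches, so the caller can requeue.
    """
    for row in rows:
        same_name = row.get("resource_name") == resource_name
        # IPAM stores the MAC uppercase, the Instance API lowercase.
        same_mac = (row.get("mac") or "").lower() == mac.lower()
        if not (same_name or same_mac):
            continue
        # ip_interface accepts addresses with or without a prefix length.
        iface = ipaddress.ip_interface(row["address"])
        if iface.version == 4:
            return str(iface.ip)
    return None


rows = [
    {"resource_name": "other", "mac": "AA:BB:CC:00:00:01", "address": "172.16.0.9/22"},
    {"resource_name": "node", "mac": "DE:AD:BE:EF:00:01", "address": "fd5f::1/64"},
    {"resource_name": "node", "mac": "DE:AD:BE:EF:00:01", "address": "172.16.0.4/22"},
]
print(resolve_pn_ipv4(rows, "unknown", "de:ad:be:ef:00:01"))
```

Returning None instead of raising keeps the "requeue until stamped" behaviour explicit in the caller, and the caller is where the outcome would be logged.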
Declarative node provisioning (replaces the hand-joined stand-in)
The kura-scw-fr-par node is provisioned declaratively by the cluster-api-provider-scaleway-applesilicon operator. A new ScalewayInstanceMachine kind orders a regular Scaleway Instance (PRO2-S) with a dynamic public IP, attaches it to the runner-cache PN, mints a per-machine kubelet identity, and renders a self-join cloud-init that installs containerd/kubelet and registers the node (this cluster’s control plane is externally managed, so there is no CAPI kubeadm join). A kura-fleet MachineDeployment drives it with the runner-cache taint and pool label; the controller stamps the foreign providerID and the tuist.dev/pn-ipv4 label. scaleway-csi and the node kubelet RBAC both tolerate the runner-cache taint so the cache pods’ scw-bssd volumes attach on the node.
Validated end-to-end on staging: the declarative node joined and reached Ready with providerID + tuist.dev/pn-ipv4 + pool label + taint, all 11 per-account cache StatefulSets migrated onto it (volumes re-attached via scaleway-csi, every KuraInstance Ready and meshed), and the original hand-joined PRO2-S was drained and its instance terminated.
Two operational prerequisites live in Scaleway IAM (not the repo) and must be granted to each environment’s provider key: PrivateNetworksFullAccess (attach the instance to the PN) and IPAMReadOnly (read the PN address for the pn-ipv4 label).
Cache node sharing: pod-per-customer on a shared node (for now)
We run one cache pod per customer, but co-located on a shared Scaleway node, not a dedicated node per customer. The cache pod is I/O-bound and light on CPU/RAM (the Hetzner kura nodes run on 2 vCPU ccx13), so one PRO2-S comfortably hosts many per-account pods. Sharing amortises the node: one ~EUR160/mo PRO2-S serves all accounts, and adding a customer adds only a ~EUR4/mo cache volume.
The trade-off is bandwidth. Scaleway gives each instance a dedicated, guaranteed per-connection allocation (unlike Hetzner Cloud, whose cloud VMs are best-effort shared at ~300-500 Mbps), but it’s coupled to instance size: PRO2-S is 1.5 Gbps to the PN, separate from its 1.5 Gbps internet. On a shared node that 1.5 Gbps PN is split across the co-located pods behind a 750M per-tenant egress cap, so roughly 2 tenants at full cache-read saturate the node; the fleet-spread controller distributes pods as the pool grows.
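The "roughly 2 tenants" figure is simple division. A small worked sketch follows, assuming the 750M cap means 750 Mbit/s and that contending tenants split the NIC evenly; the even split is a simplification, not a measured property of the Cilium bandwidth manager.

```python
PN_MBPS = 1500          # PRO2-S Private Network allocation (1.5 Gbps)
TENANT_CAP_MBPS = 750   # per-tenant egress cap from the region spec


def tenants_to_saturate(nic_mbps: float, cap_mbps: float) -> float:
    """How many tenants at their full cap fill the NIC."""
    return nic_mbps / cap_mbps


def fair_share_mbps(nic_mbps: float, cap_mbps: float, active: int) -> float:
    """Per-tenant throughput if `active` tenants split the NIC evenly."""
    return min(cap_mbps, nic_mbps / active)


print(tenants_to_saturate(PN_MBPS, TENANT_CAP_MBPS))
for n in (1, 2, 4, 8):
    print(n, fair_share_mbps(PN_MBPS, TENANT_CAP_MBPS, n))
```

Under these assumptions the cap only binds for the first two concurrent readers; beyond that the node NIC, not the cap, sets each tenant's share.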
This is a values-level decision, reversible per environment. Moving to a dedicated node per (enterprise) customer is a config change, not a rewrite: the region spec and per-account scheduling already model pod-per-customer, so we would raise kuraFleet replicas / add a per-customer pool and pin each account’s pod. Scaleway economics are ~EUR100/Gbps/month either way: a dedicated PRO2-S gives a customer the full uncontended 1.5 Gbps (~EUR165/mo), or a POP2-HN (“High Network”) instance scales to 5 Gbps on lighter compute (~EUR495/mo). To know when that move is worth it, this PR adds host-NIC bandwidth analytics (the “Node Bandwidth” dashboard panels + the hostMetrics node_network_* allow-list) so we can watch the shared node’s PN throughput against the Hetzner kura nodes.
Update: measured bare metal vs PRO2-S (supersedes the ~EUR100/Gbps speculation above)
We benchmarked this directly on Scaleway (throwaway staging boxes, PRO2-S vs an EM-B220E-NVME Elastic Metal box, same client, all over the PN):
| Over the PN | PRO2-S (today) | Elastic Metal EM-B220E |
|---|---|---|
| Sustained throughput | 1.52 Gbit/s (after a ~5 Gbit/s / ~10s burst, then a hard token-bucket clamp) | 6.7 Gbit/s flat, no clamp (capped by the 6.4G client; real ceiling ~10 Gbit/s) |
| Disk read (cold) | scw-bssd 523 MB/s (4.2 Gbit/s) | local NVMe 3442 MB/s (27.5 Gbit/s) |
| Realistic 6 GB parallel pull, cold / warm | 2.78 / 2.05 Gbit/s | 4.68 / 4.67 Gbit/s |
| Price | ~EUR163/mo, 32 GB RAM | ~EUR120/mo, 64 GB RAM |
Two things the spec sheet hides. First, the PRO2-S PN is token-bucketed: a single bursty cache fetch rides ~5 Gbit/s, but sustained / multi-tenant load clamps to 1.52 Gbit/s. Second, the disk is not the PRO2-S bottleneck (scw-bssd’s 4.2 Gbit/s sits above the PN), so NVMe only matters once the NIC can outrun it, which is exactly the 10G Elastic Metal regime. Net: Elastic Metal is ~1.7-2.3x on a single bursty pull and ~4.4-6.5x under sustained shared load, with no burst-credit dependency, and it is cheaper.
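The speedup ranges can be recomputed from the table. The sketch below assumes the bursty-pull range comes from the 6 GB parallel pull row and the sustained range from the measured 6.7 Gbit/s and the estimated ~10 Gbit/s ceiling. With a ceiling of exactly 10 the upper bound computes to about 6.6 rather than 6.5, which is within the precision of an approximate ceiling.

```python
# Figures copied from the benchmark table, in Gbit/s.
PRO2_S = {"sustained": 1.52, "pull_cold": 2.78, "pull_warm": 2.05}
ELASTIC_METAL = {"sustained": 6.7, "ceiling": 10.0, "pull_cold": 4.68, "pull_warm": 4.67}


def speedup(em_gbps: float, pro2s_gbps: float) -> float:
    """Elastic Metal throughput relative to PRO2-S, rounded to one decimal."""
    return round(em_gbps / pro2s_gbps, 1)


burst_range = (
    speedup(ELASTIC_METAL["pull_cold"], PRO2_S["pull_cold"]),
    speedup(ELASTIC_METAL["pull_warm"], PRO2_S["pull_warm"]),
)
sustained_range = (
    speedup(ELASTIC_METAL["sustained"], PRO2_S["sustained"]),
    speedup(ELASTIC_METAL["ceiling"], PRO2_S["sustained"]),
)
print("single bursty pull:", burst_range)
print("sustained shared load:", sustained_range)
```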
So the direction this PR takes (now landed; see the closing “Elastic Metal pivot landed” update) is to put the cache node on Scaleway Elastic Metal rather than a bigger instance: a ScalewayElasticMetalMachine kind in the in-house CAPI provider plus a local-NVMe storage class (Elastic Metal can’t attach scw-bssd, but a regenerable cache wants fast local NVMe anyway). The shared pod-per-customer model is unchanged; the node underneath just gains a 10G NIC and local NVMe. A box like EM-B220E (64 GB RAM, 2x1 TB NVMe, 10G) realistically holds ~15-20 customer pods before RAM (warm-set page cache) or local disk binds, with the PN comfortably oversubscribing bursty CI load; a larger box (EM-B320E / EM-I220E) scales that toward ~40.
Sizing the bare-metal node. Optimize for RAM (cgroup-charged page cache = each tenant’s warm set and hit rate) and local NVMe capacity (per-account volumes) — not PN bandwidth or CPU, which a 10G/25G box oversubscribes for a light, bursty cache. By €/pod (~EUR5.5-6.5 across the balanced boxes) the Iridium line is the sweet spot: RAM-rich, balanced NVMe, 25G PN headroom. So production runs EM-I220E (128 GB RAM / 1.92 TB NVMe / 25G PN, ~EUR230/mo, ~30-40 tenants; run 2+ for failure isolation, step up to EM-I320E when a node fills), and staging/canary run EM-B220E (~EUR120/mo, the cheapest in-stock box that still mirrors the production NVMe + 10G path for validation). Set per-env via kuraFleet.machine.offerType. The binding constraint is RAM + disk, not bandwidth: scale out (add nodes) when those fill.
Validation
All of the below ran against the branch deployed to staging (helm rev 292, all of server / kura-controller / capi / runners-controller on the same per-SHA image).
- kura-controller go test ./... (incl. new per-account-CA tests: leaf signed by account CA, CA shared across an account’s instances, cross-account leaf rejected); server kura + runners-controller suites.
- Mesh-wide live: 11 per-account CA secrets + 11 account peer Services; the fr-par runner-cache node discovers all 4 of an account’s peers, completes bootstrap, applies replicated artifacts, and keeps its push outbox drained to 0. Sampled accounts have distinct CA fingerprints (cross-account isolation holds on the live cluster).
- macOS-over-PN combined smoke 27554115810: real dispatch-claimed macOS runner VM, routed phase pinned to the PN NodePort http://172.16.0.2:30985, real Gradle build-cache traffic:

  | Phase (full Gradle build wall-clock) | Routed (PN) | Baseline (public) | Speedup |
  |---|---|---|---|
  | Cold (local caches purged) | 4.52 s median | 23.11 s | 5.11x |
  | Warm | 4.08 s median | 9.25 s | 2.27x |

- Per-request latency from the mini: PN NodePort /up p50 2.3 ms vs public path p50 70 ms / p90 147 ms, roughly 30x per round-trip; cold-build savings approximate that delta times the build’s serialized cache round-trips.
- PN throughput mini→node: 117 MB/s = M2 NIC line rate (M4 raises the PN ceiling to 10 Gb/s).
- Declarative provisioning + migration (live, staging): the kura-fleet MachineDeployment ordered the PRO2-S, the controller attached the PN + public IP, the bash cloud-init self-joined, and the node reached Ready with providerID + tuist.dev/pn-ipv4: 172.16.0.4 + pool label + taint. All 11 per-account cache StatefulSets then drained off the hand-joined node onto the declarative node (volumes re-attached, all KuraInstances Ready and meshed), and the hand-joined instance was terminated.
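The per-request latency bullet says the wall-clock savings approximate the per-round-trip delta times the number of serialized cache round-trips. Inverting that gives a back-of-envelope estimate of how many serialized round-trips the smoke build implies. It uses p50 latencies only and ignores parallel requests and transfer time, so treat the result as an order of magnitude, not a measurement.

```python
# Medians from the smoke table (seconds) and p50 latencies (milliseconds).
COLD = {"routed_s": 4.52, "baseline_s": 23.11}
WARM = {"routed_s": 4.08, "baseline_s": 9.25}
P50_PN_MS = 2.3
P50_PUBLIC_MS = 70.0


def implied_round_trips(phase: dict) -> float:
    """Wall-clock saving divided by the per-round-trip latency delta."""
    saved_s = phase["baseline_s"] - phase["routed_s"]
    delta_s = (P50_PUBLIC_MS - P50_PN_MS) / 1000.0
    return saved_s / delta_s


print(f"per-round-trip ratio: {P50_PUBLIC_MS / P50_PN_MS:.1f}x")
print(f"cold: ~{implied_round_trips(COLD):.0f} serialized round-trips")
print(f"warm: ~{implied_round_trips(WARM):.0f} serialized round-trips")
```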
Elastic Metal pivot landed + macOS runner-cache validated e2e on staging
The bare-metal pivot described above is no longer a follow-up — it landed in this PR and is serving on staging. The PRO2-S provisioning and validation sections above are the now-historical stand-in that preceded it (kept for the reasoning trail).
What landed. The in-house Scaleway CAPI provider gained a ScalewayElasticMetalMachine kind (orders an Elastic Metal box, enables then attaches the PN as a server option, self-joins via cloud-init), driven by the kura-fleet MachineDeployment. Elastic Metal can’t attach scw-bssd, so the cache’s scw-local-nvme StorageClass is backed by a local-path provisioner on the box’s NVMe. Box per env via kuraFleet.machine.{kind: elasticMetal, offerType}: staging/canary EM-B220E-NVME, production EM-I220E.
Validated e2e on staging (our own tuist account, region scw-fr-par-runners), after fixing the bugs below:
- EM node provisions declaratively (order → PN-option enable → PN attach VLAN → self-join → Ready with clusterDNS + tuist.dev/pn-ipv4 + runner-cache taint) via the in-cluster operator.
- Cache pod 1/1 on the EM node, backed by local NVMe, state:serving.
- Meshes with all 4 of the account’s Hetzner peers (eu-central-1 ×3 + staging-0), writer_lock_owned: cross-cloud (Scaleway↔Hetzner) replication over the Cilium overlay.
- Server marks the Kura.Server :active; runner_cache_endpoint_url(account, :macos) returns the node-port URL http://<node PN IP>:<NodePort>, and the cache answers 200 on that exact PN endpoint.
- The staging runners smoke (runners-staging-smoke.yml) now asserts this from inside a real Tart VM (curl $TUIST_CACHE_ENDPOINT/up over the PN).
Bugs found driving the EM pivot live (fixed here):
- local-path-provisioner missing pods: create/delete RBAC: the provisioner sets up each PV’s hostPath via a short-lived helper Pod; without create/delete on pods every provision failed with a 403 and the cache PVCs hung Pending, so the pod never scheduled onto the EM node.
- apiserver kubelet-preferred-address-types had no InternalIP: it was ExternalIP,Hostname,InternalDNS,ExternalDNS. Cross-cloud nodes (Elastic Metal + the macOS PN fleet) report a reachable InternalIP but no ExternalIP and a Hostname the Hetzner apiserver can’t resolve, so it fell through to the Hostname and kubectl logs / exec to those nodes’ pods failed (no such host). Delivered as a ClusterClass kubeletPreferredAddressTypes variable + patch, because CAPI rejects in-place KubeadmControlPlaneTemplate edits (immutable spec). The default equals the current value, so it applies as a verified no-op (every live control plane already carries it); inserting InternalIP is then a deliberate per-env variable flip that rolls only that env’s control plane.
- KuraInstance storage-class migration isn’t declarative (finding): a StatefulSet’s volumeClaimTemplates are immutable and the controller updates in place, so flipping a live instance’s storageClassName (scw-bssd → scw-local-nvme) silently no-ops; it needs a StatefulSet delete+recreate (done manually here).
- stuck-:failed runner-cache nodes aren’t auto-retried (finding): nodes_to_retry only self-heals servers with current_image_tag == nil, so a node that deployed then failed is never re-provisioned; cleared by an operator reset during bring-up.
- mise lockfile gaps for the staging deploy (pomerium/cli, jq, kind): added linux-x64 lock entries so the deploy job’s mise install doesn’t hit GitHub API rate limits resolving them.
- deploy-workflow component image overrides (extends the capi/runners flag-skew fix above): added capi_image_tag + runners_controller_image_tag inputs to server-deployment.yml so a deploy can pin a per-SHA build of either component instead of resolving the stale @-semver release.
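The kubelet-preferred-address-types fall-through is easy to see in a small model: the apiserver walks the preferred types in order and uses the first one the node reports. The sketch below is an illustration of that selection, not the apiserver's code; the node addresses are hypothetical.

```python
from typing import Optional

BEFORE = ["ExternalIP", "Hostname", "InternalDNS", "ExternalDNS"]
AFTER = ["ExternalIP", "InternalIP", "Hostname", "InternalDNS", "ExternalDNS"]

# Hypothetical cross-cloud node: a reachable InternalIP, no ExternalIP,
# and a Hostname the control plane cannot resolve.
NODE_ADDRESSES = [
    {"type": "InternalIP", "address": "172.16.0.4"},
    {"type": "Hostname", "address": "em-node-0"},
]


def pick_kubelet_address(preferred_types, node_addresses) -> Optional[dict]:
    """Return the first node address whose type appears earliest in the preference list."""
    for addr_type in preferred_types:
        for addr in node_addresses:
            if addr["type"] == addr_type:
                return addr
    return None


print(pick_kubelet_address(BEFORE, NODE_ADDRESSES))
print(pick_kubelet_address(AFTER, NODE_ADDRESSES))
```

With the old list the Hostname wins because InternalIP is never considered; placing InternalIP ahead of Hostname makes the reachable address win, while nodes that do report an ExternalIP keep their current behaviour.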
Pre-merge decisions / follow-ups
- Taint the Scaleway runner-cache node tuist.dev/runner-cache=true:NoSchedule: done declaratively (the reconciler registers the node with the taint via cloud-init --register-with-taints). The cache pods’, scaleway-csi’s, and kubelet’s matching tolerations all ship here.
- Per-environment provider IAM + PN: PrivateNetworksFullAccess + IPAMReadOnly (+ BlockStorage) verified live on the staging, canary, and production provider keys; a runner-cache Private Network per env (172.16.0.0/22, each in that env’s own Scaleway project), resolved by name tuist-runner-cache (the operator find-or-creates it), so no UUID is pinned in values.
- Canary / production macOS-cache cutover: scaleway-csi installed with the runner-cache toleration on both clusters; kuraFleet.enabled: true for both, so each env orders its node, self-joins, and binds to system:node. With no customers on the runners yet, the macOS hot path is flipped on for canary and production (not just staging): scw-fr-par-runners advertised in TUIST_KURA_AVAILABLE_REGIONS, TUIST_RUNNERS_CLUSTER_NETWORK_PLATFORMS=linux,macos, and the minis attached via macosFleet.vmCachePrivateNetwork. Each env provisions in-cluster onto its own EM node.
- Per-cluster prerequisites for the canary / production cutover (not in the deploy pipeline; until present, dispatch returns no endpoint and builds fall back to the public cache, so the cutover degrades rather than breaks): (1) apply the scw-local-nvme StorageClass once the EM node joins, kubectl apply -f infra/k8s/mgmt/bootstrap/local-path-provisioner.yaml; (2) rename each env’s runner-cache PN to tuist-runner-cache. Runbook: infra/cluster-api-provider-scaleway-applesilicon/docs/scaleway-elastic-metal-support.md.
- Scope the tart-kubelet ClusterRole back to the macOS shim once the one legacy staging node (still bound to it) is recycled; new Linux nodes use system:node.
- Drop the now-obsolete tailnet pieces (Connector proxyClass pin, accept-routes / Service-CIDR advertisement for macs, MSS-clamp host rules) or keep them as fallback, since NodePort made the tailnet cache path redundant.
- Keep or remove the smoke workflow file (the Linux PR removed its scaffolding pre-merge; durable tooling lives in tuist/cache-benchmark#2).
- Cut kura-controller release tags post-merge and drop the staging image-tag pins.
- Flip the kubeletPreferredAddressTypes ClusterClass variable per env (staging → canary → production) to insert InternalIP; each flip rolls only that env’s control plane. The mechanism is committed and applies as a no-op; after staging’s flip, confirm kubectl logs / exec work against EM + macOS PN node pods. To flip: set kubeletPreferredAddressTypes: ExternalIP,InternalIP,Hostname,InternalDNS,ExternalDNS under that Cluster CR’s spec.topology.variables.
- Make the KuraInstance controller recreate the StatefulSet on a storageClassName change (the scw-bssd → scw-local-nvme migration was manual).
- Auto-retry stuck-:failed runner-cache nodes (today only current_image_tag == nil self-heals).
🤖 Generated with Claude Code
The Private Network + NodePort data plane for the Scaleway macOS runner cache is now available in runner-image@0.7.0. Update to the new runner image tags (ghcr.io/tuist/tuist-runner:macos-26-5-0.7.0, ghcr.io/tuist/tuist-runner:macos-26-4-1-0.7.0, or ghcr.io/tuist/tuist-runner:macos-26-3-0.7.0) to use this feature.