
fix(server): tailnet-expose CNPG pooler for xcresult-processor

GitHub issue · Closed

Metadata
Source: tuist/tuist #11134
Updated: Jun 24, 2026
Domains: Compute
Details

Summary

After the CNPG cutover, process_xcresult stopped draining: by the time the staging fix landed, 700+ jobs were queued and none had executed. The xcresult-processor BEAM running inside Tart VMs on the macOS fleet couldn’t reach the in-cluster CNPG pooler. Before the cutover, the database URL was a public Supabase hostname, which worked because it was routed through the VM’s internet egress; after the cutover it pointed at an in-cluster Service that the VM has no path to.

The Mac mini host is on the tailnet (tag:tuist-macmini-<env>). The Tart VM running on it isn’t. tart-cri’s CNI does vmnet shared-NAT only — the VM has no K8s overlay membership and no tailnet identity, so cluster ClusterIPs and tailnet CGNAT addresses are both unreachable. Linux runner microVMs sidestep this through Kata’s CNI integration (overlay membership); macOS Tart VMs need a different answer.

What landed

Tailscale-expose the CNPG pooler. The Pooler CR’s serviceTemplate.metadata.annotations carry tailscale.com/expose: "true" + tailscale.com/hostname: tuist-pg-pooler-<env> + tailscale.com/tags: tag:tuist-k8s-<env>. The Tailscale operator provisions a userspace proxy on the tailnet that fronts the in-cluster pooler ClusterIP.
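As a rough illustration of the per-env annotation set described above, the sketch below builds the three annotations for a given environment. The function name pooler_service_annotations is hypothetical; only the annotation keys and value patterns come from this PR, and the chart itself renders them through Helm templates rather than Python.

```python
from typing import Dict


def pooler_service_annotations(env: str) -> Dict[str, str]:
    """Build the annotations placed under the Pooler CR's
    serviceTemplate.metadata.annotations for one environment.

    Hypothetical helper: the real chart renders these via Helm values.
    """
    return {
        # Ask the Tailscale operator to expose this Service on the tailnet.
        "tailscale.com/expose": "true",
        # Tailnet hostname the xcresult-processor VM will dial.
        "tailscale.com/hostname": f"tuist-pg-pooler-{env}",
        # Tag applied to the operator-managed proxy.
        "tailscale.com/tags": f"tag:tuist-k8s-{env}",
    }
```

For example, pooler_service_annotations("staging") yields the hostname tuist-pg-pooler-staging, which is the name the validation section resolves through /etc/hosts.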

tailscaled inside each Tart VM. The xcresult-processor Tart image’s Packer build runs brew install tailscale + tailscaled install-system-daemon. A new tailscale-up.sh runs at boot before the BEAM, joins the tailnet using the auth key the K8s Deployment injects via TAILSCALE_AUTH_KEY (reuses the same per-env tag:tuist-macmini-<env> auth key the host bootstrap uses), and pins each tailnet peer’s IPv4 + hostname into /etc/hosts from tailscale status --json. The /etc/hosts rewrite works around a documented quirk: the open-source tailscaled variant on macOS — the only headless choice, since the App Store variant is GUI-only — can’t reliably push DNS into scutil, so --accept-dns=true is effectively a no-op for the BEAM’s gethostbyname.
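The /etc/hosts pinning step can be sketched as follows. This is not the contents of tailscale-up.sh (which is a shell script); it is a minimal Python approximation that assumes the tailscale status --json output has a top-level Peer map whose entries carry HostName and TailscaleIPs fields. The function names are hypothetical, and whether the real script keys on HostName or DNSName is not stated in this PR.

```python
import ipaddress
import json
from typing import List, Optional


def first_ipv4(addresses) -> Optional[str]:
    """Return the first IPv4 address in a list of address strings, if any."""
    for addr in addresses or []:
        try:
            if ipaddress.ip_address(addr).version == 4:
                return addr
        except ValueError:
            continue  # Skip anything that is not a valid IP literal.
    return None


def hosts_entries(status: dict) -> List[str]:
    """Turn parsed `tailscale status --json` output into /etc/hosts lines.

    Assumes each peer has `HostName` and `TailscaleIPs`; peers missing
    either field are skipped.
    """
    lines = []
    for peer in (status.get("Peer") or {}).values():
        ip = first_ipv4(peer.get("TailscaleIPs"))
        name = peer.get("HostName")
        if ip and name:
            lines.append(f"{ip}\t{name}")
    return sorted(lines)  # Sorted so repeated runs produce identical output.


def hosts_entries_from_json(raw: str) -> List[str]:
    """Convenience wrapper for raw JSON text captured from the CLI."""
    return hosts_entries(json.loads(raw))
```

Emitting sorted lines keeps the rewrite deterministic, so a boot script could compare the new block against the existing one and skip the write when nothing changed.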

Idempotent VM boot. tailscale up --reset runs only on first boot: tailscale-up.sh checks tailscale ip -4 first and skips the heavy registration when the VM has already joined. Without this check, launchd’s KeepAlive would re-run --reset on every BEAM exit, race the 30s tailscale ip -4 wait, time out, exit 1, and crash-loop forever (staging hit runs = 2960 before this fix landed).
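The join-once logic above can be sketched like this. It is a hedged Python approximation of the control flow, not the actual shell script: the command runner is injected so the logic can be exercised without tailscale installed, ensure_joined and tailnet_ipv4 are hypothetical names, and treating a non-zero exit from tailscale ip -4 as "not joined" is an assumption.

```python
import ipaddress
import time
from typing import Callable, Optional, Sequence, Tuple

# run(argv) -> (exit_code, stdout). Injected so tests can fake the CLI.
Runner = Callable[[Sequence[str]], Tuple[int, str]]


def tailnet_ipv4(run: Runner) -> Optional[str]:
    """Return this node's tailnet IPv4, or None if it has not joined."""
    code, out = run(["tailscale", "ip", "-4"])
    if code != 0:
        return None
    lines = out.strip().splitlines()
    candidate = lines[0].strip() if lines else ""
    try:
        return candidate if ipaddress.ip_address(candidate).version == 4 else None
    except ValueError:
        return None


def ensure_joined(run: Runner, auth_key: str, hostname: str,
                  wait_seconds: int = 30,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Join the tailnet only when the node is not already a member."""
    if tailnet_ipv4(run):
        return "already-joined"  # Skip the heavy registration entirely.
    code, _ = run([
        "tailscale", "up", "--reset",
        f"--auth-key={auth_key}",
        f"--hostname={hostname}",
        "--ssh=true",
        "--accept-dns=true",
    ])
    if code != 0:
        raise RuntimeError("tailscale up failed")
    # Poll roughly once per second until an IPv4 is assigned.
    for _ in range(wait_seconds):
        if tailnet_ipv4(run):
            return "joined"
        sleep(1)
    raise TimeoutError("no tailnet IPv4 after waiting")
```

The early return is what breaks the crash loop: a launchd restart after a BEAM exit takes the "already-joined" branch and never re-enters the registration and wait path.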

Tailscale SSH on the VM + ACL grant for diagnostics. tailscale-up.sh sets --ssh=true. infra/tailscale/acls.json carries an action: accept SSH grant from each env’s tag:tuist-k8s-<env> to its matching tag:tuist-macmini-<env>, so the cluster-side subnet-router pod can run tailscale ssh admin@<vm-tailnet-ip> to get a diagnostic shell. This is the substitute for kubectl logs / kubectl exec, which fail against tart-kubelet because the K8s apiserver can’t DNS-resolve Mac mini hostnames.
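For orientation, the same-env SSH grants might look roughly like the entries generated below. This is a sketch under assumptions: the entry shape (action, src, dst, users) follows the ssh section of a Tailscale policy file as commonly documented, the admin user is taken from the tailscale ssh admin@... invocation above, and the actual contents of infra/tailscale/acls.json are not reproduced in this PR description. The ssh_grants helper is hypothetical.

```python
import json
from typing import Dict, List, Sequence

ENVS = ("staging", "canary", "production")


def ssh_grants(envs: Sequence[str] = ENVS, user: str = "admin") -> List[Dict]:
    """One accept rule per env, from the k8s tag to the matching macmini tag.

    Hypothetical generator; the repo keeps these rules as static JSON.
    """
    return [
        {
            "action": "accept",
            "src": [f"tag:tuist-k8s-{env}"],
            "dst": [f"tag:tuist-macmini-{env}"],
            "users": [user],
        }
        for env in envs
    ]


if __name__ == "__main__":
    # Print the fragment that would sit under the policy file's "ssh" key.
    print(json.dumps({"ssh": ssh_grants()}, indent=2))
```

Keeping src and dst on the same env suffix is the point of the grant: a staging cluster pod gets no SSH path to canary or production VMs.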

Chart wiring (infra/helm/tuist/):

  • processor-external-secrets.yaml renders a second URL key xcresult-processor-database-url pointing at the Tailscale-exposed pooler hostname. The xcresult-processor Deployment opts in via the existing xcresultProcessor.databaseUrlSecretKey override. Linux processors keep the direct in-cluster URL via processor-database-url.
  • xcresult-processor Deployment gains TAILSCALE_AUTH_KEY (from the macOS-fleet ExternalSecret-synced Secret) + TAILSCALE_HOSTNAME (Downward API → Pod name) env vars when xcresultProcessor.tailscale.enabled is true.
  • Per-env values (staging, canary, production) flip xcresultProcessor.tailscale.enabled=true and postgresql.cnpg.pooler.tailscale.enabled=true.
  • xcresultProcessor.image.tag pinned to c6d08f92 until release-xcresult-processor-image rewrites it.

Why not the alternatives

  • Advertise the K8s service CIDR via the existing subnet router. Single Connector edit, but exposes every ClusterIP across the cluster to the whole tailnet — permanent ACL surface. Per-Service exposure via the Tailscale operator is surgical.
  • Public-expose the pooler via Hetzner LB. Treats xcresult-processor as a third-party client of a public DB, which is the opposite of how Linux processors are modeled. Bigger credential surface.
  • Host-side TCP forwarder. Earlier iteration that’s now stripped — see commit refactor(server): strip the host-tcp-forwarder code path. The shape “VM dials its vmnet gateway, host relays” fights macOS’ default vmnet semantics: empirically the VM’s outbound packets to the host’s listener never arrived, even with the daemon binding 0.0.0.0. Replaced by tailscaled-in-VM which makes each VM a first-class tailnet member instead of routing around its lack of identity.
  • Split processor into Linux Oban consumer + macOS gRPC backend. The architecturally cleanest answer (no tailnet hacks), but a multi-day server-code change — disproportionate when the in-VM tailscaled approach matches the host’s pattern and fits in a Packer step.

Validation

Proven end-to-end on staging via the chart’s xcresult-processor-database-url rendering:

BEAM (Tart VM, tailnet 100.112.76.33)
→ resolves tuist-pg-pooler-staging via /etc/hosts (libc resolver)
→ dials 100.125.134.53:5432 via utun4 (tailscaled WireGuard)
→ Tailscale operator-managed proxy
→ CNPG PgBouncer pooler ClusterIP
→ CNPG primary
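The first two hops of the path above (libc name resolution, then a TCP dial to port 5432) can be checked from inside a VM with a small probe like the one below. This is a hypothetical diagnostic, not something this PR ships; can_reach is an illustrative name. It only confirms that a TCP connection opens and says nothing about PgBouncer authentication.

```python
import socket


def can_reach(host: str, port: int = 5432, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    socket.create_connection resolves the name through the system
    resolver, so an /etc/hosts pin is honored the same way it would be
    for any other libc client.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers resolution failures, refusals, and timeouts.
        return False
```

Called as can_reach("tuist-pg-pooler-staging") from the VM, a False result points at either the hosts pin or the tailnet path rather than at the database itself.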

BEAM-side PromEx metrics (via tailnet scrape of port 9091 inside the VM):

tuist_repo_pool_ready_conn_count{repo="postgres"} 10
tuist_repo_pool_size{repo="postgres"} 10
tuist_repo_pool_checkout_queue_starved_samples{repo="postgres"} 0

10/10 healthy Postgres connections. launchd runs = 1, state = running (vs runs = 2960 on the crash-loop iteration). tailscale ssh admin@<vm> works through the new ACL grant.
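The "pool full" reading can be turned into a small check over the scraped text. The sketch below is an assumption-laden illustration: it parses only simple Prometheus exposition lines, matches the label set as an exact string, and defines "healthy" as ready count equal to pool size with zero starved samples, which is one interpretation of the numbers above rather than a rule stated in this PR. The function names are hypothetical.

```python
import re
from typing import Dict, Tuple

# metric_name{labels} value  (labels optional; timestamps ignored)
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)


def parse_samples(text: str) -> Dict[Tuple[str, str], float]:
    """Map (metric name, raw label string) to the sample value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blanks and HELP/TYPE comments.
        m = SAMPLE_RE.match(line)
        if m:
            samples[(m["name"], m["labels"] or "")] = float(m["value"])
    return samples


def pool_is_healthy(text: str, repo: str = "postgres") -> bool:
    """True when every pooled connection is ready and none were starved."""
    s = parse_samples(text)
    labels = f'repo="{repo}"'
    ready = s.get(("tuist_repo_pool_ready_conn_count", labels))
    size = s.get(("tuist_repo_pool_size", labels))
    starved = s.get(("tuist_repo_pool_checkout_queue_starved_samples", labels))
    # A missing metric counts as unhealthy rather than silently passing.
    return ready is not None and size is not None and ready == size and starved == 0
```

Fed the three sample lines above, pool_is_healthy returns True; a scrape where the ready count lags the pool size returns False.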

pg_stat_activity does not show tuist_processor because PgBouncer in transaction mode pools backend connections under cnpg_pooler_pgbouncer — the BEAM-side pool metric is the authoritative signal.

Test plan

  • Local: helm template for staging / canary / production renders the Pooler serviceTemplate annotations, the second URL key in the ExternalSecret, the env injection in the xcresult Deployment, the same-env SSH grants in acls.json.
  • Operator + bootstrap modules: go build, go vet, go test all clean.
  • Staging deploy: image c6d08f92 rolled, VM joined tailnet, /etc/hosts pinned, BEAM connected, pool full.
  • CI: chart-test + go-build + helm-lint + ACL check.
  • Canary cascade on merge: pooler proxy provisions for tuist-pg-pooler-canary, VM connects.
  • Production cascade on merge: same against tuist-pg-pooler-production. Drains the 700+ backlog.

Rollout notes

  • xcresultProcessor.image.tag (c6d08f92) and macosFleet.image.tag (b8a5aa79) are pinned to SHA builds from this PR’s dispatched workflow runs. release-xcresult-processor-image and release-capi-scaleway will rewrite them to semver tags on the next push to main that matches their respective conventional-commit scopes.
  • Existing Mac minis pick up the new operator binary on next reconcile. Each VM rolls when the xcresult-processor Deployment rolls (helm upgrade replaces the Pod, tart-kubelet boots a new Tart VM from the new image tag).
  • The Tailscale ACL diff (three action: accept SSH grants) was applied to the admin console before the staging diagnosis. The repo file is now in sync with the live ACL.
Comments
tuist-atlas[bot] Jun 9, 2026

The change to tailnet-expose the CNPG pooler for xcresult-processor is now available in xcresult-processor-image@0.12.3. Update to this version to use it.

tuist-atlas[bot] Jun 9, 2026

The changes from this pull request are now available in runner-image@0.3.1. Update to this version to receive the tailnet-exposed CNPG pooler fix for the xcresult-processor.