perf(infra): reduce Grafana metric cardinality

GitHub issue · Closed

Metadata

Source: tuist/tuist #11395
Updated: Jun 24, 2026
Domains: Compute

Details

Resolves N/A

This reduces Grafana Cloud metric cardinality for the managed monitoring stack. The main issue is not pod count by itself, but high-churn labels attached to pod and runner metrics. Those labels turn otherwise equivalent samples into thousands of active series during runner scale events.
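To make the fan-out concrete, the following minimal Python sketch counts distinct label sets before and after ephemeral labels are dropped. The metric name, pod names, and restart counts are invented for illustration, and nothing here is taken from the actual pipeline; it only models the arithmetic.

```python
def active_series(samples, dropped_labels=()):
    """Count distinct label sets once the given label names are removed."""
    distinct = set()
    for labels in samples:
        kept = tuple(sorted(
            (name, value) for name, value in labels.items()
            if name not in dropped_labels
        ))
        distinct.add(kept)
    return len(distinct)


# Hypothetical data: 3 runner pods, each seen with 4 different
# container_id/uid pairs over time (one pair per restart).
samples = [
    {
        "__name__": "example_container_metric",  # illustrative name only
        "namespace": "tuist-runners",
        "pod": f"runner-{pod}",
        "container_id": f"cid-{pod}-{restart}",
        "uid": f"uid-{pod}-{restart}",
    }
    for pod in range(3)
    for restart in range(4)
]

before = active_series(samples)
after = active_series(samples, dropped_labels=("container_id", "uid"))
print(before, after)  # 12 3
```

In this toy data the pod name is stable, so dropping the restart-scoped labels is enough. In a namespace where pod names themselves churn, the pod label keeps multiplying series, which is the case the pod-scoped KSM change addresses.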

KSM means kube-state-metrics. It is the Kubernetes exporter that turns object state into Prometheus metrics, for example kube_pod_status_phase, kube_pod_info, deployment status, PVC status, and node metadata.

What changed:

Added Grafana Cloud write relabeling for restart-scoped Kubernetes labels, unused network and service graph metrics, and low-value integration jobs.
Removed pod-scoped kube-state-metrics from the high-churn tuist-runners namespace while keeping kube_pod_status_unschedulable available for placement failures.
Added tuist_runners_pool_phase_replicas{pool,phase} in the runners-controller so the dashboard keeps the macOS ready vs cold-booting split without per-Pod KSM series.
Split runners-controller metric cleanup so autoscaling-disabled pools do not lose their pool phase metrics.
Deferred pool phase metric publishing after pod listing so retry paths still publish the latest known ready and cold-booting counts.
Switched the runner replica dashboard from kube_pod_status_phase to the new low-cardinality runners-controller phase metric.
Reduced the Alloy self integration to config-load health metrics plus the chart-required scrape health metrics.
Removed the unverified annotation autodiscovery instance="macos-runner" relabel from this PR. Stable per-Mac-mini instance labels remain intact for host health scrapes.
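The tuist_runners_pool_phase_replicas{pool,phase} gauge from the list above can be modeled as a plain aggregation over listed Pods. This is a rough Python sketch, not the controller code (which lives in infra/runners-controller); the pool names, the phase values "alive", "ready", and "cold-booting", and the pod record shape are assumptions.

```python
from collections import Counter


def pool_phase_replicas(pods):
    """Collapse per-pod records into one count per (pool, phase) pair."""
    return Counter((pod["pool"], pod["phase"]) for pod in pods)


# Hypothetical pod list: names churn, pools and phases do not.
pods = [
    {"name": "macos-runner-1", "pool": "macos", "phase": "ready"},
    {"name": "macos-runner-2", "pool": "macos", "phase": "ready"},
    {"name": "macos-runner-3", "pool": "macos", "phase": "ready"},
    {"name": "macos-runner-4", "pool": "macos", "phase": "cold-booting"},
    {"name": "macos-runner-5", "pool": "macos", "phase": "cold-booting"},
    {"name": "linux-runner-1", "pool": "linux", "phase": "alive"},
    {"name": "linux-runner-2", "pool": "linux", "phase": "alive"},
]

gauge = pool_phase_replicas(pods)
for (pool, phase), count in sorted(gauge.items()):
    print(f'tuist_runners_pool_phase_replicas{{pool="{pool}",phase="{phase}"}} {count}')
```

Seven pods produce three series here, and adding more pods to an existing pool and phase changes only the values, not the number of series.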

Why this shape:

Blanking the pod label on kube_pod_* metrics would not safely collapse samples. Prometheus relabeling cannot aggregate duplicate labelsets, so that approach risks colliding series instead of producing namespace-level totals. Dropping pod-scoped KSM for tuist-runners is the safer cut because the runner dashboard can use RunnerPool telemetry for desired capacity and the runners-controller’s new pool phase metric for ready vs cold-booting capacity. Kura and CNPG still keep their pod-scoped metrics because their dashboards explicitly join or group by pod.
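The collision risk can be illustrated with a small Python model. It assumes relabeling acts on each sample independently, and the pod names and values are invented; it is a sketch of the failure mode, not of the Alloy configuration.

```python
from collections import defaultdict


def drop_label(samples, label):
    """Remove one label from every sample; no values are combined."""
    return [
        ({k: v for k, v in labels.items() if k != label}, value)
        for labels, value in samples
    ]


def group_by_labelset(samples):
    """Group sample values by their full label set."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[tuple(sorted(labels.items()))].append(value)
    return dict(groups)


base = {"__name__": "kube_pod_status_phase",
        "namespace": "tuist-runners", "phase": "Running"}
samples = [
    ({**base, "pod": "runner-a"}, 1),
    ({**base, "pod": "runner-b"}, 1),
    ({**base, "pod": "runner-c"}, 0),  # this pod is in another phase
]

collided = group_by_labelset(drop_label(samples, "pod"))
values = next(iter(collided.values()))
print(len(collided), values)  # 1 [1, 1, 0]
```

After the pod label is removed, three samples compete for a single label set with values 1, 1, and 0. The namespace-level total of 2 is never computed at this stage, so any total has to come from query-time aggregation or from a metric that is already aggregated at the source.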

kube_pod_status_unschedulable stays available cluster-wide. It is a low-cardinality placement signal compared to the rest of kube_pod_*, and keeping it avoids breaking ad-hoc cluster checks while still removing the runner namespace’s highest-churn pod phase fan-out.
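The resulting keep/drop decision for pod-scoped KSM metrics can be written as a small decision function. This Python sketch only models the behavior described here; the real rules are Alloy relabel rules that are not shown in this description, the function name is hypothetical, and the PR may keep further exceptions beyond the one modeled.

```python
import re

RUNNER_NAMESPACE = "tuist-runners"
POD_METRIC = re.compile(r"kube_pod_.*")
KEPT_POD_METRIC = "kube_pod_status_unschedulable"


def should_drop(metric_name, namespace):
    """Return True when a sample would be dropped before remote write."""
    if namespace != RUNNER_NAMESPACE:
        return False  # other namespaces keep their pod-scoped metrics
    if metric_name == KEPT_POD_METRIC:
        return False  # placement failures stay visible everywhere
    return POD_METRIC.fullmatch(metric_name) is not None


checks = [
    ("kube_pod_status_phase", "tuist-runners"),
    ("kube_pod_status_unschedulable", "tuist-runners"),
    ("kube_pod_info", "some-other-namespace"),
    ("kube_deployment_status_replicas", "tuist-runners"),
]
for name, namespace in checks:
    print(name, namespace, should_drop(name, namespace))
```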

The macOS instance relabel was intentionally removed. We should keep stable per-Mac-mini instance labels on host health scrape jobs because they are bounded and operationally useful. If we later verify a separate ephemeral runner scrape path is creating high-cardinality instance values, that relabel should be added on that specific scrape job.

Impact:

| Change | Expected impact | Tradeoff |
| --- | --- | --- |
| Drop restart-scoped labels such as `container_id`, `uid`, `pod_ip`, and image identity labels | Collapses series that only differ by ephemeral Kubernetes identity | Removes labels that are not used by dashboards in this repo |
| Drop unused metric families such as `traces_service_graph_request_*`, network drop counters, and PV phase | Removes whole unused series families before ingestion | PV phase inventory is no longer sent to Grafana Cloud, while PVC and kubelet volume usage remain available |
| Drop most `kube_pod_*` metrics for `tuist-runners` | Removes the highest-churn pod-name fan-out from runner scale events | Replaces runner phase visibility with a pool-level controller metric |
| Keep `kube_pod_status_unschedulable` | Preserves the placement failure signal used by the runner dashboard and ad-hoc cluster checks | Retains one pod-scoped KSM metric family cluster-wide |
| Add `tuist_runners_pool_phase_replicas{pool,phase}` | Preserves linux alive plus macOS ready vs cold-booting visibility with pool x phase cardinality | Adds one low-cardinality controller gauge family |
| Split metric cleanup by owner | Keeps static RunnerPools from losing phase samples when autoscaling is disabled | Leaves full cleanup to the primary RunnerPool reconciler when the pool object is gone |
| Defer phase metric publishing during reconcile | Keeps ready and cold-booting counts fresh on create/delete retry paths | Publishes latest known counts even when a later mutation fails and reconcile retries |
| Narrow Alloy self metrics to config-load health and scrape health | Keeps stale-config alerting while dropping broad Alloy runtime series | Detailed Alloy runtime internals are no longer retained in Grafana Cloud |
| Allow-list macOS node exporter metrics | Keeps host health signals used for Mac mini operations | Drops long-tail node exporter metrics that are not queried by Tuist dashboards |
| Switch RunnerPool dashboard to controller phase telemetry | Removes dashboard dependency on KSM pod phase cardinality | Phase visibility is rolled up by pool and platform, without per-Pod drilldown |
| Remove the unverified macOS runner `instance` relabel | Avoids normalizing the wrong scrape path and preserves per-machine host health | Ephemeral runner `instance` cardinality remains a follow-up until its live source is verified |
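The "Split metric cleanup by owner" row can be pictured as two cleanup paths with different scopes. The Python sketch below is a toy model: the autoscaler gauge name, the series keying, and both function names are hypothetical, and only tuist_runners_pool_phase_replicas comes from this PR.

```python
# Series are keyed by (metric name, pool, phase); phase is "" when unused.
AUTOSCALER_OWNED = {"example_autoscaler_desired_replicas"}  # hypothetical name


def clear_autoscaler_gauges(series, pool):
    """Autoscaler path: remove only autoscaler-owned gauges for the pool."""
    return {
        key: value for key, value in series.items()
        if not (key[1] == pool and key[0] in AUTOSCALER_OWNED)
    }


def clear_pool(series, pool):
    """RunnerPool path: remove every series once the pool object is gone."""
    return {key: value for key, value in series.items() if key[1] != pool}


series = {
    ("example_autoscaler_desired_replicas", "macos", ""): 4,
    ("tuist_runners_pool_phase_replicas", "macos", "ready"): 3,
    ("tuist_runners_pool_phase_replicas", "macos", "cold-booting"): 1,
}

after_autoscaler = clear_autoscaler_gauges(series, "macos")
after_delete = clear_pool(series, "macos")
print(sorted(after_autoscaler))
print(after_delete)  # {}
```

Disabling autoscaling for a pool exercises only the first path, so the phase gauges survive; the second path runs only when the pool itself is deleted.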

Developer impact:

The runner dashboard keeps the previous Linux alive and macOS ready vs cold-booting split, but the source is now runners-controller telemetry instead of KSM pod phase series. The unschedulable runner signal (kube_pod_status_unschedulable) remains available for placement failures. Stable per-Mac-mini host health labels are preserved.

How to test locally

python3 -m json.tool infra/grafana-dashboards/runners.json >/tmp/runners-json-check.json
~/.local/share/mise/installs/go/1.25.10/bin/gofmt -w internal/metrics/metrics.go internal/metrics/metrics_test.go controllers/autoscaler_controller.go controllers/runnerpool_controller.go (run from infra/runners-controller)
~/.local/share/mise/installs/go/1.25.10/bin/go test ./... (run from infra/runners-controller)
helm lint infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-staging.yaml
helm lint infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-canary.yaml
helm lint infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-production.yaml
helm template tuist-k8s-monitoring infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-staging.yaml > /tmp/tuist-k8s-monitoring-staging.yaml
helm template tuist-k8s-monitoring infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-canary.yaml > /tmp/tuist-k8s-monitoring-canary.yaml
helm template tuist-k8s-monitoring infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-production.yaml > /tmp/tuist-k8s-monitoring-production.yaml
rg -n "kube_pod_status_unschedulable|__tmp_keep_runner_pod_metric|container_id|integrations/process|integrations/self|node_network_\(receive\|transmit\)_drop_total|traces_service_graph_request" /tmp/tuist-k8s-monitoring-staging.yaml /tmp/tuist-k8s-monitoring-canary.yaml /tmp/tuist-k8s-monitoring-production.yaml
rg -n 'macos-runner|__meta_kubernetes_pod_label_tuist_dev_runner|regex = "kube_pod_status_unschedulable;"' infra/helm/k8s-monitoring/values.yaml /tmp/tuist-k8s-monitoring-staging.yaml /tmp/tuist-k8s-monitoring-canary.yaml /tmp/tuist-k8s-monitoring-production.yaml
The rg command above returned no matches, confirming that the unverified macos-runner relabel and the old non-runner unschedulable drop no longer render.
git diff --check

Alloy CLI is not installed locally, so Alloy syntax validation was not run.

Comments

pepicrft Jun 19, 2026

@fortmarek addressed in 2deedbabae.

Split metric cleanup so the autoscaler only clears autoscaler-owned gauges. Static RunnerPools now keep tuist_runners_pool_phase_replicas{pool,phase} even when autoscaling is missing or disabled.
Moved phase publishing to a deferred record after the pod list. The counts are initialized from the listed alive Pods and updated only after successful create/delete mutations, so createRunner or scale-down failures still publish the latest known ready and cold-booting counts before retrying.
Removed the annotation autodiscovery instance="macos-runner" relabel from this PR. Stable per-Mac-mini scrape labels stay intact, and any ephemeral runner instance normalization can be added later on the scrape job that is confirmed to emit those series.
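The deferred phase publishing described above can be sketched as follows. This is a Python model in which try/finally stands in for the controller's deferred record; the function names, the exception type, and the assumption that a newly created runner starts in a "cold-booting" phase are all illustrative and not taken from the controller source.

```python
from collections import Counter


class CreateFailed(Exception):
    """Stand-in for a failed runner creation."""


def reconcile(listed_pods, desired, create_runner, publish):
    # Counts start from the Pods that were just listed.
    counts = Counter(pod["phase"] for pod in listed_pods)
    try:
        while sum(counts.values()) < desired:
            create_runner()               # may raise; counts stay untouched
            counts["cold-booting"] += 1   # updated only after success
    finally:
        publish(dict(counts))             # runs on success and on failure


published = []
attempts = {"n": 0}


def flaky_create():
    attempts["n"] += 1
    if attempts["n"] == 2:
        raise CreateFailed("second create failed")


try:
    reconcile([{"phase": "ready"}], desired=3,
              create_runner=flaky_create, publish=published.append)
except CreateFailed:
    pass  # a real controller would requeue and retry here

print(published)  # [{'ready': 1, 'cold-booting': 1}]
```

The second create fails, yet the one successful creation is still reflected in the published counts before the retry.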

Validation rerun: runners-controller go test ./..., dashboard JSON check, k8s-monitoring Helm lint/template for staging/canary/production, rendered rule greps, and git diff --check.