
perf(infra): reduce Grafana metric cardinality

GitHub issue · Closed

Metadata
Source: tuist/tuist #11395
Updated: Jun 24, 2026
Domains: Compute
Details

Resolves N/A

This reduces Grafana Cloud metric cardinality for the managed monitoring stack. The main issue is not pod count by itself, but high-churn labels attached to pod and runner metrics. Those labels turn otherwise equivalent samples into thousands of active series during runner scale events.
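
To make the churn effect concrete, here is a minimal sketch of how restart-scoped labels multiply active series. It assumes a single pod whose container restarts 50 times; the pod name and label values are hypothetical, and the only point is that a series is identified by its full labelset.

```python
def series_key(labels, dropped=()):
    """Identity of a series: every label except the ones a relabel rule dropped."""
    return frozenset((k, v) for k, v in labels.items() if k not in dropped)


# Hypothetical data: one runner pod, observed across 50 container restarts.
# Every restart produces a fresh container_id, so every restart is a new series.
samples = [
    {
        "__name__": "kube_pod_container_info",
        "namespace": "tuist-runners",
        "pod": "runner-abc",
        "container": "runner",
        "container_id": f"containerd://{restart:04d}",
    }
    for restart in range(50)
]

before = {series_key(s) for s in samples}
after = {series_key(s, dropped=("container_id",)) for s in samples}

print(len(before), len(after))
```

Because the restarts are sequential, the collapsed series never carries two competing samples at the same time, which is why dropping this kind of label is safe.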

KSM means kube-state-metrics. It is the Kubernetes exporter that turns object state into Prometheus metrics, for example kube_pod_status_phase, kube_pod_info, deployment status, PVC status, and node metadata.

What changed:

  • Added Grafana Cloud write relabeling for restart-scoped Kubernetes labels, unused network and service graph metrics, and low-value integration jobs.
  • Removed pod-scoped kube-state-metrics from the high-churn tuist-runners namespace while keeping kube_pod_status_unschedulable available for placement failures.
  • Added tuist_runners_pool_phase_replicas{pool,phase} in the runners-controller so the dashboard keeps the macOS ready vs cold-booting split without per-Pod KSM series.
  • Split runners-controller metric cleanup so autoscaling-disabled pools do not lose their pool phase metrics.
  • Deferred pool phase metric publishing after pod listing so retry paths still publish the latest known ready and cold-booting counts.
  • Switched the runner replica dashboard from kube_pod_status_phase to the new low-cardinality runners-controller phase metric.
  • Reduced the Alloy self integration to config-load health metrics plus the chart-required scrape health metrics.
  • Removed the unverified annotation autodiscovery instance="macos-runner" relabel from this PR. Stable per-Mac-mini instance labels remain intact for host health scrapes.
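
As a rough illustration of why the pool phase metric is low-cardinality, the sketch below counts pods per (pool, phase) pair. It is written in Python for brevity and is not the controller's Go implementation; the pool names, pod counts, and phase label values are assumptions.

```python
from collections import Counter


def pool_phase_replicas(pods):
    """Count alive pods per (pool, phase); each key corresponds to one gauge series."""
    return Counter((pod["pool"], pod["phase"]) for pod in pods)


# Hypothetical fleet: 300 Linux runners and 40 macOS runners in mixed phases.
pods = (
    [{"pool": "linux-default", "phase": "alive"} for _ in range(300)]
    + [{"pool": "macos-default", "phase": "ready"} for _ in range(30)]
    + [{"pool": "macos-default", "phase": "cold-booting"} for _ in range(10)]
)

gauge = pool_phase_replicas(pods)

# Per-pod metrics grow with the number of pods; the gauge grows with pool x phase.
print(len(pods), len(gauge))
```

Scaling the fleet up or down changes the gauge values but not the number of series, which is the property the dashboard relies on.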

Why this shape:

Blanking the pod label on kube_pod_* metrics would not safely collapse samples. Prometheus relabeling cannot aggregate duplicate labelsets, so that approach risks colliding series instead of producing namespace-level totals. Dropping pod-scoped KSM for tuist-runners is the safer cut because the runner dashboard can use RunnerPool telemetry for desired capacity and the runners-controller’s new pool phase metric for ready vs cold-booting capacity. Kura and CNPG still keep their pod-scoped metrics because their dashboards explicitly join or group by pod.
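
The collision risk can be shown with a small sketch. It mimics a relabel rule that removes the pod label and then checks how many samples claim the same labelset; the pod names and values are hypothetical.

```python
from collections import defaultdict


def blank_label(samples, label):
    """Mimic a relabel rule that removes one label; sample values are untouched."""
    return [
        ({k: v for k, v in labels.items() if k != label}, value)
        for labels, value in samples
    ]


def find_collisions(samples):
    """Return labelsets that more than one sample now claims."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[frozenset(labels.items())].append(value)
    return {key: values for key, values in groups.items() if len(values) > 1}


# Hypothetical scrape: three runner pods reporting whether they are Running.
samples = [
    ({"__name__": "kube_pod_status_phase", "namespace": "tuist-runners",
      "pod": "runner-a", "phase": "Running"}, 1),
    ({"__name__": "kube_pod_status_phase", "namespace": "tuist-runners",
      "pod": "runner-b", "phase": "Running"}, 1),
    ({"__name__": "kube_pod_status_phase", "namespace": "tuist-runners",
      "pod": "runner-c", "phase": "Running"}, 0),
]

collisions = find_collisions(blank_label(samples, "pod"))
print(list(collisions.values()))
```

The three samples end up under one labelset with competing values instead of a single summed total of 2, so the result is a duplicate-series conflict rather than a namespace-level aggregate.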

kube_pod_status_unschedulable stays available cluster-wide. It is a low-cardinality placement signal compared to the rest of kube_pod_*, and keeping it avoids breaking ad-hoc cluster checks while still removing the runner namespace’s highest-churn pod phase fan-out.
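
The intended outcome of the runner-namespace cut can be sketched as a simple predicate. This is not the rendered relabel rule from values.yaml; it only models the decision described above, under the assumption that kube_pod_status_unschedulable is the only kube_pod_* family kept for tuist-runners. The "kura" namespace name is hypothetical.

```python
def keep_sample(name, namespace):
    """Return True if a sample would still be written to Grafana Cloud."""
    if namespace != "tuist-runners":
        return True  # other namespaces keep their pod-scoped KSM metrics
    if not name.startswith("kube_pod_"):
        return True  # only pod-scoped KSM families are cut in this namespace
    return name == "kube_pod_status_unschedulable"


checks = [
    ("kube_pod_status_phase", "tuist-runners"),
    ("kube_pod_status_unschedulable", "tuist-runners"),
    ("kube_pod_info", "kura"),
]
for name, namespace in checks:
    print(name, namespace, keep_sample(name, namespace))
```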

The macOS instance relabel was intentionally removed. We should keep stable per-Mac-mini instance labels on host health scrape jobs because they are bounded and operationally useful. If we later verify a separate ephemeral runner scrape path is creating high-cardinality instance values, that relabel should be added on that specific scrape job.

Impact:

| Change | Expected impact | Tradeoff |
| --- | --- | --- |
| Drop restart-scoped labels such as container_id, uid, pod_ip, and image identity labels | Collapses series that only differ by ephemeral Kubernetes identity | Removes labels that are not used by dashboards in this repo |
| Drop unused metric families such as traces_service_graph_request_*, network drop counters, and PV phase | Removes whole unused series families before ingestion | PV phase inventory is no longer sent to Grafana Cloud, while PVC and kubelet volume usage remain available |
| Drop most kube_pod_* metrics for tuist-runners | Removes the highest-churn pod-name fan-out from runner scale events | Replaces runner phase visibility with a pool-level controller metric |
| Keep kube_pod_status_unschedulable | Preserves the placement failure signal used by the runner dashboard and ad-hoc cluster checks | Retains one pod-scoped KSM metric family cluster-wide |
| Add tuist_runners_pool_phase_replicas{pool,phase} | Preserves linux alive plus macOS ready vs cold-booting visibility with pool x phase cardinality | Adds one low-cardinality controller gauge family |
| Split metric cleanup by owner | Keeps static RunnerPools from losing phase samples when autoscaling is disabled | Leaves full cleanup to the primary RunnerPool reconciler when the pool object is gone |
| Defer phase metric publishing during reconcile | Keeps ready and cold-booting counts fresh on create/delete retry paths | Publishes latest known counts even when a later mutation fails and reconcile retries |
| Narrow Alloy self metrics to config-load health and scrape health | Keeps stale-config alerting while dropping broad Alloy runtime series | Detailed Alloy runtime internals are no longer retained in Grafana Cloud |
| Allow-list macOS node exporter metrics | Keeps host health signals used for Mac mini operations | Drops long-tail node exporter metrics that are not queried by Tuist dashboards |
| Switch RunnerPool dashboard to controller phase telemetry | Removes dashboard dependency on KSM pod phase cardinality | Phase visibility is rolled up by pool and platform, without per-Pod drilldown |
| Remove the unverified macOS runner instance relabel | Avoids normalizing the wrong scrape path and preserves per-machine host health | Ephemeral runner instance cardinality remains a follow-up until its live source is verified |
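
For the allow-list row, a minimal sketch of how a keep rule behaves may help. Prometheus-style relabel regexes are fully anchored, which re.fullmatch reproduces here. The metric names in the pattern are a hypothetical allow-list, not the one shipped in this PR.

```python
import re

# Hypothetical allow-list; the real list lives in the Helm values for this chart.
ALLOWED = re.compile(
    r"up|node_cpu_seconds_total|node_memory_.*|node_filesystem_avail_bytes"
)


def keep_metric(name):
    """Keep a metric only if the whole name matches the allow-list pattern."""
    return ALLOWED.fullmatch(name) is not None


for name in [
    "node_cpu_seconds_total",
    "node_memory_MemAvailable_bytes",
    "node_nfs_requests_total",
    "node_cpu_seconds_total_extra",
]:
    print(name, keep_metric(name))
```

An allow-list fails closed: a metric that a new exporter version introduces is dropped until someone adds it on purpose, which keeps cardinality bounded at the cost of an explicit opt-in.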

Developer impact:

The runner dashboard keeps the previous linux alive and macOS ready vs cold-booting split, but the source is now runners-controller telemetry instead of KSM pod phase series. The unschedulable runner signal (kube_pod_status_unschedulable) remains available for placement failures. Stable per-Mac-mini host health labels are preserved.

How to test locally

  • python3 -m json.tool infra/grafana-dashboards/runners.json >/tmp/runners-json-check.json
  • From infra/runners-controller: ~/.local/share/mise/installs/go/1.25.10/bin/gofmt -w internal/metrics/metrics.go internal/metrics/metrics_test.go controllers/autoscaler_controller.go controllers/runnerpool_controller.go
  • From infra/runners-controller: ~/.local/share/mise/installs/go/1.25.10/bin/go test ./...
  • helm lint infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-staging.yaml
  • helm lint infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-canary.yaml
  • helm lint infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-production.yaml
  • helm template tuist-k8s-monitoring infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-staging.yaml > /tmp/tuist-k8s-monitoring-staging.yaml
  • helm template tuist-k8s-monitoring infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-canary.yaml > /tmp/tuist-k8s-monitoring-canary.yaml
  • helm template tuist-k8s-monitoring infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-production.yaml > /tmp/tuist-k8s-monitoring-production.yaml
  • rg -n "kube_pod_status_unschedulable|__tmp_keep_runner_pod_metric|container_id|integrations/process|integrations/self|node_network_\(receive\|transmit\)_drop_total|traces_service_graph_request" /tmp/tuist-k8s-monitoring-staging.yaml /tmp/tuist-k8s-monitoring-canary.yaml /tmp/tuist-k8s-monitoring-production.yaml
  • rg -n 'macos-runner|__meta_kubernetes_pod_label_tuist_dev_runner|regex = "kube_pod_status_unschedulable;"' infra/helm/k8s-monitoring/values.yaml /tmp/tuist-k8s-monitoring-staging.yaml /tmp/tuist-k8s-monitoring-canary.yaml /tmp/tuist-k8s-monitoring-production.yaml
    This returned no matches, confirming that the unverified macos-runner relabel and the old non-runner unschedulable drop no longer render.
  • git diff --check

Alloy CLI is not installed locally, so Alloy syntax validation was not run.

Comments
pepicrft Jun 19, 2026

@fortmarek addressed in 2deedbabae.

  • Split metric cleanup so the autoscaler only clears autoscaler-owned gauges. Static RunnerPools now keep tuist_runners_pool_phase_replicas{pool,phase} even when autoscaling is missing or disabled.
  • Moved phase publishing to a deferred record after the pod list. The counts are initialized from the listed alive Pods and updated only after successful create/delete mutations, so createRunner or scale-down failures still publish the latest known ready and cold-booting counts before retrying.
  • Removed the annotation autodiscovery instance="macos-runner" relabel from this PR. Stable per-Mac-mini scrape labels stay intact, and any ephemeral runner instance normalization can be added later on the scrape job that is confirmed to emit those series.
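
The deferred publishing described in the second bullet can be sketched as follows. The controller is written in Go and would use defer; this Python sketch uses try/finally as the closest equivalent. The function names, the callback-style publish hook, and the choice to count new runners as cold-booting are assumptions for illustration.

```python
def reconcile(listed_pods, desired, create_runner, publish):
    """Scale up toward `desired`, always publishing the latest known counts."""
    # Counts start from the pods that were actually listed as alive.
    counts = {"ready": 0, "cold-booting": 0}
    for pod in listed_pods:
        counts[pod["phase"]] += 1
    try:
        while sum(counts.values()) < desired:
            create_runner()              # may raise
            counts["cold-booting"] += 1  # updated only after a successful create
    finally:
        publish(dict(counts))            # runs on success and on failure


published = []
calls = {"n": 0}


def flaky_create():
    """Hypothetical createRunner stand-in that fails on its second call."""
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("createRunner failed")


try:
    reconcile(
        [{"phase": "ready"}, {"phase": "cold-booting"}],
        desired=5,
        create_runner=flaky_create,
        publish=published.append,
    )
except RuntimeError:
    pass  # a real controller would requeue the reconcile here

print(published)
```

The failed second create does not prevent publishing: the gauge reflects the one runner that was created before the error, so the dashboard stays current while the reconcile retries.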

Validation rerun: runners-controller go test ./..., dashboard JSON check, k8s-monitoring Helm lint/template for staging/canary/production, rendered rule greps, and git diff --check.