perf(infra): trim Grafana metrics baseline (go/cilium/node/runner)

GitHub issue · Closed

Metadata
Source: tuist/tuist #11420
Updated: Jun 24, 2026
Domains: Compute

Details

Continues the Grafana Cloud metrics cardinality cleanup (follows #11395, #11383, #11337). These rules trim the steady-state active-series baseline that survives the existing allow-lists. Every candidate drop was checked against live Grafana Cloud usage, so nothing on a dashboard or alert is removed.

What changed

All in the k8s-monitoring wrapper values (infra/helm/k8s-monitoring/values.yaml):

  • annotationAutodiscovery excludeMetrics: drop go_.* (the server/processor run on the BEAM, not Go, so this is infra-pod noise only) and the Cilium agent-internal / BPF / datapath / per-endpoint-policy counters that no dashboard or alert reads.
  • hostMetrics.linuxHosts: drop the unqueried node_cpu_seconds_total modes (guest|guest_nice|steal|irq|softirq; dashboards only read mode="idle") and virtual network interfaces (keep only en*/eth*, matching the dashboards’ device=~"e(n|th).*" filter). extraMetricProcessingRules runs after the allow-list keep, so it trims within the kept set.
  • kube-state-metrics: drop kube_pod_status_phase for terminal pods (Succeeded/Failed). We page on container crash/waiting signals, not terminal phase, and no dashboard reads it.
  • cAdvisor excludeNamespaces: [tuist-runners]: same rationale (and same namespace) as the existing kube_pod_* runner drop. The runner dashboard uses low-cardinality RunnerPool telemetry, not per-pod container metrics. (cAdvisor only sees the Kata pods’ sandbox network, so these are network-only series.)
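
To make the drop semantics above concrete, here is a minimal Python sketch of how such rules behave on individual series. It is not the Alloy config itself. The rule shapes are assumptions: the "name;label" join mirrors the rendered rule grepped for in the test steps, the kube_pod_status_phase pattern and the node_network_ prefix are hypothetical, and re.fullmatch stands in for the fully anchored regexes used by Prometheus-style relabeling.

```python
import re

# Assumed rule shapes (hypothetical, for illustration only).
GO_DROP = re.compile(r"go_.*")
CPU_MODE_DROP = re.compile(r"node_cpu_seconds_total;(guest|guest_nice|steal|irq|softirq)")
PHASE_DROP = re.compile(r"kube_pod_status_phase;(Succeeded|Failed)")
DEVICE_KEEP = re.compile(r"e(n|th).*")  # same pattern as the dashboards' device filter


def keep_series(name, labels):
    """Return True if a series would survive the drop rules sketched above."""
    if GO_DROP.fullmatch(name):
        return False
    # Relabel rules join source labels with ';' before matching.
    if CPU_MODE_DROP.fullmatch(f"{name};{labels.get('mode', '')}"):
        return False
    if PHASE_DROP.fullmatch(f"{name};{labels.get('phase', '')}"):
        return False
    # Assumption: the device filter applies to node_network_* series only.
    if name.startswith("node_network_") and "device" in labels:
        return DEVICE_KEEP.fullmatch(labels["device"]) is not None
    return True


if __name__ == "__main__":
    print(keep_series("node_cpu_seconds_total", {"mode": "idle"}))
    print(keep_series("node_cpu_seconds_total", {"mode": "steal"}))
    print(keep_series("node_network_receive_bytes_total", {"device": "veth1234"}))
```

Because the match is anchored, a virtual interface such as veth1234 is dropped even though its name contains "eth", while eth0 and ens5 are kept.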

Why / impact

Measured against the live Grafana Cloud usage datasource, these remove ~9k active series from a ~57k baseline (~15%), roughly $70/mo at list rate.

Important context for the reviewer: this is a baseline trim, not the fix for the recent billing spike. Billing is the p95 of concurrently-active series, and the large invoice came from a one-time cardinality explosion (kura_http_request_duration_seconds_bucket picked up a high-cardinality label and hit ~180k series late May / early June, already fixed at the Kura source). The spike washing out of the p95 window is what brings the bill back to baseline; these changes lower the ongoing floor.
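
The p95 behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Grafana Cloud's billing implementation: it assumes one sample per hour over a 720-hour month and a nearest-rank percentile, and the spike durations are hypothetical. Only the ~57k baseline, the ~9k trim, and the ~180k spike come from the figures above.

```python
def p95_nearest_rank(samples):
    """Nearest-rank 95th percentile of a list of active-series counts."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers
    return ordered[rank - 1]


def month_of_samples(baseline, spike=0, spike_hours=0, hours=720):
    """Hourly samples: spike_hours at the spike level, the rest at baseline."""
    return [spike] * spike_hours + [baseline] * (hours - spike_hours)


BASELINE = 57_000            # from the description above
TRIMMED = BASELINE - 9_000   # baseline after this change
SPIKE = 180_000              # peak of the Kura cardinality explosion

# Hypothetical durations: 5% of 720 hours is 36 hours.
spike_48h = p95_nearest_rank(month_of_samples(BASELINE, SPIKE, 48))
spike_24h = p95_nearest_rank(month_of_samples(BASELINE, SPIKE, 24))
after_trim = p95_nearest_rank(month_of_samples(TRIMMED))

if __name__ == "__main__":
    print(spike_48h, spike_24h, after_trim)
```

Under these assumptions, a spike that lasts longer than 5% of the window sets the p95 by itself, and a shorter one does not register at all. Once the spike leaves the window, the p95 falls back to the baseline, which is the floor this change lowers.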

Deliberately not dropped (would break node/workload alerting): kube_node_status_condition (KubeNodeNotReady), kube_pod_owner/kube_pod_info (Grafana Kubernetes app workload rollups).

How to test locally

helm dependency build infra/helm/k8s-monitoring
helm lint infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-production.yaml
helm template k8s-monitoring infra/helm/k8s-monitoring \
  -n observability -f infra/helm/k8s-monitoring/values-production.yaml \
  | grep -A3 'node_cpu_seconds_total;(guest'  # rules render into the Alloy config