perf(infra): trim Grafana metrics baseline (go/cilium/node/runner)
GitHub issue · Closed
Continues the Grafana Cloud metrics cardinality cleanup (follows #11395, #11383, #11337). These rules trim the steady-state active-series baseline that survives the existing allow-lists. Every candidate drop was checked against live Grafana Cloud usage, so nothing on a dashboard or alert is removed.
What changed
All in the k8s-monitoring wrapper values (infra/helm/k8s-monitoring/values.yaml):
- annotationAutodiscovery excludeMetrics: drop go_.* (the server/processor run on the BEAM, not Go, so this is infra-pod noise only) and the Cilium agent-internal / BPF / datapath / per-endpoint-policy counters that no dashboard or alert reads.
- hostMetrics.linuxHosts: drop the unqueried node_cpu_seconds_total modes (guest|guest_nice|steal|irq|softirq; dashboards only read mode="idle") and virtual network interfaces (keep only en*/eth*, matching the dashboards’ device=~"e(n|th).*" filter). extraMetricProcessingRules runs after the allow-list keep, so it trims within the kept set.
- kube-state-metrics: drop kube_pod_status_phase for terminal pods (Succeeded/Failed). We page on container crash/waiting signals, not terminal phase, and no dashboard reads it.
- cAdvisor excludeNamespaces: [tuist-runners]: same rationale (and same namespace) as the existing kube_pod_* runner drop. The runner dashboard uses low-cardinality RunnerPool telemetry, not per-pod container metrics. (cAdvisor only sees the Kata pods’ sandbox network, so this is network-only series.)
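To make the intent of the drop rules above concrete, here is a minimal Python sketch of how they would classify a few sample series. It is not the rendered Alloy config. It assumes Prometheus-style relabel semantics (source label values joined with ";", regex anchored on both ends). The rule list, the `is_dropped` helper and the choice of `node_network_*` as the metric family carrying the `device` label are illustrative assumptions, not copied from values.yaml.

```python
import re

# Hypothetical rule set mirroring the bullets above. The real rules live in
# infra/helm/k8s-monitoring/values.yaml and may be phrased differently.
# Each rule is (source label names, regex). Label values are joined with ";"
# and the regex must match the whole joined string, as in Prometheus relabeling.
DROP_RULES = [
    (("__name__",), r"go_.*"),
    (("__name__", "mode"),
     r"node_cpu_seconds_total;(guest|guest_nice|steal|irq|softirq)"),
    (("__name__", "phase"), r"kube_pod_status_phase;(Succeeded|Failed)"),
]

# Keep filter for network devices: only en*/eth* survive (same regex as the
# dashboards' device filter quoted in the description).
NETWORK_DEVICE_KEEP = r"e(n|th).*"


def is_dropped(series: dict) -> bool:
    """Return True if the sample series would be dropped by the sketched rules."""
    for labels, pattern in DROP_RULES:
        joined = ";".join(series.get(label, "") for label in labels)
        if re.fullmatch(pattern, joined):
            return True
    # Assumption: the device filter applies to node_network_* series.
    name = series.get("__name__", "")
    if name.startswith("node_network_") and "device" in series:
        return re.fullmatch(NETWORK_DEVICE_KEEP, series["device"]) is None
    return False


samples = [
    {"__name__": "go_goroutines"},
    {"__name__": "node_cpu_seconds_total", "mode": "steal"},
    {"__name__": "node_cpu_seconds_total", "mode": "idle"},
    {"__name__": "kube_pod_status_phase", "phase": "Succeeded"},
    {"__name__": "kube_pod_status_phase", "phase": "Running"},
    {"__name__": "node_network_receive_bytes_total", "device": "veth1234"},
    {"__name__": "node_network_receive_bytes_total", "device": "eth0"},
]
for s in samples:
    print("drop" if is_dropped(s) else "keep", s)
```

Because the regex is anchored, `mode="idle"` and `phase="Running"` fall through and are kept, and a device such as `veth1234` does not match `e(n|th).*` even though it contains `eth`.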
Why / impact
Measured against the live Grafana Cloud usage datasource, these remove ~9k active series from a ~57k baseline (~15%), roughly $70/mo at list rate.
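The arithmetic behind those figures, as a quick sanity check. All three inputs are the rounded numbers quoted above. The per-1k-series rate is only what those numbers imply, not a quoted Grafana Cloud price.

```python
baseline_series = 57_000   # approximate baseline, from the text
removed_series = 9_000     # approximate series removed, from the text
monthly_saving_usd = 70    # approximate saving at list rate, from the text

share_removed = removed_series / baseline_series
implied_usd_per_1k = monthly_saving_usd / (removed_series / 1000)
remaining_series = baseline_series - removed_series

print(f"share removed: {share_removed:.1%}")                    # 15.8%
print(f"implied rate: ${implied_usd_per_1k:.2f} per 1k series")  # $7.78
print(f"remaining baseline: ~{remaining_series:,} series")      # ~48,000
```

With the rounded inputs the share comes out at about 15.8%, consistent with the "~15%" quoted above, and the ongoing floor lands near 48k series.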
Important context for the reviewer: this is a baseline trim, not the fix for the recent billing spike. Billing is the p95 of concurrently-active series, and the large invoice came from a one-time cardinality explosion (kura_http_request_duration_seconds_bucket picked up a high-cardinality label and hit ~180k series late May / early June, already fixed at the Kura source). The spike washing out of the p95 window is what brings the bill back to baseline; these changes lower the ongoing floor.
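A small sketch of why a p95-based bill behaves this way. The sampling model is an assumption made for illustration: hourly samples, a 30-day window, nearest-rank percentile, and hypothetical spike durations of 72 and 24 hours. Only the ~57k baseline, the ~180k spike and the ~48k trimmed floor come from the numbers above.

```python
def p95(samples):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


HOURS = 30 * 24  # hypothetical billing window: 30 days of hourly samples


def window(baseline, spike=0, spike_hours=0):
    """Build one window of samples: spike_hours at the spike level, rest at baseline."""
    return [spike] * spike_hours + [baseline] * (HOURS - spike_hours)


# A spike covering more than 5% of the window (36 of 720 hours) sets the p95.
long_spike = p95(window(57_000, spike=180_000, spike_hours=72))
# A shorter spike is ignored by the p95 entirely.
short_spike = p95(window(57_000, spike=180_000, spike_hours=24))
# Once the spike is out of the window, the bill tracks the baseline...
clean_month = p95(window(57_000))
# ...and the trim in this PR lowers that baseline.
trimmed_month = p95(window(48_000))

print(long_spike, short_spike, clean_month, trimmed_month)
# 180000 57000 57000 48000
```

Under these assumptions the spike dominates the bill only while it occupies more than 5% of the window. After that, the p95 equals the steady-state floor, which is the number this change reduces.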
Deliberately not dropped (would break node/workload alerting): kube_node_status_condition (KubeNodeNotReady), kube_pod_owner/kube_pod_info (Grafana Kubernetes app workload rollups).
How to test locally
helm dependency build infra/helm/k8s-monitoring
helm lint infra/helm/k8s-monitoring -f infra/helm/k8s-monitoring/values-production.yaml
helm template k8s-monitoring infra/helm/k8s-monitoring \
-n observability -f infra/helm/k8s-monitoring/values-production.yaml \
| grep -A3 'node_cpu_seconds_total;(guest' # rules render into the Alloy config