fix(infra): keep node-exporter alive under node DiskPressure

GitHub issue · Closed

Metadata
Source
tuist/tuist #11415
Updated
Jun 24, 2026
Domains
Atlas
Details

Draft — the prevention fix for the deploy-wedge hit during a production incident. Small + render-validated.

What

Sets priorityClassName: system-node-critical on the node-exporter DaemonSet (via k8s-monitoring.telemetryServices.node-exporter.priorityClassName).
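As a rough sketch, the override in the wrapper chart's values would look something like the following. The nesting is inferred directly from the dotted key path above; which values file carries it is an assumption.

```yaml
# Hypothetical placement: a values file of the infra/helm/k8s-monitoring wrapper chart.
# The top-level key is the subchart name, so everything below it is passed to k8s-monitoring.
k8s-monitoring:
  telemetryServices:
    node-exporter:
      # Built-in Kubernetes PriorityClass; no PriorityClass object needs to be created.
      priorityClassName: system-node-critical
```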

Why

node-exporter runs at default pod priority, so the kubelet evicts it under node-pressure (DiskPressure). Because node-exporter is a DaemonSet, the observability chart’s helm upgrade --wait blocks until every node has a Ready node-exporter Pod. So the failure chain is:

  1. One node hits DiskPressure.
  2. kubelet evicts that node’s (default-priority) node-exporter Pod, repeatedly.
  3. The k8s-monitoring-node-exporter DaemonSet is stuck at N-1/N available.
  4. The observability chart’s helm upgrade --wait times out.
  5. The server deploy job depends on the observability install → every production deploy is blocked on a single bad node.

Not hypothetical — it happened: one wedged bare-metal node took down the entire prod deploy pipeline until the node was manually removed.

system-node-critical pods are exempt from node-pressure eviction, so node-exporter keeps running — and keeps reporting metrics, which is exactly what you want when a node is unhealthy — instead of being the thing that gets reaped and wedges the pipeline. (It’s also what most monitoring charts default node-exporter to.)
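For context, system-node-critical is one of the two PriorityClasses that Kubernetes ships by default, so it already exists in the cluster. The sketch below shows only its approximate shape; fields not relevant here are omitted.

```yaml
# Built-in object, shown for illustration only; this change does not create or modify it.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: system-node-critical
# Highest built-in priority. Pods with no priorityClassName get priority 0
# unless the cluster defines a PriorityClass with globalDefault: true.
value: 2000001000
```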

Validation

The grafana/k8s-monitoring v4 chart buries node-exporter deep (telemetry-services → prometheus-node-exporter, aliased node-exporter), so the values key is non-obvious. Verified by rendering the actual wrapper chart:

$ helm template test infra/helm/k8s-monitoring -f values.yaml -f values-production.yaml | grep ...
found on node-exporter DS: priorityClassName: system-node-critical

  • Renders correctly against both staging and production values.
  • helm lint clean.
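For reference, the part of the rendered manifest that the grep checks would look roughly like this abbreviated, illustrative excerpt. The DaemonSet name is taken from the failure chain in the Why section; every other field the chart emits is omitted here, so this is not the actual rendered output.

```yaml
# Abbreviated sketch of the rendered node-exporter DaemonSet (most fields omitted).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: k8s-monitoring-node-exporter
spec:
  template:
    spec:
      # The field set by this change; it belongs on the Pod template spec,
      # not on the DaemonSet's own metadata or top-level spec.
      priorityClassName: system-node-critical
```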

Follow-up (not here)

  • kura PVC disk-usage alerting (the disk-bloat that triggered the original incident) — separate change.