fix(infra): keep node-exporter alive under node DiskPressure

Metadata

Source

tuist/tuist #11415

Updated

Jun 24, 2026

Domains

Atlas

Details

Draft. This is the prevention fix for the deploy wedge hit during a production incident. Small and render-validated.

What

Sets priorityClassName: system-node-critical on the node-exporter DaemonSet (via k8s-monitoring.telemetryServices.node-exporter.priorityClassName).
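
Written out as a values fragment, the key path above would look roughly like this. This is a minimal sketch: the nesting follows the dotted path given in this PR, but which values file it lives in is an assumption.

```yaml
# Wrapper chart values (exact file is an assumption, e.g. the chart's values.yaml)
k8s-monitoring:                  # top-level key from the path in this PR
  telemetryServices:
    node-exporter:               # chart alias for prometheus-node-exporter
      # Built-in Kubernetes priority class; no PriorityClass object needs to be created
      priorityClassName: system-node-critical
```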

Why

node-exporter runs at default pod priority, so the kubelet evicts it under node-pressure (DiskPressure). Because node-exporter is a DaemonSet, the observability chart’s helm upgrade --wait blocks until every node has a Ready node-exporter Pod. So the failure chain is:

1. One node hits DiskPressure.
2. The kubelet repeatedly evicts that node’s (default-priority) node-exporter Pod.
3. The k8s-monitoring-node-exporter DaemonSet is stuck at N-1/N available.
4. The observability chart’s helm upgrade --wait times out.
5. The server deploy job depends on the observability install, so every production deploy is blocked on a single bad node.

Not hypothetical — it happened: one wedged bare-metal node took down the entire prod deploy pipeline until the node was manually removed.

The kubelet does not evict system-node-critical pods under node pressure, so node-exporter keeps running and keeps reporting metrics, which is exactly what you want when a node is unhealthy, instead of being the thing that gets reaped and wedges the pipeline. (Many monitoring setups run node-exporter at this priority for the same reason.)

Validation

The grafana/k8s-monitoring v4 chart buries node-exporter deep (telemetry-services → prometheus-node-exporter, aliased node-exporter), so the values key is non-obvious. Verified by rendering the actual wrapper chart:

$ helm template test infra/helm/k8s-monitoring -f values.yaml -f values-production.yaml | grep ...
  found on node-exporter DS:  priorityClassName: system-node-critical

Renders correctly against both staging and production values.
helm lint clean.
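
For reviewers repeating the check, the field to look for sits in the DaemonSet's pod template. The shape below is an abbreviated illustration, not the actual render output: the name is the in-cluster DaemonSet name mentioned earlier (a `helm template test` render may carry a different release prefix), and all other fields are omitted.

```yaml
# Illustrative shape only; not copied from the real render
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: k8s-monitoring-node-exporter   # in-cluster name; rendered name may differ
spec:
  template:
    spec:
      # The line the grep above is expected to find
      priorityClassName: system-node-critical
```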

Follow-up (not here)

kura PVC disk-usage alerting (the disk-bloat that triggered the original incident) — separate change.
