fix(infra): keep node-exporter alive under node DiskPressure
GitHub issue · Closed
Draft: the preventive fix for the deploy wedge hit during a production incident. Small and render-validated.
What
Sets priorityClassName: system-node-critical on the node-exporter DaemonSet (via k8s-monitoring.telemetryServices.node-exporter.priorityClassName).
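As a sketch, the dotted key above expands to the following nesting in the wrapper chart's values. The nesting is inferred from the key path; which values file carries it is an assumption, not something stated here.

```yaml
# Assumed location: the wrapper chart's shared values file.
k8s-monitoring:
  telemetryServices:
    node-exporter:
      # Built-in Kubernetes PriorityClass; no extra PriorityClass object is needed.
      priorityClassName: system-node-critical
```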
Why
node-exporter runs at default pod priority, so the kubelet evicts it under node-pressure (DiskPressure). Because node-exporter is a DaemonSet, the observability chart’s helm upgrade --wait blocks until every node has a Ready node-exporter Pod. So the failure chain is:
- One node hits DiskPressure.
- kubelet evicts that node’s (default-priority) node-exporter Pod, repeatedly.
- The k8s-monitoring-node-exporter DaemonSet is stuck at N-1/N available.
- The observability chart’s helm --wait times out.
- The server deploy job depends on the observability install → every production deploy is blocked on a single bad node.
Not hypothetical — it happened: one wedged bare-metal node took down the entire prod deploy pipeline until the node was manually removed.
system-node-critical pods are exempt from node-pressure eviction, so node-exporter keeps running — and keeps reporting metrics, which is exactly what you want when a node is unhealthy — instead of being the thing that gets reaped and wedges the pipeline. (It’s also what most monitoring charts default node-exporter to.)
Validation
The grafana/k8s-monitoring v4 chart buries node-exporter deep (telemetry-services → prometheus-node-exporter, aliased node-exporter), so the values key is non-obvious. Verified by rendering the actual wrapper chart:
$ helm template test infra/helm/k8s-monitoring -f values.yaml -f values-production.yaml | grep ...
found on node-exporter DS: priorityClassName: system-node-critical
- Renders correctly against both staging and production values.
- helm lint clean.
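For orientation, a trimmed sketch of where the field should land in the rendered manifest. Only the DaemonSet name and the priorityClassName value come from this change; the rest is standard DaemonSet structure and is not copied from the chart's actual output.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: k8s-monitoring-node-exporter
spec:
  template:
    spec:
      # Set via k8s-monitoring.telemetryServices.node-exporter.priorityClassName
      priorityClassName: system-node-critical
      # selector, containers, volumes, etc. omitted for brevity
```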
Follow-up (not here)
- kura PVC disk-usage alerting (the disk-bloat that triggered the original incident) — separate change.