At some point this year my homelab acquired fourteen Grafana dashboards. Prometheus scraping everything, node_exporter on every box, a panel for disk IO I genuinely cannot remember the purpose of. It looked magnificent. The TV in the office cycled through CPU graphs like a tiny ops centre, and I felt very professional indeed.
Then the NAS filled up overnight and I found out at lunchtime, from the application falling over, not from any of the fourteen dashboards. Because of course I wasn't looking at them. Nobody looks at fourteen dashboards. You glance at the pretty one with the graphs that move, and the disk-usage panel that would have told you sat three tabs deep, unwatched, doing its job perfectly into the void.
The fix wasn't another dashboard. It was deleting most of them and writing two Prometheus alerting rules that Alertmanager actually pushes at me: disk over 85% full, and any target that stops responding for five minutes. A message to my phone is worth more than a wall of graphs, because the graph needs me to go and look and the alert comes to find me.
- alert: DiskFillingUp
  # Less than 15% of the filesystem available, i.e. more than 85% used.
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
  for: 10m
  labels:
    severity: warning
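The second rule, the one for targets that stop responding, might look something like this. Treat it as a sketch rather than a copy of the real thing: the alert name and the severity label are placeholders, and it leans on the `up` metric that Prometheus records for every scrape target.

```yaml
# Hypothetical rule: the name and severity are placeholders, not from the real config.
- alert: TargetDown
  # `up` is 0 whenever Prometheus fails to scrape a target.
  expr: up == 0
  # Only fire once the target has been unreachable for five minutes.
  for: 5m
  labels:
    severity: critical
```

The `for: 5m` is what keeps a single missed scrape or a quick reboot from buzzing my phone.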
Dashboards are for understanding a problem you already know you have. Alerts are for telling you that you have one. I'd built fourteen of the first and zero of the second, then wondered why I kept getting surprised.