Ramblings of an aging IT geek
homelab

i built fourteen dashboards and looked at none of them

How my homelab Grafana sprawled into fourteen pretty dashboards I never actually watched, and the one alert that did more than all of them combined.

A server rack with cabling, lit by status LEDs

At some point this year my homelab acquired fourteen Grafana dashboards. Prometheus scraping everything, node_exporter on every box, a panel for disk IO I genuinely cannot remember the purpose of. It looked magnificent. The TV in the office cycled through CPU graphs like a tiny ops centre, and I felt very professional indeed.

Then the NAS filled up overnight and I found out at lunchtime, from the application falling over, not from any of the fourteen dashboards. Because of course I wasn't looking at them. Nobody looks at fourteen dashboards. You glance at the pretty one with the graphs that move, while the disk-usage panel that would have told you sits three tabs deep, unwatched, doing its job perfectly into the void.

The fix wasn't another dashboard. It was deleting most of them and writing two Prometheus alerting rules, routed through Alertmanager, that actually push at me: disk over 85% full, and any target that stops responding for five minutes. A message to my phone is worth more than a wall of graphs, because the graph needs me to go and look and the alert comes to find me.

# Fires once a filesystem has had less than 15% free space for 10 minutes.
- alert: DiskFillingUp
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
  for: 10m
  labels:
    severity: warning
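
The second rule, the one for a target that stops responding, could be sketched along these lines. The `up` metric is the one Prometheus sets for every scrape target (1 when the scrape succeeds, 0 when it fails); the alert name and the severity label are assumptions for illustration, not taken from the original rules file.

```yaml
# Hypothetical sketch: fires when a scrape target has been unreachable for 5 minutes.
- alert: TargetDown        # assumed name
  expr: up == 0            # `up` is 0 when Prometheus fails to scrape the target
  for: 5m
  labels:
    severity: critical     # assumed severity
```

Like the disk rule, this entry would sit under a `rules:` list inside a rule group, and the `for:` clause keeps a single missed scrape from paging anyone.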

Dashboards are for understanding a problem you already know you have. Alerts are for telling you that you have one. I'd built fourteen of the first and zero of the second, then wondered why I kept getting surprised.