Ramblings of an aging IT geek
homelab

too many dashboards and not enough alerts

A homelab that ended up with a wall of beautiful Grafana dashboards nobody looked at, and the realisation that an alert beats a dashboard every time.

A server rack with blinking lights

At some point my homelab monitoring stopped being monitoring and became decoration. I had Prometheus scraping everything that would hold still, node_exporter on every box, cAdvisor on the Docker hosts, an SNMP exporter talking to the switch, and a Grafana instance with, I counted, fourteen dashboards. Temperatures, fan speeds, per-container memory, disk I/O latency, the lot. It looked magnificent on the spare monitor. It was also completely useless, and it took a failure to make me admit it.

The failure was a disk filling up. Not dramatically, just the slow creep of a logging volume nobody had set a retention policy on. And I had a dashboard for it! A lovely gauge, green fading to amber fading to red. The gauge had been sitting at amber for a fortnight. I knew this because, after the volume hit 100% and a service fell over, I went and looked at the dashboard and there it was, the whole sad story drawn out in a colour I had simply never been in the room to see.

A homelab dashboard wall

That's the thing about a dashboard. It only works if a human is looking at it, at the moment it matters. Mine were a pull model for attention, and my attention is not reliable. I am not sitting at the spare monitor at 2am when the disk crosses 90%. The dashboard was answering a question nobody was asking, beautifully.

So I changed the model. The dashboards stayed, because they're genuinely useful when you're already investigating something and want to see the shape of it. But I stopped pretending they were monitoring. The actual monitoring moved into Alertmanager, where the system pushes to me instead of waiting for me to pull. The rule for the disk that started all this is about as simple as it gets:

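- alert: DiskFillingUp
  # Fraction of the filesystem still available, per instance and mountpoint.
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
  # The condition must hold continuously for 30 minutes before the alert fires.
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.instance }} {{ $labels.mountpoint }} below 15% free"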

The for: 30m matters more than it looks. Without it I'd get paged every time a backup briefly spiked a volume, and an alert that cries wolf is one I'll mute, at which point I'm back to the dashboard nobody watches. With it, I only hear about a disk that's been genuinely low for half an hour, which is a real problem and not a blip.
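If you'd rather check that behaviour than take it on trust, promtool can replay synthetic series against a rule. What follows is a rough sketch, not something lifted from my lab: it assumes the DiskFillingUp rule lives in a file called disk.rules.yml (a made-up name) under the usual groups/rules wrapper, and the instance name "nas" and the /var/log mountpoint are invented for the test.

```yaml
# disk.rules.test.yml (hypothetical filename)
rule_files:
  - disk.rules.yml          # assumed to contain the DiskFillingUp rule

evaluation_interval: 1m

tests:
  - name: sustained low disk fires after 30m
    interval: 1m
    input_series:
      # 10 of 100 bytes free (10%) for 41 samples, i.e. minutes 0 to 40
      - series: 'node_filesystem_avail_bytes{instance="nas", mountpoint="/var/log"}'
        values: '10+0x40'
      - series: 'node_filesystem_size_bytes{instance="nas", mountpoint="/var/log"}'
        values: '100+0x40'
    alert_rule_test:
      # Ten minutes in, the alert is only pending, so nothing is firing yet.
      - eval_time: 10m
        alertname: DiskFillingUp
        exp_alerts: []
      # Past the 30m "for" window, the alert is firing.
      - eval_time: 35m
        alertname: DiskFillingUp
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: nas
              mountpoint: /var/log
            exp_annotations:
              summary: "nas /var/log below 15% free"

  - name: brief spike never fires
    interval: 1m
    input_series:
      # Low for the first ten minutes, then back to 50% free
      - series: 'node_filesystem_avail_bytes{instance="nas", mountpoint="/var/log"}'
        values: '10+0x10 50+0x30'
      - series: 'node_filesystem_size_bytes{instance="nas", mountpoint="/var/log"}'
        values: '100+0x40'
    alert_rule_test:
      - eval_time: 5m
        alertname: DiskFillingUp
        exp_alerts: []
      - eval_time: 35m
        alertname: DiskFillingUp
        exp_alerts: []
```

Run it with promtool test rules disk.rules.test.yml. The second test is the backup-spike case: the volume dips below 15% for ten minutes, recovers, and the alert never leaves pending.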

The discipline I'm trying to hold to now is that a metric earns a dashboard panel freely, but it only earns an alert if I can answer one question: what would I do at 2am if this fired? If the honest answer is "nothing, I'd look at it in the morning", it is not a 2am alert, maybe not an alert at all. Most of my fourteen dashboards' worth of metrics fail that test, and that's fine. They can be scenery. Scenery is allowed, as long as I stop mistaking it for a smoke detector.
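One way to encode that 2am test in Alertmanager is to route on severity and mute the non-urgent route overnight. This is a minimal sketch under assumptions, not my actual config: the "page" severity, the receiver names, the webhook URLs and the 22:00 to 07:00 window are all invented for illustration, and only "warning" appears in the rule above.

```yaml
# alertmanager.yml (sketch; names, URLs and hours are hypothetical)
route:
  receiver: morning            # default for anything not matched below
  group_by: [alertname, instance]
  routes:
    # "I would get out of bed for this" alerts go straight to the phone.
    - matchers:
        - severity = "page"
      receiver: phone
    # "I'd look at it in the morning" alerts are held back overnight.
    - matchers:
        - severity = "warning"
      receiver: morning
      mute_time_intervals:
        - overnight

time_intervals:
  - name: overnight
    time_intervals:
      # Times are interpreted as UTC unless a location is set.
      - times:
          - start_time: "22:00"
            end_time: "24:00"
          - start_time: "00:00"
            end_time: "07:00"

receivers:
  - name: phone
    webhook_configs:
      - url: http://localhost:8080/page      # hypothetical push gateway
  - name: morning
    webhook_configs:
      - url: http://localhost:8080/digest    # hypothetical low-priority sink
```

A muted alert is not dropped. If it is still firing when the overnight window ends, the notification goes out then, which is roughly what "I'd look at it in the morning" means in practice.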

The lab is quieter now, in the good sense. Fewer things demand that I look at them, and the few that page me have actually earned it. I still have too many dashboards. I've just stopped expecting them to do a job they were never able to do, which is to notice things while I'm asleep.