Ramblings of an aging IT geek
homelab

i had forty dashboards and no idea if anything was broken

How a homelab Prometheus and Grafana setup grew into dozens of dashboards nobody actually watched, and the deliberate trim back to a handful of alerts that mean something.

A rack with more monitoring than it strictly needs

At some point my homelab monitoring crossed over from "useful" into "hobby that monitors a hobby". I had Prometheus scraping everything that would hold still long enough to expose a /metrics endpoint, node_exporter on every box, cAdvisor for the containers, blackbox probes for the services, and Grafana on top with, at the high-water mark, something like forty dashboards.

It looked magnificent. Walls of graphs, every one of them green and twitching. And one weekend a service was down for the better part of a day before I noticed, because the dashboard that would have shown it was the thirty-first one and I never opened it. That is the whole problem with monitoring everything: the more you watch, the less any single thing is actually watched.

The realisation was simple. A dashboard is something you go and look at. An alert is something that comes and finds you. I had built a vast amount of the former and almost none of the latter, so my monitoring only worked when I was already worried enough to go looking, which is precisely when you least need it.

A grid of dashboards nobody is actually looking at

So I did the unglamorous work of deciding what failure actually looks like, and wrote alerts for that instead. The list is short on purpose:

  • a service stops responding to its health check for more than a couple of minutes
  • a disk is on track to fill within a day
  • a host stops reporting metrics at all
  • the backup job did not run, or ran and failed

That is mostly it. Each one is a thing that, left alone, ruins a weekend. Each one pages me through Alertmanager into a single channel, with enough context in the message to know whether to get up or finish my tea first.
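The "single channel" part is mostly an Alertmanager routing decision. As a minimal sketch, assuming a chat webhook as the destination: the receiver name, the URL, and the timing values below are placeholders for illustration, not a recommendation.

```yaml
# alertmanager.yml (sketch): everything goes to one receiver, no routing tree.
route:
  receiver: homelab-channel            # hypothetical receiver name
  group_by: ['alertname', 'instance']  # one notification per alert per host
  group_wait: 30s                      # wait briefly so related alerts arrive together
  group_interval: 5m
  repeat_interval: 4h                  # re-notify if it is still firing

receivers:
  - name: homelab-channel
    webhook_configs:
      - url: 'http://chat-bridge.example.internal/alerts'  # placeholder URL
        send_resolved: true            # also say when the problem clears
```

The context in the message comes from the labels and annotations on each alerting rule, so it is worth writing the summary text for a person who has just been woken up.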

A rule like this is the entire point, expressed in a handful of lines:

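# Fires when a scrape target has been failing its scrapes for two minutes.
- alert: HostDown
  expr: up == 0          # `up` is recorded by Prometheus itself for every target
  for: 2m                # ride out a single missed scrape or a quick reboot
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.instance }} has stopped reporting"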
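The other items on the list can be sketched in the same shape. This is a rough sketch under assumptions. It takes probe_success from the blackbox probes and node_filesystem_avail_bytes from node_exporter. It also assumes the backup script publishes a timestamp when it succeeds, under the made-up metric name backup_last_success_timestamp_seconds (for example through the node_exporter textfile collector). The group name, thresholds, lookback window, and filesystem filter are all guesses to tune.

```yaml
groups:
  - name: homelab-sketch   # hypothetical group name
    rules:
      # Health check failing for a couple of minutes (blackbox probe result).
      - alert: ServiceDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} is failing its health check"

      # Extrapolate the last 6h of free space 24h ahead; below zero means full.
      - alert: DiskFillingWithinADay
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is on track to fill within a day"

      # The timestamp only moves on success, so "did not run" and
      # "ran and failed" both show up as a stale value.
      - alert: BackupStale
        expr: time() - backup_last_success_timestamp_seconds > 26 * 3600
        labels:
          severity: page
        annotations:
          summary: "No successful backup in the last 26 hours"

      # Covers the case where the (assumed) metric was never published at all.
      - alert: BackupMetricMissing
        expr: absent(backup_last_success_timestamp_seconds)
        for: 1h
        labels:
          severity: page
        annotations:
          summary: "Backup success metric is missing"
```

The 26-hour threshold assumes a nightly backup with a couple of hours of slack; a different schedule needs a different number.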

The dashboards did not all go away, and they should not. Graphs are how you understand a problem once you know you have one, which is exactly when you do open the thirty-first dashboard, with a reason. What changed is that I no longer rely on a human happening to glance at the right panel at the right moment. The system tells me when something is wrong, and the dashboards are there to help me work out why.

I trimmed from forty dashboards to a handful I actually use and about half a dozen alerts that actually mean something. The lab feels less impressive and considerably more reliable, which turns out to be the trade I wanted all along. Monitoring is not about how much you can see. It is about how little has to go wrong before you find out.