Ramblings of an aging IT geek
homelab

i built fourteen dashboards and look at none of them

A confession about over-instrumenting a homelab, and the handful of panels that actually earn their place on the wall.


At some point this summer I realised I had fourteen Grafana dashboards and looked at exactly none of them. Not occasionally. None. They existed because building dashboards is fun and watching them is not, and nobody warns you that the ratio is so lopsided.

It started reasonably. Prometheus scraping node-exporter, a tidy panel for CPU and memory per host, disk usage with a sensible threshold. That earns its keep. Then I added cAdvisor for the containers, then a dashboard for the NAS, then one for the Pi-hole, then one for "network" that was mostly an excuse to draw pretty graphs of throughput I never act on. Each new service came with a community dashboard you import by pasting an ID, and importing is so frictionless that I imported things for daemons I'd forgotten I was running.


The problem isn't the data. The data is great. The problem is that a wall of forty panels conveys nothing. When everything is on screen, your eye has nowhere to land, and a graph that's always green trains you to stop looking at it. I'd built an elaborate apparatus for ignoring my own infrastructure.

So I did the unglamorous thing and asked, per dashboard: when did this last change my behaviour? Not "when did I look at it", which flatters everything, but when did looking at it cause me to do something differently. If the honest answer was "never", it went. That cleared eleven of them, and deleting a dashboard you spent an evening building is a small grief I recommend getting over quickly.
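
That audit rule can be sketched in a few lines of Python. This is an illustration only: the `Dashboard` class, its field name, and the sample entries are made up for the example, and the only thing taken from the post is the rule itself, that "never" means it goes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dashboard:
    name: str
    # Last time looking at this dashboard led to a different action.
    # None stands for the honest answer "never".
    last_changed_behaviour: Optional[date]


def audit(dashboards: list[Dashboard]) -> tuple[list[Dashboard], list[Dashboard]]:
    """Split dashboards into (keep, delete) using the 'never means it goes' rule."""
    keep = [d for d in dashboards if d.last_changed_behaviour is not None]
    delete = [d for d in dashboards if d.last_changed_behaviour is None]
    return keep, delete


if __name__ == "__main__":
    # Placeholder entries, not a real inventory.
    sample = [
        Dashboard("overview", date(2024, 1, 1)),
        Dashboard("network-throughput", None),
    ]
    keep, delete = audit(sample)
    print("keep:", [d.name for d in keep])
    print("delete:", [d.name for d in delete])
```

The field deliberately records a change in behaviour rather than a last-viewed timestamp, since "when did I look at it" is the question that flatters everything.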

What survived was small and a bit boring. One overview with host up/down, disk headroom, and memory pressure, the three things that have ever actually woken me up. One panel for the UPS, because losing power without warning is the one failure I cannot debug after the fact. And crucially, a single Prometheus alerting rule, delivered through Alertmanager, that pages me when disk usage crosses 85 percent, so I don't need to be looking at all. That last bit is the real lesson: a good alert is worth a dozen dashboards, because it inverts the relationship. The system watches itself, and I only show up when there's something to do.
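
The logic behind a disk alert like that is just a threshold comparison. The sketch below is a local Python stand-in, not the Prometheus rule itself: the function names are hypothetical, it reads the local filesystem through `shutil.disk_usage` rather than node-exporter metrics, and it assumes "crosses 85 percent" means strictly greater than 85.

```python
import shutil

# Threshold from the post: page once disk usage crosses 85 percent.
DISK_ALERT_THRESHOLD_PERCENT = 85.0


def disk_used_percent(total_bytes: int, free_bytes: int) -> float:
    """Return the used share of a filesystem as a percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def should_page(used_percent: float,
                threshold: float = DISK_ALERT_THRESHOLD_PERCENT) -> bool:
    """True only once usage is above the threshold; silent otherwise."""
    return used_percent > threshold


def check_mount(path: str = "/") -> bool:
    """Check one mount point on the local machine (hypothetical helper)."""
    usage = shutil.disk_usage(path)
    return should_page(disk_used_percent(usage.total, usage.free))


if __name__ == "__main__":
    # Prints nothing when the disk is fine, which is the point of a push model.
    if check_mount("/"):
        print("disk usage above threshold: time to page someone")
```

Run from a scheduler, a check like this stays quiet until there is something to act on. A real setup would leave the evaluation to the monitoring stack rather than a script on each host.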

There's a mindset shift hiding in there. A dashboard is a pull: it asks you to remember to look, and humans are dreadful at remembering to look at things that are usually fine. An alert is a push: it asks nothing of you until there's something to act on. I'd quietly built a whole system on the worse of the two models and then blamed myself for not watching it. The fix wasn't discipline, it was deleting the thing that required discipline.

The dashboards I keep now, I keep for investigation, not for surveillance. When the alert fires, I open the graph to find out why. The rest of the time the screen is off, which turns out to be the healthiest state a monitoring stack can be in.