Ramblings of an aging IT geek
homelab

i built nineteen dashboards and looked at two

A confession about over-instrumenting a homelab with Prometheus and Grafana, and pruning back to the handful of panels that actually answer questions.

At some point my homelab acquired more dashboards than services. I'd stood it all up to be sensible, you understand: Prometheus scraping everything, node-exporter on every box, Grafana with a folder structure, alerting rules, the lot. The kind of observability setup you'd put on a slide. And then one evening I counted, and there were nineteen dashboards, and I realised I looked at precisely two of them.

The other seventeen were built in moments of enthusiasm. A new service goes in, I import the community dashboard for it, admire the wall of green, and never open it again. Some of them are genuinely beautiful. There is a Grafana dashboard for my UPS that shows battery temperature trends over ninety days. I have never once needed to know the ninety-day temperature trend of my UPS battery. I built it because the panel existed and importing it was free.

The two I actually use tell me different things and that's why they survive. The first is a single overview: is everything up, is anything on fire, what's the disk situation. Four panels, no scrolling, the thing I glance at with coffee. The second is the one I open when something is already wrong, dense and ugly, full of the per-service detail I need to actually diagnose. One answers "is it fine?", the other answers "what broke?". Everything in between turned out to be decoration.

The mistake I'd made is a common one, and it's seductive because instrumenting is satisfying in a way that staring at instruments isn't. A dashboard you build feels like progress. A metric you collect feels like diligence. But a dashboard nobody reads isn't observability, it's a screensaver, and the cost is real: every panel is a query, every query is load on Prometheus whenever the page is open, and every dashboard is a thing that breaks silently when a metric name changes and that you then have to maintain or delete.
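The "breaks silently" part is the one that can be checked mechanically. Here's a rough sketch, not what I actually ran: it walks an exported Grafana dashboard (the JSON with `panels`, `targets` and `expr` fields) and flags any panel whose query mentions a metric name that isn't in a known set. The function names and the regex-based name extraction are my own invention for illustration. It's a heuristic, not a PromQL parser, so expect the odd false positive.

```python
import re

# PromQL keywords that can appear without parentheses and are not metrics.
# Deliberately small; this is a heuristic, not a real PromQL parser.
KEYWORDS = {"and", "or", "unless", "offset", "bool",
            "group_left", "group_right", "inf", "nan"}

def iter_exprs(dashboard):
    """Yield (panel title, expr) pairs, descending into row panels."""
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr")
            if expr:
                yield panel.get("title", "?"), expr
        # Collapsed rows keep their children under their own "panels" key.
        yield from iter_exprs(panel)

def metric_names(expr):
    """Best-effort guess at the metric names used in a PromQL expression."""
    expr = re.sub(r'"[^"]*"', "", expr)      # string literals
    expr = re.sub(r"\{[^}]*\}", "", expr)    # label matchers
    expr = re.sub(r"\[[^\]]*\]", "", expr)   # range selectors like [5m]
    # Label lists after by/without/on/ignoring/group_* are not metrics.
    expr = re.sub(
        r"\b(by|without|on|ignoring|group_left|group_right)\s*\([^)]*\)",
        "", expr)
    names = set()
    pattern = r"(?<![\w$:])([a-zA-Z_:][a-zA-Z0-9_:]*)(\s*\()?"
    for match in re.finditer(pattern, expr):
        name, is_call = match.group(1), match.group(2)
        if is_call or name in KEYWORDS:
            continue  # function/aggregator call or keyword, not a metric
        names.add(name)
    return names

def find_stale(dashboard, known):
    """Return (panel title, metric) pairs whose metric is not in `known`."""
    stale = []
    for title, expr in iter_exprs(dashboard):
        for name in sorted(metric_names(expr) - known):
            stale.append((title, name))
    return stale
```

In practice the `known` set could be filled from Prometheus's `/api/v1/label/__name__/values` endpoint, which lists every metric name it currently holds. I've left the HTTP call out so the sketch runs without a server.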

So I did a cull. The test I used was blunt: when did I last open this, and what question was I trying to answer? If I couldn't name the question, the dashboard went. Nineteen became four. The alerting got the same treatment, because an alert that fires and gets ignored is worse than no alert at all: it just trains you to ignore the next one. I kept the alerts that would actually get me out of my chair: a disk filling, a service down, the backup job failing. The rest were noise dressed up as vigilance.
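If you'd rather write the test down than trust your memory, it fits in a few lines. This is only a sketch of the rule as I described it. The `Dashboard` record, the sample names and the 90-day idle cutoff are all assumptions for illustration; the real cull was me, a notepad, and no fixed number of days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Dashboard:
    name: str
    last_opened: Optional[date]  # None means "I genuinely can't remember"
    question: str                # the question it answers, "" if none

def should_keep(d: Dashboard, today: date, max_idle_days: int = 90) -> bool:
    """Keep a dashboard only if it answers a named question and was
    opened within max_idle_days (an assumed cutoff, tune to taste)."""
    if not d.question.strip():
        return False
    if d.last_opened is None:
        return False
    return (today - d.last_opened) <= timedelta(days=max_idle_days)

def cull(dashboards, today, max_idle_days=90):
    """Split dashboards into (keep, drop) lists, preserving order."""
    keep, drop = [], []
    for d in dashboards:
        (keep if should_keep(d, today, max_idle_days) else drop).append(d)
    return keep, drop

# Hypothetical inventory, not my real list.
inventory = [
    Dashboard("overview", date(2023, 12, 31), "is it fine?"),
    Dashboard("ups-battery", None, ""),
    Dashboard("old-diagnostics", date(2023, 1, 1), "what broke?"),
]
keep, drop = cull(inventory, today=date(2024, 1, 1))
print([d.name for d in keep])
print([d.name for d in drop])
```

The question check comes first on purpose: a dashboard opened yesterday out of habit still goes if I can't say what it was for.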

What's left is smaller and I trust it more. The overview dashboard is honest because it's small enough to take in at a glance. The diagnostic one is there when I need it and ignored when I don't, which is exactly right. The metrics still get collected, since storing them in Prometheus is cheap, so the data's there if I ever do need that UPS temperature trend. I just stopped pretending I was going to look at it.

The lesson, if there is one, is that observability is about the questions you ask, not the data you hoard. Collect broadly, by all means. But build a dashboard only when you have a question it answers, and delete it the moment you stop asking. I'd rather have two dashboards I read than nineteen I don't.