Ramblings of an aging IT geek
homelab

every exporter i installed and the one metric i should have started with

A homelab monitoring stack that grew exporters and dashboards faster than understanding, and the single number that would have mattered more.

A rack of servers with status lights

My homelab monitoring grew the way these things do: one exporter at a time, each one perfectly reasonable on its own, until I had a Prometheus scraping fourteen targets and a Grafana with more dashboards than I had services. node-exporter, cadvisor, smartctl-exporter, an SNMP exporter for the switch, blackbox for uptime checks, a ZFS one, a UPS one. Every single one earned its place at the moment I added it. Together they added up to a monitoring stack I no longer understood.

The symptom of too many dashboards is not that you have too many dashboards. It is that when something breaks, you do not know which one to open. You sit there at the index page, scrolling through a folder of beautifully titled panels, trying to remember which one shows the thing that is actually wrong. That hesitation, that "where do I even look" moment, is the real cost. Monitoring is supposed to shorten the distance between "something is wrong" and "ah, that". Mine had lengthened it.

Grafana panels on a homelab monitor

The reframe that helped was thinking about symptoms versus causes. Most of my dashboards were cause dashboards. CPU per container, IO wait per disk, packets per interface. Useful once you know roughly where to dig, useless as a starting point, because they assume you already know which subsystem is at fault. What I was missing was a symptom view: a single screen that answers "is anything actually broken from where I sit", phrased in terms of the things I care about rather than the components they run on.

So I built one overview dashboard with about eight panels, all of them user-facing or close to it. Is each service responding to a blackbox probe? Is any disk above 85%? Is any host I expect up actually down? Did last night's backup finish? Is the UPS on mains? Nothing per-container, nothing per-interface, nothing that requires me to already know the answer. If everything on that one screen is green, I close the laptop. If something is red, that is the only moment I go digging into the cause dashboards, which I kept but tucked away.
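For anyone who wants the same questions outside Grafana, here is a rough Python sketch of the idea, not what my dashboard actually runs. Each check is a PromQL expression written so that it returns series only when something is wrong, so an empty result means green. probe_success, up and the node_filesystem_* metrics are the usual blackbox and node-exporter names; backup_last_success_timestamp_seconds, ups_on_battery, the 26 hour threshold and the localhost URL in the usage note are made-up placeholders that depend entirely on your own exporters and setup.

```python
import json
import urllib.parse
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str  # what I would say out loud if this went red
    expr: str  # PromQL that returns series only when something is wrong


CHECKS = [
    Check("service not answering probe", "probe_success == 0"),
    Check(
        "disk above 85%",
        "(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.85",
    ),
    Check("expected host down", "up == 0"),
    # Hypothetical metric names below: substitute whatever your backup job
    # and UPS exporter actually expose.
    Check(
        "backup older than 26h",
        "time() - backup_last_success_timestamp_seconds > 26 * 3600",
    ),
    Check("UPS on battery", "ups_on_battery == 1"),
]


def failing_series(api_response):
    """Return the label sets of every series in an instant-query response."""
    if api_response.get("status") != "success":
        raise ValueError(f"query failed: {api_response.get('error', 'unknown')}")
    return [series["metric"] for series in api_response["data"]["result"]]


def query(base_url, expr):
    """Run one instant query against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": expr})
    url = f"{base_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def overview(fetch):
    """Run every check through fetch(expr); return only the red ones."""
    red = {}
    for check in CHECKS:
        bad = failing_series(fetch(check.expr))
        if bad:
            red[check.name] = bad
    return red
```

Against a real server you would call something like overview(lambda expr: query("http://localhost:9090", expr)). An empty dict is the "close the laptop" case; anything else names the symptom and the labels of whatever is red, which is exactly the hint you need to pick a cause dashboard.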

The deeper point is that I had been measuring everything I could measure instead of the few things I would actually act on. The exporters were not the mistake. Collecting the data is cheap, and occasionally a buried metric saves you. The mistake was letting collection masquerade as monitoring, and confusing a wall of graphs with the ability to answer one simple question quickly. These days I add an exporter when I have a question, not because it exists, and I keep exactly one dashboard I am willing to be woken by.