i instrumented everything and then drowned in dashboards

A wall of monitoring graphs on a screen

It started, as these things do, with one genuinely useful dashboard. Prometheus scraping node-exporter, a Grafana panel showing CPU and memory across the homelab, and the quiet satisfaction of a graph that turned red exactly once before a disk filled and I was glad it had. From there it was a slippery slope made entirely of good intentions.

Within a few months I had a dashboard for every service. One for the NAS, one for the reverse proxy, one for each container's resource usage, one for the network gear via SNMP, one for the UPS, one for the home temperature sensors because why not. Grafana's folder list scrolled. I felt observably observant. And then I noticed something uncomfortable: I never actually looked at any of them unless something was already broken, and when something broke I didn't know which of the forty to open.

the dashboard you check vs the dashboard you keep

The distinction that finally helped was separating the dashboard I check from the data I keep. Keeping the metrics is cheap and worth doing, Prometheus will happily hoard them, and when I'm debugging at 11pm it's lovely to have history rather than wishing I'd been recording. But a dashboard is a deliberate, curated view that's meant to answer "is everything alright" at a glance, and you can only really have one or two of those before the glance stops working.

So I built one overview. Is anything down. Is anything about to run out of disk. Is anything pegged. Four or five panels, traffic-light obvious, the thing I actually pin as my home page. Everything else got demoted to "exists, queryable, not on the wall."

A homelab corner with cabling and a small rack

let the alerts do the watching

The other half of the cure was admitting that staring at graphs is not monitoring. Alerting is. A dashboard you have to remember to look at will be looked at precisely when you least have time. So I moved the actual watching into Alertmanager: disk above 85%, a service's up metric at zero for more than a couple of minutes, the UPS reporting it's on battery. A handful of rules that ping my phone, written so they fire on conditions I genuinely want woken up for and almost nothing else.

- alert: DiskFillingUp
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.instance }} {{ $labels.mountpoint }} below 15% free"

The result is calmer than the forty-dashboard era ever was. The graphs are still there when I need to go spelunking through history, which is the right time to look at a graph. The rest of the time, nothing is asking for my attention, which is exactly what good monitoring should feel like: silence, until it shouldn't be silent. Turns out the goal was never more dashboards. It was needing to look at fewer of them.