At some point I counted the dashboards in my Grafana and there were forty. Forty. For a homelab that, on a busy day, serves DNS and streams a film. I had a dashboard for ZFS pool health, a dashboard for per-container CPU, a dashboard for the UPS, a dashboard for SMART attributes on disks I had already replaced. I had imported community dashboards by the dozen, each one a beautiful wall of panels for some piece of software I ran, and I opened maybe three of them with any regularity.
The trap is that adding a dashboard feels like progress. You wire up an exporter, you import a JSON file with four thousand lines in it, and suddenly there are graphs. Graphs feel like control. But a dashboard you never look at is not monitoring. It is decoration that happens to cost you scrape budget and disk.
The thing that broke the spell was an actual incident. My NAS filled up and I did not notice for two days, because the disk-usage panel lived on a dashboard I had not opened since I made it. I had the data. Prometheus had been faithfully recording the climb the whole time. I just was not looking, because no human looks at forty dashboards, and nothing told me to.
So I did the obvious thing I should have done at the start. I stopped trying to watch and started getting told. The question is not "what could I look at" but "what would I want to be woken by", and that is a much shorter list. Disk above 85%. A pool degraded. A host I expect to be up being down. A backup job that did not finish. That is most of it.
In practice that meant leaning on Alertmanager instead of my own attention span. A handful of rules, routed to a Telegram chat, with sensible thresholds and a for: clause so a brief blip does not page me:
    - alert: DiskFillingUp
      expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.instance }} {{ $labels.mountpoint }} below 15% free"
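The rest of that short list can be written in the same shape. What follows is a sketch under assumptions, not the exact config behind this post: the group name, the job label "node", and the metric backup_last_success_timestamp_seconds are all hypothetical. The backup metric in particular only exists if your backup script writes it out itself, for example through node_exporter's textfile collector. A pool-health rule is left out because the metric name depends on which exporter and version you run.

```yaml
# rules/homelab.yml (hypothetical file name)
groups:
  - name: homelab-wake-me-up
    rules:
      # A host I expect to be up is down. `up` is recorded by Prometheus
      # for every scrape target; the job name "node" is an assumption.
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has not answered a scrape for 5 minutes"

      # A backup job that did not finish. Assumes the backup script writes
      # a Unix timestamp to this (hypothetical) metric when it succeeds.
      - alert: BackupStale
        expr: time() - backup_last_success_timestamp_seconds > 36 * 3600
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} last successful backup is over 36 hours old"
```

Routing to Telegram might look something like this in alertmanager.yml. The token and chat ID are placeholders, and the grouping and repeat values are only a starting point:

```yaml
route:
  receiver: telegram
  group_by: [alertname, instance]
  group_wait: 30s
  repeat_interval: 12h

receivers:
  - name: telegram
    telegram_configs:
      - bot_token: "REPLACE_WITH_BOT_TOKEN"   # placeholder
        chat_id: -1001234567890               # placeholder chat ID
        send_resolved: true
```

The 36-hour window assumes a nightly backup and leaves slack for a run that finishes late. Both files can be checked before reloading with promtool check rules and amtool check-config.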
Then I culled the dashboards. Not deleted, just demoted. I kept two that I actually use: one overview with the handful of numbers I care about across every host, and one "something is wrong, where" drill-down that I only open when an alert has already told me to. Everything else got moved into a folder called attic so it is there if I ever need it, and out of my face if I do not.
The lesson, which I keep relearning in different costumes, is that observability is not about how much you can see. It is about whether the right thing reaches you when you are not looking. A good monitoring setup spends most of its life being silent, and then says exactly one useful sentence at the worst possible moment. Forty dashboards said nothing, very prettily, for two days, whilst my disk quietly filled.