i built a wall of dashboards and stopped looking at any of them

A rack of homelab servers with blinking lights

At some point my monitoring became a hobby of its own, separate from the things it was meant to be monitoring. I had Prometheus scraping everything, Grafana drawing it all, and roughly a dozen dashboards covering hosts, containers, the network, disk, the UPS, temperatures, even a panel for the power draw of the whole rack. It looked magnificent. It was, in practical terms, useless, because a wall of graphs nobody reads is just expensive wallpaper.

The moment I noticed was when something actually broke. A disk was filling up, slowly, over about a week. I had a dashboard for that. I had a beautiful gradient-coloured gauge for that. And I sailed right past it, every day, because when everything is on screen nothing stands out. The graph was green-ish, the way it always was, until it wasn't and a service fell over. I'd built a system that could show me the problem but would never tell me about it.

A homelab in a cupboard, cables everywhere

So I did the unglamorous thing and started deleting. The rule I settled on: a dashboard exists to answer a question I'd actually ask, and an alert exists to interrupt me when I'd want interrupting. Everything else is a graph I can build on demand when I'm investigating, not something that needs a permanent home on a screen.

The dashboards collapsed down to two. One overview that shows me, at a glance, whether anything is on fire: a handful of red-or-green status panels and nothing with a gradient. And one detail dashboard with template variables, so I pick a host or a service from a dropdown and drill in only when I'm chasing something specific. The other ten went in the bin, and I have not missed a single one.

The real work, though, was moving the intelligence from the dashboards into alerting rules. The disk-filling-up problem became an alert in Alertmanager that fires on the trend, not the absolute number:

- alert: DiskWillFillSoon
  expr: predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.instance }} {{ $labels.mountpoint }} will fill within 24h"

predict_linear is the bit I wish I'd used a year earlier. It takes the recent slope and projects it forward, so instead of paging me when the disk is already at 95 per cent it tells me a day ahead that the trend lands at zero. That's the difference between a problem I fix on Tuesday afternoon with a cup of tea and a problem that wakes me on Saturday night.

The principle that fell out of all this is one I keep relearning: monitoring is for answering questions, alerting is for the questions you didn't know to ask. Dashboards are pull, you go and look. Alerts are push, they come and find you. I'd built a magnificent pull system and almost no push, then wondered why I kept getting surprised. Now there's a lot less on screen and a lot more that taps me on the shoulder, and the rack runs quieter in my head as a result.