Too Many Dashboards, and the Slow Walk Back

At some point last year I counted the dashboards in my Grafana instance and got to thirty-four. Thirty-four. For a homelab that is, on a busy day, three people watching films and a handful of containers shuffling backups around. I had a dashboard for the dashboards' own scrape latency. I am not proud of this, but I suspect I'm not alone, so here is the honest account of how it happened and how I clawed it back to something I actually use.

How it got out of hand

The pattern was always the same. I'd add a new service, the service had a Prometheus exporter, the exporter shipped with a community dashboard, and importing that dashboard was one click. So I clicked. Each one looked impressive on the day, full of gauges and heat maps and panels labelled things like "p99 GC pause" that I could not have acted on if my life depended on it.

The trouble with importing dashboards is that they're built to demo well, not to be lived with. A vendor wants their exporter to look comprehensive, so the default dashboard shows everything the exporter can possibly emit. That's the opposite of what you want at 2am when something's broken and you need the one number that tells you whether it's the disk, the network, or the application.

So I had thirty-four beautiful dashboards and, when something actually went wrong, I'd still end up SSHing in and running htop like it was 2009.

The question that fixed it

The thing that turned it around wasn't a tool. It was a question I started asking before adding any panel: what would I do differently if this number changed?

If the answer was "nothing", the panel went. Not hidden, deleted. It's astonishing how much of a typical dashboard fails that test. Per-core CPU breakdowns are fascinating and almost never actionable on a homelab. Network packets-per-second is a lovely sawtooth and tells me nothing I'd act on. Cache hit ratios on a service that has never once been slow are just decoration.

What survived was a much smaller set. I now keep three dashboards, and I mean keep, as in look at:

Overview. Is everything up, is anything on fire, are the disks filling. One screen, big numbers, red when bad. This is the only one I have on a wall display.
Per-host. CPU, memory, disk, and the single most important per-host detail, which is temperature, because this gear lives in a cupboard and summer is a real threat.
The annoying one. Whatever's currently misbehaving gets a focused dashboard while I'm debugging it, and that dashboard gets deleted when the problem's solved. It's scaffolding, not furniture.

Alerts do the watching now

The deeper realisation was that a dashboard is a terrible primary signal. A dashboard only works if a human is looking at it, and I am usually not looking at it, because I have a life and the entire point of automating this stuff was to get that life back.

So the real monitoring moved into alerting rules. The dashboards are now for context once an alert has already told me something's wrong. A representative rule, in Prometheus' alerting syntax, looks like this:

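groups:
  - name: homelab
    rules:
      - alert: DiskFillingUp
        # Fit a line to the last 6h of free space on /data and extrapolate
        # 7 days ahead; a negative result means the disk is projected to be full.
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/data"}[6h], 7*24*3600) < 0
        # The projection has to stay negative for 30 minutes before the alert fires.
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} /data will fill within a week"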

predict_linear is the bit I wish I'd found years earlier. A fixed "90% full" alert is too early on a big array that fills slowly and too late on a disk that fills fast. predict_linear instead fits a straight line to the last six hours of free space and warns me if that trend reaches zero within a week, and the for: 30m clause means the projection has to stay bad for half an hour before anything fires. That gives me a calm Saturday to deal with it rather than a panicked Tuesday. The alert lands in a Telegram channel, and that's it. No dashboard required to find out something's wrong. The dashboard is only for working out why.
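
If the line-fitting part feels like magic, here is a rough Python sketch of the idea: an ordinary least-squares line through (timestamp, value) samples, extrapolated past the evaluation time. It is my own illustration, not Prometheus's code. The helper made_up_samples and every number in it are hypothetical, and free space is in GiB rather than bytes only to keep the figures readable.

```python
from typing import List, Sequence, Tuple

WEEK_S = 7 * 24 * 3600  # same horizon as the rule above, in seconds


def predict_linear(samples: Sequence[Tuple[float, float]],
                   eval_time: float, horizon_s: float) -> float:
    """Fit a least-squares line to (timestamp, value) samples and return
    the fitted value horizon_s seconds after eval_time."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    # Measure time relative to eval_time so the intercept is "value now".
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("samples must span more than one timestamp")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x               # units per second
    intercept = mean_y - slope * mean_x  # fitted value at eval_time
    return intercept + slope * horizon_s


def made_up_samples(start_gib: float, loss_gib_per_hour: float) -> List[Tuple[float, float]]:
    """Hypothetical data: six hours of free space, one sample every 5 minutes."""
    return [(float(t), start_gib - loss_gib_per_hour * t / 3600)
            for t in range(0, 6 * 3600 + 1, 300)]


# Two invented disks that both start with 100 GiB free.
fast = predict_linear(made_up_samples(100.0, 1.0), eval_time=6 * 3600, horizon_s=WEEK_S)
slow = predict_linear(made_up_samples(100.0, 0.1), eval_time=6 * 3600, horizon_s=WEEK_S)
print(f"fast leak: {fast:.1f} GiB projected, alert condition met: {fast < 0}")
print(f"slow leak: {slow:.1f} GiB projected, alert condition met: {slow < 0}")
```

With these invented numbers, the disk losing 1 GiB an hour projects to a negative value a week out, so the "< 0" condition is met. The one losing 0.1 GiB an hour still has plenty left and stays quiet. A fixed percentage threshold can't tell those two apart, and telling them apart is the whole reason I use the trend.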

What I'd tell past me

The mistake wasn't collecting too much data. Storage is cheap and Prometheus is happy to hold months of it, and having the history when you do need to investigate is genuinely valuable. The mistake was confusing collecting data with monitoring. I'd built a museum and called it an observability stack.

If you're staring at your own wall of panels wondering why none of it ever helps in a crisis, try the question. Walk each panel and ask what you'd do differently if it changed. Be honest. Delete the ones that don't survive. You'll end up with far fewer dashboards and, oddly, far more confidence, because the handful that remain are the handful that have actually earned a place on the screen. The rest were just nice to look at, and nice to look at is not the job.