Ramblings of an aging IT geek
homelab

i built a dashboard for everything and could see nothing

How my homelab monitoring grew into forty Grafana panels nobody read, and the cull that made it useful again.

[Image: a server rack with monitoring dashboards on a nearby screen]

It started, as these things do, with a single Grafana dashboard and the best of intentions. Prometheus scraping the hosts, node_exporter handing over CPU and memory and disk, a tidy little wall of graphs to glance at over coffee. It was lovely. I felt like a proper operations team of one.

Eighteen months later I had something closer to a mission control display, and it was worse than useless. Worse, because a dashboard you cannot read at a glance does not just fail to help, it actively trains you to stop looking. I had built the monitoring equivalent of a hoarder's spare room: everything I might ever need, arranged so that I could find none of it.

how it got out of hand

Every time something went wrong, my instinct was to add a panel so I would "see it next time." A disk filled up: add a disk panel. A container restarted: add a restart-count panel. An odd network spike: add three network panels, because why not, the data was already there.

None of those decisions was wrong on its own. Collectively they produced a dashboard of somewhere past forty panels that I had to scroll through, where the genuinely important signals (is anything actually down, are we about to run out of disk) sat shoulder to shoulder with curiosities I had added during one incident in 2018 and never looked at since. The signal-to-noise ratio had quietly inverted. When everything is on the dashboard, nothing is on the dashboard.

[Image: a home lab with a monitoring display showing many graphs]

The breaking point was an actual outage where a service had been down for a couple of hours before I noticed, despite the relevant graph being right there on the wall. It was right there, and it was lost amongst thirty-nine other things that were all completely fine, so my eye slid past it. The dashboard had so much information that it conveyed none.

the cull

I did the obvious thing far too late: I deleted most of it. Not archived, deleted, because a panel I am "keeping just in case" is a panel I will scroll past during the next incident. The data still lives in Prometheus regardless; I can always build a panel back when I genuinely need to investigate something. The dashboard is not the data store. The dashboard is the thing I look at when I have ten seconds and a coffee.

I reorganised around a simple idea I should have started with: separate the looking from the alerting, and separate the overview from the detail.

The alerting is the important half, and it does not live on a screen at all. Alertmanager pages me when something is actually wrong: a host down, a disk that will fill within the day at its current rate, a service that has stopped responding. The whole point of an alert is that I do not have to be looking. If a condition matters enough to act on, it should find me, not wait for me to notice it on a wall I have stopped watching.

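groups:
  - name: homelab
    rules:
      - alert: DiskWillFillSoon
        # Fit a line to the last 6h of free space and extrapolate 24h ahead;
        # a prediction below zero means the filesystem is on course to fill.
        expr: predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0
        # Only fire once the prediction has held for 30 minutes.
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} disk fills within 24h at current rate"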

That predict_linear rule is the single most useful thing in the whole setup, because it tells me about a problem before it is a problem. A disk at ninety percent is not interesting. A disk that, at its current rate of growth, hits full within a day, that is worth a notification. A static threshold cannot tell those two apart: a disk that has sat happily at ninety-two percent for a year and a disk climbing through ninety on its way to full by tea-time look identical to a simple > 90 rule, and one of them is fine while the other is about to ruin my evening. The rate of change is the signal. The absolute number rarely is.
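For contrast, here is roughly what the static version of that rule looks like. This is a sketch for comparison only, not something from my running config: the alert name, group name and severity are made up, and it assumes the standard node_exporter filesystem metrics and their mountpoint label.

```yaml
groups:
  - name: homelab_static_threshold_example   # hypothetical group name
    rules:
      # Static threshold, shown only for comparison. It fires for any
      # filesystem above 90% used, whether it is climbing fast or has
      # been flat for a year, so it cannot separate the two cases.
      - alert: DiskAbove90Percent            # hypothetical alert name
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is over 90% full"
```

The only input to this expression is the current fill level. The predict_linear rule looks at the slope over the last six hours instead, which is why it stays quiet for the disk that has been parked at ninety-two percent and speaks up for the one that is climbing.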

I had to learn the same lesson about thresholds in general. My early alerts fired on instantaneous values, so a thirty-second CPU spike during a backup would page me at two in the morning for something that had already resolved by the time I read the message. The for: 30m clause in that rule is doing quiet, important work: it says do not bother me unless the condition has held long enough to be real. An alert that cries wolf is worse than no alert, because the cost is not just the false page, it is that I start ignoring the channel, and then the true alert arrives and lands in a folder I have trained myself not to read.
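The other conditions worth a page (a host down, a service not responding, CPU that stays pinned) can take the same shape, each with its own for: clause. What follows is a sketch rather than my exact rules. The job="node" label, the use of blackbox_exporter's probe_success metric for service checks, and all of the durations and thresholds are assumptions you would adjust to your own setup.

```yaml
groups:
  - name: homelab_availability               # hypothetical group name
    rules:
      # A scrape target that has been unreachable for five minutes.
      # Assumes node_exporter is scraped under a job called "node".
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"

      # Assumes services are probed with blackbox_exporter, which
      # exposes probe_success as 1 (probe passed) or 0 (probe failed).
      - alert: ServiceNotResponding
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has failed its probe for 5 minutes"

      # Sustained CPU load only: a short spike during a backup never
      # holds the condition for the full 30 minutes, so it never pages.
      - alert: CpuSustainedHigh
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} CPU above 90% for 30 minutes"
```

The durations differ on purpose. A host that is down is worth knowing about within minutes, while busy CPU is only interesting once it has plainly stopped being a blip.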

What is left on the actual dashboards is deliberately small. One overview screen: are the hosts up, are the core services responding, the aggregate CPU and memory and disk for the cluster, and nothing else. If that screen is all green I am done and I close the tab. Then a handful of detail dashboards, one per significant service, that I only open when an alert has already told me where to look. They can be as dense as they like, because by the time I am reading one I have a specific question.
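One way to keep that overview screen down to a handful of numbers is to precompute the aggregates as recording rules and point each panel at a single series. This is a sketch of the idea, not a description of my setup: the rule names, the job="node" label and the fstype filter are all assumptions, and the metric names are the standard Linux node_exporter ones.

```yaml
groups:
  - name: homelab_overview                   # hypothetical group name
    rules:
      # How many node_exporter targets are currently up.
      - record: cluster:up:sum
        expr: sum(up{job="node"})

      # Fraction of CPU time across all hosts that was not idle.
      - record: cluster:node_cpu_utilisation:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Fraction of memory in use across all hosts.
      - record: cluster:node_memory_utilisation:ratio
        expr: 1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)

      # Fraction of disk in use, skipping pseudo-filesystems.
      - record: cluster:node_filesystem_utilisation:ratio
        expr: >-
          1 - sum(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"})
            / sum(node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})
```

That is four series for four panels, each of them either fine or not.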

what i actually learned

The mistake was treating "I can graph this" as a reason to graph it. The data being available is not the same as the data being worth a permanent place in front of my eyes. Collection should be greedy: scrape everything, keep it, you never know what you will need to query during an incident. Display should be stingy: show me only what changes a decision.

The honest measure of a monitoring setup is not how much it shows you. It is whether it tells you the one thing you need to know before you have to go looking for it. Mine does that now, with about a tenth of the panels and a few good alerts, and I trust it again, which is the part that actually matters.