i built forty graphs and looked at none of them

A server rack with blinking lights in a home cupboard

I went through a phase, and I suspect it is a phase most people with a homelab pass through, of monitoring absolutely everything. Prometheus scraping every box. Node exporter on each one. Grafana stuffed with dashboards. CPU, memory, disk, network, temperatures, fan speeds, the lot, all rendered in lovely coloured lines. It looked magnificent on the spare monitor I had propped up for the purpose.

It was also completely useless, and it took me a while to admit it.

The problem was not the data. The problem was that forty dashboards is the same as zero dashboards, because you cannot watch forty things and you certainly cannot watch them all the time. I had built a beautiful read-only museum of metrics that I glanced at when I was bored and ignored entirely when something actually broke. When a disk filled up one evening, I did not find out from my magnificent wall of graphs. I found out because a service stopped responding and I went looking. The graph confirming the disk was full had been sitting there the whole time, on a dashboard I had not opened in a fortnight.

A wall of Grafana panels glowing on a screen nobody is watching

So I did the thing that felt like vandalism and deleted most of it. Not the data collection, that costs nothing to keep, but the dashboards. I kept one overview that fits on a single screen and shows the half-dozen numbers that actually mean something: disk free on the boxes that fill up, memory on the box that leaks, whether the things that should be up are up. Everything else I left in Prometheus to be queried when I have a question, which is the only time a detailed graph is worth anything.

The real shift was moving from looking to being told. A dashboard is a pull model: it relies on me remembering to look, and I will not. An alert is a push model: it finds me. So the rules that matter now live in Alertmanager, not in panels. Disk above 85 percent, tell me. A service down for more than a couple of minutes, tell me. The backup job that did not run last night, tell me. A handful of rules, each tied to something I would actually want woken up for, or at least nudged about over a quiet notification.

- alert: DiskFillingUp
  expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.15
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: "{{ $labels.instance }} root disk under 15% free"

The test I now apply to any new panel is blunt: if this line goes bad, what would I do, and could an alert have told me instead? If the answer is "nothing" or "an alert", the panel does not get built. Monitoring is not about seeing everything. It is about being told the few things that need doing, before they become the thing you find out about the hard way. My spare monitor shows the weather now. It is more use.