i monitored everything and learned almost nothing

I have a confession. At one point my homelab had more dashboards than it had services worth dashboarding. Eleven Grafana dashboards, a Prometheus scraping forty-odd targets, node_exporter on everything including a Raspberry Pi whose entire job was to run node_exporter. I could tell you the per-core temperature of a machine that did nothing. What I could not tell you, when something actually broke, was why.

This is the homelab monitoring trap, and I walked straight into it. Observability is fun to build. Every new exporter is a little dopamine hit: another row of green, another graph that wiggles satisfyingly. The problem is that building monitoring and using monitoring are different skills, and the first one is far more enjoyable than the second.

the wall of green

For about a year my "overview" dashboard was a single screen with thirty-six panels on it. CPU, memory, disk, network, per host, all crammed into a grid you needed a 4K monitor to read. It looked magnificent. It looked like a NASA control room. It was completely useless.

The thing nobody tells you is that a panel showing "everything is fine" 99% of the time trains your eye to ignore it. By the time something went red I had long since stopped looking. The signal drowned in its own background. I once had a disk fill up over three days and only noticed because a service started throwing errors, not because the graph that was literally tracking disk usage was sat there going up and to the right the whole time.

The breaking point was a power cut. Everything came back up, mostly, and I spent twenty minutes clicking between dashboards trying to work out what had not recovered. The monitoring that was supposed to answer "is everything healthy?" required me to manually inspect eleven different views to find out. That is not monitoring. That is a hobby that occasionally produces graphs.

the cull

So I deleted things. This was harder than building them, emotionally. Every exporter felt like an achievement I was throwing away. But I started with one question: when I am woken at 3am, what do I actually need to know?

The answer was short. Is each service reachable? Is any disk about to fill? Is anything in a crash loop? Is the house too warm because a fan died? Four questions. Not thirty-six panels.

I rebuilt around those. One dashboard, one screen, designed so that if everything is fine it is almost entirely blank. Boring is the goal. A panel only earns its place if a human would act on it changing. The per-core temperature of the NAS does not change my behaviour, so it went into a "deep dive" dashboard I open roughly never, which is the correct frequency.
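The "blank when healthy" idea also works outside Grafana. As a rough sketch, assuming Prometheus is reachable at http://localhost:9090 (its default port, but an assumption about any given setup), the Python below asks the standard HTTP query API for `up == 0` and returns an empty list when every scrape target is up. The function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus listens on its default port on this machine.
PROMETHEUS_URL = "http://localhost:9090"


def down_targets(response):
    """Extract sorted (job, instance) pairs from an instant-query response.

    `response` is the decoded JSON body returned by /api/v1/query for the
    expression `up == 0`, so every series in it is a target that is down.
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return sorted(
        (series["metric"].get("job", "?"), series["metric"].get("instance", "?"))
        for series in response["data"]["result"]
    )


def fetch_down_targets(base_url=PROMETHEUS_URL):
    """Run `up == 0` against Prometheus and return the targets that are down."""
    query = urllib.parse.urlencode({"query": "up == 0"})
    url = f"{base_url}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return down_targets(json.load(resp))
```

Calling `fetch_down_targets()` after a power cut should, in principle, give the one list that otherwise takes eleven dashboards to assemble: empty means healthy, and anything in it is something to go and look at.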

The alerting got the same treatment. I had been firing alerts for CPU above 80%, which on a homelab is just "a thing is doing work". I deleted nearly all of it and kept the alerts that map to "a human needs to do something tonight". Here is the rough shape of what survived, as a Prometheus alerting rules file (Alertmanager only routes whatever these fire):

groups:
  - name: things-that-matter
    rules:
      - alert: DiskWillFillSoon
        expr: predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} fills within 24h"
      - alert: ServiceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
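      - alert: ContainerRestarting
        # container_start_time_seconds is a gauge (a Unix timestamp), so count
        # how often it changes instead of taking rate(), which expects a counter.
        expr: changes(container_start_time_seconds[15m]) > 0
        for: 10m
        labels:
          severity: warn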

The predict_linear one is the change that paid for itself immediately. A static "disk above 90%" alert either fires too early on a disk that sits happily at 91% forever, or too late on one that goes from 60% to full in an afternoon. Asking "will this fill in the next 24 hours at the current rate?" matches how I actually think about it. It caught the next slow leak with a full day to spare, which is the first time my monitoring told me something before the service did.
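To make the difference concrete, here is a minimal Python sketch of the idea behind predict_linear: fit a straight line through recent samples and extend it forward. It is not Prometheus's implementation (this version extrapolates from the last sample rather than from the rule's evaluation time), and the two 100 GiB disks and their hourly readings are invented for illustration.

```python
def predict_linear(samples, horizon_s):
    """Least-squares line through (timestamp_s, value) samples,
    extrapolated horizon_s seconds past the last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    t_end = samples[-1][0]
    xs = [t - t_end for t, _ in samples]  # seconds relative to the last sample
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x  # fitted value at the last sample
    return intercept + slope * horizon_s


def static_alert(avail_gib, size_gib, threshold=0.90):
    """The old-style rule: fire when the used fraction exceeds the threshold."""
    return 1 - avail_gib / size_gib > threshold


SIZE_GIB = 100.0
HOUR = 3600
DAY = 24 * HOUR

# Hypothetical data: free space in GiB, sampled hourly over six hours.
steady = [(h * HOUR, 9.0) for h in range(7)]              # parked at 91% used
leaking = [(h * HOUR, 52.0 - 2.0 * h) for h in range(7)]  # 60% used, losing 2 GiB/h

for name, series in (("steady", steady), ("leaking", leaking)):
    latest = series[-1][1]
    predicted = predict_linear(series, DAY)
    print(
        f"{name}: static alert fires={static_alert(latest, SIZE_GIB)}, "
        f"predictive alert fires={predicted < 0}"
    )
```

The static rule pages for the disk that is going nowhere and stays silent about the one that runs out in under a day. The predictive rule gets both right, because the steady disk's fitted slope is zero and the leaking disk's line crosses zero well inside the 24-hour horizon.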

what i actually learned

The lesson was not "monitoring is bad" or "Prometheus is too much". Prometheus is excellent and I would not run a lab without it. The lesson was that a dashboard is a question, and most of mine were questions I never asked. Coverage is not the metric. The metric is how quickly you get from "something feels off" to "here is what is wrong", and adding panels usually makes that worse, not better.

These days when I am tempted to add a graph I make myself name the decision it informs. If I cannot, it does not get built. My lab is quieter now, in every sense, and for the first time the green actually means something, because I trust that red would shout.