i monitored everything and learned nothing

For about six months my homelab had more dashboards than it had services. I'm not exaggerating for effect. I sat down one Sunday to tidy up and counted forty-one panels across nine Grafana dashboards, watching over maybe a dozen actual things. CPU, memory, disk, network, per host, per container, per interface, in three different colour schemes because I'd kept changing my mind about which one was less hideous.

It looked magnificent. I had a wall-mounted tablet in the hallway cycling through them. Visitors were impressed. And when my NAS started dropping SMB connections one evening, I learned about it because my wife told me the photos had stopped loading, not because any of my forty-one panels said a word.

That was the moment the penny dropped. I had built observability theatre.

how it happened

It happened the way these things always happen: one good decision at a time, each individually sensible, collectively a disaster.

I started with Prometheus and node_exporter because I wanted to know if a box was about to run out of disk. Reasonable. Then I added cAdvisor for container metrics, because why not. Then blackbox_exporter for endpoint checks. Then an SNMP exporter for the switch and the UPS. Each new exporter came with a community Grafana dashboard you could import with a single ID, and importing them is delicious. You paste a number, click load, and suddenly you have a beautiful page full of graphs you did not have to build.

The trouble is that an imported dashboard shows you everything the author thought might ever be interesting, which is not the same as the handful of things you need to know about your setup. So you end up with twelve panels per host, and the one number that matters, "is the array degraded", is either not there or buried on row four.

dashboards are for investigating, not for noticing

Here is the thing I'd got completely backwards. A dashboard is a tool for answering a question you already have. It is terrible at telling you that you should have a question. Nobody watches a wall of graphs at 2am. Nobody watches them at 2pm either, not really. You glance, everything looks roughly green, you move on.

The job of noticing belongs to alerting, and I had almost none. I'd been so busy making things pretty that I'd skipped the boring part: deciding what conditions actually warrant interrupting my evening.

So I rebuilt around that question. For every service I asked: what's the one symptom that means a human needs to act? Not "CPU is at 80 percent", which is usually fine and occasionally meaningless. The symptom. The NAS is the photos and the backups, so the alerts are: array is degraded, a disk's SMART status went non-zero, free space dropped below a week's growth, or SMB stopped answering. Four rules. That's it.

I put the alert rules in Prometheus and let Alertmanager route the notifications to a Telegram bot, because a push to my phone is harder to ignore than an email I'll read tomorrow. A rule looks roughly like this:

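groups:
  - name: nas
    rules:
      - alert: NasShareUnreachable
        # probe_success is 0 when the blackbox_exporter probe fails; instance
        # must match the label the blackbox scrape job gives the NAS target.
        expr: probe_success{job="blackbox", instance="smb://nas"} == 0
        for: 2m
        labels:
          severity: page  # available to Alertmanager for routing
        annotations:
          summary: "NAS SMB share not responding for 2 minutes"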
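For probe_success to exist at all, blackbox_exporter needs a probe module and Prometheus needs a scrape job that points at it. As far as I know blackbox_exporter has no SMB-aware prober, so the closest thing is a plain TCP connect to port 445, which proves the port is open but not that a share can be listed. A minimal sketch, where the nas hostname, the module name and the exporter address localhost:9115 are all assumptions:

```yaml
# blackbox.yml (blackbox_exporter config)
modules:
  tcp_connect:
    prober: tcp        # succeeds if a TCP connection can be opened
    timeout: 5s
---
# prometheus.yml (scrape job that asks the exporter to probe the NAS)
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ["nas:445"]            # assumed hostname, SMB port
    relabel_configs:
      # Pass the listed target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed target as the instance label on the results
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself rather than the NAS
      - target_label: __address__
        replacement: "localhost:9115"   # assumed exporter address
```

With this relabelling the instance label comes out as nas:445, so the selector in an alert rule has to match whatever string ends up there.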

The for: 2m matters more than the expression. It makes Prometheus hold the alert back until the condition has been true for two unbroken minutes. Most of my early alerting pain was flapping: a thirty-second blip at 3am that woke me for nothing and trained me to ignore the channel. An alert you've learned to ignore is worse than no alert, because it costs you sleep and gives you nothing.
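The other three NAS conditions can be sketched in the same shape, each with its own for: window. This is a guess at one workable setup, not a record of mine: it assumes Linux md RAID with node_exporter's mdadm collector, smartctl_exporter for disk health, and a data volume mounted at the hypothetical path /mnt/pool. Note that smartctl_exporter reports 1 for a passing health check, so the comparison is == 0 there; other exporters use the opposite polarity.

```yaml
groups:
  - name: nas-health
    rules:
      # Array degraded: fewer active member disks than the array requires.
      - alert: NasArrayDegraded
        expr: node_md_disks_required - ignoring(state) node_md_disks{state="active"} > 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "md array {{ $labels.device }} is missing a member disk"

      # SMART overall health check no longer passing (1 = passed, 0 = failed).
      - alert: NasDiskSmartFailing
        expr: smartctl_device_smart_status == 0
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "SMART health check failing on {{ $labels.device }}"

      # Free space below one week's growth: fit a line to the last 7 days of
      # available bytes and check whether it crosses zero within the next 7.
      - alert: NasFullWithinAWeek
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/mnt/pool"}[7d], 7 * 24 * 3600) < 0
        for: 1h
        labels:
          severity: page
        annotations:
          summary: "NAS volume {{ $labels.mountpoint }} is on course to fill within a week"
```

The for: values are arbitrary starting points. The slower the underlying condition changes, the longer the window can be without costing any real warning time.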

what i actually kept

I didn't delete the dashboards. I demoted them. They live behind a link now, not on a wall, and they exist for one purpose: when an alert fires, I open the relevant dashboard to find out why. Alert tells me something is wrong, dashboard helps me work out what. That division of labour is the whole lesson.
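One way to make that hand-off literal is to let the Telegram message carry a link to the relevant dashboard. A sketch of the Alertmanager side follows. The receiver names, the chat ID, the token placeholder and the dashboard annotation are all assumptions for illustration, and so is the choice to drop anything not labelled severity: page. The rule shown earlier has no dashboard annotation, so the template skips the link when it is absent.

```yaml
route:
  receiver: ignore              # default: alerts not matched below notify nobody
  group_by: [alertname]
  routes:
    - matchers:
        - severity="page"
      receiver: telegram
      repeat_interval: 12h      # re-send a still-firing alert twice a day

receivers:
  - name: ignore                # a receiver with no integrations sends nothing
  - name: telegram
    telegram_configs:
      - bot_token: "REPLACE_WITH_BOT_TOKEN"   # placeholder, not a real token
        chat_id: 123456789                    # placeholder chat ID
        parse_mode: ""                        # plain text, no HTML or Markdown
        send_resolved: true
        message: |
          {{ range .Alerts }}{{ .Status }}: {{ .Annotations.summary }}
          {{ with .Annotations.dashboard }}Dashboard: {{ . }}
          {{ end }}{{ end }}
```

A rule would opt in by adding a dashboard entry under annotations with the URL of the page to open when it fires.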

I kept exactly one always-on view, and it's deliberately boring. A single panel, one row per service, green or red, "is this thing doing its job". No CPU graphs. No pretty gradients. If it's all green I don't look at it, which is the point.
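The data behind a panel like that can be one series per service. A sketch as a Prometheus recording rule, assuming every blackbox target carries a hypothetical service label (set under labels in the scrape job) and that a successful probe is a fair proxy for "doing its job":

```yaml
groups:
  - name: service-status
    rules:
      # 1 if every probe for the service succeeded, 0 if any of them failed.
      - record: service:probe_success:min
        expr: min by (service) (probe_success)
```

A table or stat panel over service:probe_success:min, with 1 mapped to green and 0 to red, then gives one row per service and nothing else.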

The wider catalogue of metrics still gets collected, because storage is cheap and you can't graph history you didn't record. But collecting a metric and putting it on a screen are different decisions, and conflating them is how you end up with forty-one panels and a NAS that fails silently.

the rule i wish i'd started with

If a panel has never once changed a decision you made, it isn't monitoring. It's decoration. Decoration is allowed, homelabs are meant to be fun, but be honest about which is which. The test I use now before adding anything to a dashboard: "what would I do differently if this number went red?" If the answer is "nothing" or "I'd have to go and look at three other things first", it doesn't earn a place at the top.

My hallway tablet is dark these days. The photos load. And on the rare evening something does break, my phone buzzes before anyone has to come and tell me. That, it turns out, was all I ever actually wanted from the whole edifice.