i have too many dashboards and not enough alerts

I counted my Grafana dashboards last weekend. Thirty-one. For a homelab that runs maybe a dozen meaningful things. Most of them I have not opened since the night I built them, in that brief euphoric phase where every new exporter feels like it deserves its own page of graphs.

The uncomfortable truth is that dashboards are not monitoring. They are monitoring's photogenic cousin. A dashboard tells you something is wrong only if you happen to be looking at it, which, for a thing in a cupboard you ignore for weeks at a time, is approximately never. What actually protects you is an alert that reaches you when you are not looking. I had built thirty-one of the former and about four of the latter.

the dashboard trap

It is so easy to fall into. You add node_exporter, and suddenly Prometheus is drowning in lovely metrics: CPU per core, memory, disk I/O, network, filesystem, entropy, context switches. Grafana has community dashboards for all of it, you import one, and now you have a beautiful page. Repeat for cAdvisor, for the ZFS exporter, for SMART, for the UPS, for the PDU. Each is satisfying to make. None of them will ever wake you up.

What I genuinely watch, when I am honest, is one overview page. Are the hosts up, is anything pegged, is a disk filling. Everything else is there for when an alert has already told me where to look, which is exactly the right job for a dashboard and exactly the wrong job to have thirty-one of.

inverting it

So I am inverting the whole thing. The question is no longer "what can I graph" but "what would I want to be told about at 2am". For me that short list is:

a filesystem above 85% (because journald taught me that lesson recently)
a host that has stopped reporting for more than five minutes
a ZFS pool that is degraded or turning up errors during a scrub
a container restart-looping
the UPS running on battery

That is it. Each one becomes an alerting rule in Prometheus, routed through Alertmanager to a single channel I will actually see. A rule looks like this, which is hardly intimidating:

- alert: FilesystemNearlyFull
  expr: |
    node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
      / node_filesystem_size_bytes < 0.15
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.instance }} {{ $labels.mountpoint }} below 15% free"

The for: 10m matters more than people give it credit for. The condition has to hold continuously for ten minutes before the alert fires, so a brief spike during a backup does not page me; only a genuine, sustained problem does. The fastest way to teach yourself to ignore alerts is to send yourself flappy ones, and an ignored alert is worse than no alert because it lets you believe you are covered.
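The same shape covers the "host that has stopped reporting" item from the list. What follows is a minimal sketch, not a tested config: it leans on the built-in up metric that Prometheus records for every scrape target, and it assumes the node_exporter scrape job is named node, which may well differ in your setup. The alert name and severity are likewise just illustrative choices.

```yaml
- alert: HostDown
  # up is 0 when the most recent scrape of a target failed.
  # job="node" is an assumed job name; adjust to match your scrape config.
  expr: up{job="node"} == 0
  # "more than five minutes" from the list above, expressed as a for: clause
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} has not answered a scrape for 5 minutes"
```

One caveat worth knowing: this only catches targets Prometheus still knows about. If a host drops out of the scrape config or service discovery entirely, there is no up series left to compare with zero, and the rule stays quiet.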

I am not deleting the dashboards. They are cheap to keep and occasionally one is exactly what I need mid-incident. But I have stopped pretending they are doing the watching. The watching is now five rules and one notification channel, and for the first time the homelab will tell me when it needs me rather than waiting for me to wander past and notice. That, belatedly, is the point of monitoring something you otherwise ignore.
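For the "one notification channel" half, an Alertmanager config can be about as short as the rules. This is a sketch under assumptions: the receiver name homelab and the webhook URL are hypothetical placeholders, and a webhook is only one of several receiver types Alertmanager supports, so swap in whichever one reaches the channel you will actually see.

```yaml
route:
  # Everything goes to one receiver; no sub-routes, no per-team trees.
  receiver: homelab
  # Bundle alerts that share a name and instance into one notification.
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  # Re-send a still-firing alert every 4 hours, not every few minutes.
  repeat_interval: 4h

receivers:
  - name: homelab
    webhook_configs:
      # Hypothetical endpoint; replace with your own notifier.
      - url: 'http://notifier.example.internal:8080/alert'
        send_resolved: true
```

The long repeat_interval is the routing-side counterpart of for: in the rules: it keeps a sustained problem from turning into a stream of identical messages that trains you to stop reading them.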