I have Prometheus scraping everything in the rack and Grafana sitting on top of it, and at last count fourteen dashboards. CPU, memory, disk, network, per-container stats, temperatures, the lot. It looks magnificent. It is, in practice, almost completely useless.
The problem is that a dashboard only helps if you are looking at it, and I am never looking at it. I look at a dashboard once, when I am building it and admiring my work, and then never again until something has already gone wrong and I am hunting for the cause. By then the graph is just confirming what I already know from the thing being broken.
What actually saves me is the boring stuff I keep neglecting: alerts. A single Prometheus alerting rule that pings me through Alertmanager when a disk crosses 85% has caught more real problems than all fourteen dashboards combined, because it comes to me rather than waiting for me to go and look. The dashboard is passive. The alert is the bit that does the work.
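A disk alert of that kind is only a few lines. What follows is a minimal sketch, not a definitive config: it assumes the filesystem metrics come from node_exporter, and the group name, alert name, ten-minute hold, severity label and filesystem filter are all illustrative choices.

```yaml
# Prometheus rule file (loaded via rule_files in prometheus.yml).
# Assumes node_exporter is exposing node_filesystem_* metrics.
groups:
  - name: disk            # hypothetical group name
    rules:
      - alert: DiskUsageHigh   # hypothetical alert name
        # Percentage of space used per filesystem, ignoring
        # tmpfs and overlay mounts (illustrative filter).
        expr: |
          (
            1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
          ) * 100 > 85
        # Only fire if the condition holds for 10 minutes,
        # so a brief spike does not page anyone.
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} is over 85% full"
```

Prometheus evaluates the expression and hands firing alerts to Alertmanager, which decides where the notification goes. The threshold and the hold time are the two knobs worth tuning first.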
So the resolution, written down so I am held to it: stop building dashboards I won't read, and put that time into a handful of alerts that fire before things break. A graph tells you what happened after you already care. An alert is what makes you care in time. I have fourteen of the former and about three of the latter, and the ratio is exactly the wrong way round.