Ramblings of an aging IT geek
debugging

the night /var filled up and took the app with it

A small outage that came down to a full /var, an unrotated log, and the disappointing lesson that monitoring you don't alert on is just a dashboard.


The app started throwing 500s at about nine in the evening, which is exactly when nobody wants to be paged. The logs were unhelpful in the most ironic way possible: they'd stopped being written, because the thing that had failed was the act of writing logs.

df -h told the whole story in one line. /var was at 100%. Not the data volume, not the root partition I actually watch, but /var, sitting on its own small filesystem that I'd carved off years ago and then promptly forgotten about. A debug log on a chatty service had been left at verbose after a deploy, logrotate wasn't covering that path, and it had quietly eaten the lot over about a fortnight.
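df -h says which filesystem is full, not which file did it. For that second step, here is a minimal sketch that walks a directory tree and lists the biggest regular files. The function name largest_files is made up for illustration, and this is not what was run on the night. It reports apparent file sizes, and unlike du -x it does not stop at filesystem boundaries, so treat the output as a pointer rather than an exact accounting.

```python
import heapq
import os
import stat


def largest_files(root, n=10):
    """Return the n largest regular files under root as (size_bytes, path), biggest first."""
    sizes = []
    # os.walk silently skips directories it cannot read and does not
    # descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: look at the entry itself, not a symlink target
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                sizes.append((st.st_size, path))
    return heapq.nlargest(n, sizes)
```

Calling largest_files("/var/log", n=5) as a user who can read the tree would return up to five (size, path) pairs, largest first.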

The fix took ninety seconds. Truncate the offending log with : > /var/log/the-noisy-one.log, restart the service, watch the 500s stop. The embarrassing part is that I had a Grafana panel for disk usage on that very host. I'd just never put an alert on it, so the graph dutifully climbed towards the cliff for two weeks and told nobody. A dashboard you have to remember to look at is not monitoring. It's decoration.
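On the truncation step: : > file empties the log rather than removing it, and that distinction matters. Deleting a file that a running process still holds open does not give the space back until the process closes it. If the cleanup lived in a script instead of a shell one-liner, a rough Python equivalent might look like this. The function name is hypothetical.

```python
import os


def truncate_in_place(path):
    """Cut the file at path down to zero bytes without deleting or replacing it."""
    # The file keeps its inode, so a process that already has it open
    # carries on writing to the same file.
    os.truncate(path, 0)
```

One caveat: a writer that did not open the log in append mode keeps its old file offset, so the log can reappear as a sparse file with a large apparent size. Restarting the service afterwards, as above, avoids that.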

I added a logrotate stanza for the path, dropped the log level back to where it should have been, and wired a disk-usage alert at 85% on every filesystem rather than just the obvious ones. The real lesson isn't about logs. It's that I'd been measuring the thing without arranging to be told when it went wrong, and those are not the same job.
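To make "every filesystem, not just the obvious ones" concrete, here is a minimal sketch of the check itself, assuming a Linux host where /proc/mounts lists the mounted filesystems. It is not the Grafana alert described above, just the same rule written out as something a cron job could run. The function names are made up for illustration, and the 85% default is the threshold from the post.

```python
import shutil

THRESHOLD_PERCENT = 85  # the alert level mentioned above


def percent_used(path):
    """Percentage of the filesystem holding path that is in use, roughly as df reports it."""
    usage = shutil.disk_usage(path)
    # used + free can be below total on filesystems that reserve blocks for root,
    # so divide by what is actually usable rather than by total.
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0  # pseudo-filesystems such as /proc report zero size
    return 100.0 * usage.used / denominator


def mount_points(mounts_file="/proc/mounts"):
    """List mount points from a Linux mounts table (second field of each line)."""
    with open(mounts_file) as f:
        return [fields[1] for fields in (line.split() for line in f) if len(fields) >= 2]


def filesystems_over(threshold=THRESHOLD_PERCENT, paths=None):
    """Return (mount_point, percent) for every filesystem at or above threshold."""
    if paths is None:
        paths = mount_points()
    over = []
    for path in paths:
        try:
            pct = percent_used(path)
        except OSError:
            # Unreadable mounts, or paths with escaped characters in the
            # mounts table, are skipped rather than crashing the check.
            continue
        if pct >= threshold:
            over.append((path, round(pct, 1)))
    return over
```

filesystems_over() returns an empty list when everything is below the threshold. The caller still has to send any non-empty result somewhere a human will see it, which is the whole point.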