the outage was just /var being full

The app started returning 500s at about half four, and the logs said the database couldn't write. So naturally everyone went and stared at the database, which was fine, which is the worst kind of fine because it sends you off looking for ghosts.

The database wasn't the problem. The disk was. Running df -h on the box showed /var at 100%, and a database that can't write a single byte to disk looks, from the outside, exactly like a database that's broken. It wasn't broken. It just had nowhere to put anything.

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        20G   20G   20M 100% /var

The culprit was a debug log that somebody had switched on weeks earlier to chase an unrelated issue and never switched off, and that logrotate had never been told about. So it just grew, quietly, a few hundred megabytes a day, until the day it didn't fit. No alert, because nobody was watching free space on that volume. The fix took thirty seconds: truncate the log, restart the service, add a logrotate stanza. The annoying part is that it would have taken thirty seconds three weeks earlier too, if anything had been looking.
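
For illustration, here is one way a runaway file like that could be tracked down once df has pointed at the volume. This is a rough sketch, not a record of how the log was actually found: the function name largest_files and the /var/log path in the usage note are assumptions made for the example.

```python
import os


def largest_files(root, limit=5):
    """Return up to `limit` (size_bytes, path) pairs under root, biggest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat measures a symlink itself rather than following it.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # Files can vanish or be unreadable mid-walk; skip them.
                continue
    sizes.sort(reverse=True)
    return sizes[:limit]
```

Calling largest_files("/var/log") and printing the result would put a log that has been growing for weeks at the top of the list. Run it as a user who can read the directory, otherwise unreadable subdirectories are silently skipped.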

I added a disk-space check to monitoring that same evening, the cheapest alert in the world and the one I'd somehow never set up on that host. "The disk is full" is the most boring root cause there is, and it has cost me more hours than anything clever ever has.
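
A check like that doesn't need to be anything clever. As a minimal sketch in Python, assuming a 10% free-space threshold (the function names and the threshold are made up for this example and are not taken from the real monitoring setup):

```python
import shutil


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def check_disk(path, min_free=0.10):
    """Return (ok, message); ok is False when free space is below min_free."""
    fraction = free_fraction(path)
    ok = fraction >= min_free
    state = "OK" if ok else "LOW"
    # The 10% default is an assumed threshold; tune it per volume.
    return ok, f"{state}: {path} has {fraction:.1%} free (threshold {min_free:.0%})"
```

Something like check_disk("/var") run on a schedule, with the message sent wherever alerts already go, would have flagged this volume well before it hit 100%.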