Ramblings of an aging IT geek
debugging

the outage that was just a full /var

A production service falling over for the most embarrassingly mundane reason there is, a disk full of logs, and the cheap monitoring that would have caught it.


The service was throwing 500s and the logs said nothing useful, which should have been the tell. When the logs go quiet during an incident, it's worth asking whether the logs can be written at all. They could not. /var was full.

The cause was the dullest one in the catalogue. Logs were being rotated but the old files were never cleaned up, an application had become far chattier than usual after a deploy, and the two together had filled the partition. Postgres couldn't write its WAL, the app couldn't write its access log, and everything that needed to put a byte on /var started failing in its own way. df -h told the whole story in one line, and du -sh /var/log/* | sort -h named the culprit in the next.

$ df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        20G   20G   12K 100% /var
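
For reference, the du step can be wrapped in a small helper. This is only a sketch: the function name largest_entries is made up for illustration, and it assumes a sort that understands -h (GNU coreutils does).

```shell
#!/bin/sh
# largest_entries DIR [N]: print the N largest entries directly under DIR,
# biggest last. The function name is illustrative, not a standard tool.
# Requires a sort that supports -h (human-readable sizes), e.g. GNU sort.
largest_entries() {
    dir=${1:-/var/log}
    count=${2:-10}
    # 2>/dev/null hides "permission denied" noise when not run as root
    du -sh -- "$dir"/* 2>/dev/null | sort -h | tail -n "$count"
}

# Usage:
#   largest_entries /var/log 5
```

The last line of output is the biggest entry, which during an incident like this one is usually the file to look at first.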

The fix took two minutes: truncate the offending log, restart the database, watch the 500s clear. The embarrassing part is that this is a solved problem. Disk-full is not a clever failure. It's the first thing you'd put a check on if you sat down and listed what could go wrong, and somewhere along the way nobody had, on this box.
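
A rough sketch of that two-minute fix, with the caveat that the log path and the service name below are placeholders rather than the ones from this incident. The point it illustrates is truncating instead of deleting: if a process still has the log open, removing the file does not release the space until that process closes it, whereas truncating in place frees it straight away.

```shell
#!/bin/sh
# truncate_log FILE: empty a log file in place without removing it.
# The file keeps its inode, so a process that still has it open carries on
# writing to the same file, and the disk space is released immediately.
truncate_log() {
    log_file=$1
    if [ ! -f "$log_file" ]; then
        echo "not a regular file: $log_file" >&2
        return 1
    fi
    : > "$log_file"
}

# Usage (hypothetical path and unit name, adjust for your system):
#   truncate_log /var/log/myapp/app.log
#   systemctl restart postgresql
```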

So the actual fix wasn't clearing the disk, it was the alert that should have existed all along: a check that pages when any partition crosses 85%, well before it becomes an outage. That's a handful of lines in whatever monitoring you already run, and it turns a 2am incident into a calm ticket you handle on a Tuesday. The logrotate config got a maxsize too, so a chatty deploy triggers a rotation on size instead of waiting for the nightly one. (maxsize is only checked when logrotate actually runs, so logrotate has to run more often than nightly for this to help.)
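
If there is no monitoring system to hang the check on, something like the following would do as a stopgap from cron. It is a sketch under assumptions, not what runs on the box in question: the function names are invented, the 85 default simply mirrors the threshold above, and it treats "at or above" as crossing. It parses df -P (the POSIX output format) and will misread mount points that contain spaces.

```shell
#!/bin/sh
# Threshold in percent; 85 mirrors the figure in the post. Override via env.
THRESHOLD=${THRESHOLD:-85}

# over_threshold LIMIT: read `df -P` style output on stdin and print
# "mountpoint percent" for every filesystem at or above LIMIT.
# Assumes no spaces in device names or mount points.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # "100%" -> "100"
        if (use + 0 >= limit + 0) print $6, use
    }'
}

# check_disks: return non-zero when anything is over the threshold, so
# cron or a monitoring wrapper can turn that into an alert.
check_disks() {
    full=$(df -P | over_threshold "$THRESHOLD")
    if [ -n "$full" ]; then
        echo "disk usage at or above ${THRESHOLD}%:" >&2
        echo "$full" >&2
        return 1
    fi
    return 0
}

# Usage:
#   check_disks || echo "page someone"
```

Keeping the parsing in its own function means it can be fed canned df output, which makes the threshold logic easy to check without needing a full disk to hand.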

The lesson, such as it is: the boring failures are the ones that get you, precisely because they're too boring to design against. Nobody writes a postmortem proud of "the disk filled up". But it's an outage all the same, and it's the cheapest one in the world to prevent. Check your disks. That's the whole post.