Ramblings of an aging IT geek
debugging

The outage caused by a full /var

A production incident that turned out to be nothing more exotic than a disk filling up under /var, and the small habits that would have caught it.

The service started returning 500s at about four in the afternoon, which is the worst possible time for anything to break because it is too late to be fresh and too early to go home. The app logs said nothing useful. The database was fine. The load balancer was healthy. Everything you would naturally suspect was innocent, which is always a bad sign, because it means the cause is something you have stopped looking at.

It was the disk. Specifically /var, at 100%.

The trail, once I bothered to look, was embarrassingly short. df -h showed /var completely full. The application could not write its session files, the database could not extend its write-ahead log, and the logging daemon could not even record that any of this was going wrong, which is why the logs were so peacefully quiet. A full disk does not announce itself loudly. It just makes everything that touches the filesystem fail in slightly different ways at slightly different times, so the symptoms scatter and none of them point at the cause.
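For anyone who would rather have the application notice this itself, here is a minimal sketch in Python. It assumes nothing about the actual service in this story. The function names are made up for illustration. It shows two things: how to recognise the operating system's "No space left on device" error when a write fails, and how to read the same kind of figure that df -h reports.

```python
import errno
import shutil


def is_disk_full(exc: BaseException) -> bool:
    """True when the exception is the OS reporting 'No space left on device'."""
    return isinstance(exc, OSError) and exc.errno == errno.ENOSPC


def percent_used(path: str) -> float:
    """Percentage of the filesystem holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total
```

Calling percent_used("/var") gives a number close to the Use% column in df, though it can differ a little because some filesystems reserve blocks for root. The is_disk_full check only helps if the failing code path can still report the error somewhere other than the full disk.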

What had filled it was the dullest thing imaginable: log files. A library had been bumped to debug verbosity during a deploy a few weeks earlier, the change had quietly shipped, and logrotate was configured to rotate by time rather than size. So the logs grew, the rotation politely waited for its daily schedule, and the disk lost the race. du -sh /var/log/* | sort -h made the culprit obvious in seconds, once I was looking in the right place at all.
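The du and sort pipeline is the quickest way to do this at a shell. As a rough equivalent for a script or a health check, the sketch below ranks the entries of a directory by total size. It is an approximation under stated assumptions: it sums apparent file sizes rather than allocated blocks, so its numbers will not match du exactly, and the function names are hypothetical.

```python
import os


def tree_size(path: str) -> int:
    """Sum the apparent sizes, in bytes, of the files at or under `path`."""
    if os.path.islink(path) or not os.path.isdir(path):
        try:
            return os.lstat(path).st_size
        except OSError:
            return 0  # entry vanished or is unreadable
    total = 0
    # os.walk does not follow symlinked directories by default.
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_entries(directory: str, limit: int = 5) -> list[tuple[int, str]]:
    """Return up to `limit` (size_bytes, path) pairs, largest first."""
    sized = [(tree_size(entry.path), entry.path) for entry in os.scandir(directory)]
    sized.sort(reverse=True)
    return sized[:limit]
```

Running largest_entries("/var/log") would put a runaway debug log at the top of the list, which is all the original one-liner needed to do.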

The fix on the day was trivial. Truncate the offending log, dial the verbosity back to where it should have been, restart the service, watch the 500s stop. Ten minutes, most of it spent feeling slightly foolish.
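One detail worth spelling out is why the log gets truncated rather than deleted. If a process still has the file open, removing it does not release the space until that process closes it, whereas truncating in place frees the space immediately. A minimal sketch, with a hypothetical helper name:

```python
import os


def truncate_in_place(path: str) -> int:
    """Empty the file at `path` without removing it; return its previous size."""
    size_before = os.path.getsize(path)
    # Truncating keeps the same inode, so a process that still has the file
    # open keeps a valid handle. Deleting the file instead would not free
    # space until that process closed it.
    os.truncate(path, 0)
    return size_before
```

This assumes the writer opened the log in append mode, which is the usual case. A writer that tracks its own offset may carry on writing at the old position and leave a sparse file behind, so restarting the service afterwards is still a sensible step.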

The fix that mattered came after. I added a disk-usage alert at 80% on every volume, not just the headline data disk, because /var is exactly the boring partition nobody watches until it bites. Logs now rotate by size as well as by time. And I left a note to myself that when the application logs go suspiciously silent during an incident, the very first thing to check is whether the thing writing the logs can actually write anything at all. A full disk is not a clever failure. It is just one that hides in the one place you stopped instrumenting, and it will happily take down a perfectly good system while you stare at the database.
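Both of those habits are small enough to sketch. The code below is an illustration, not the monitoring actually deployed here. The names volumes_over and can_write are hypothetical, the list of mount points is something you would supply yourself, and only the 80% threshold comes from this post.

```python
import os
import shutil
import tempfile

ALERT_THRESHOLD = 80.0  # percent; the alert level mentioned in the post


def volumes_over(paths, threshold=ALERT_THRESHOLD, usage=shutil.disk_usage):
    """Return (path, percent_used) for each path at or above `threshold`."""
    alerts = []
    for path in paths:
        stats = usage(path)  # object with .total and .used in bytes
        percent = 100.0 * stats.used / stats.total if stats.total else 0.0
        if percent >= threshold:
            alerts.append((path, round(percent, 1)))
    return alerts


def can_write(directory: str) -> bool:
    """Try to create, write and sync a small temporary file in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"probe")
            probe.flush()
            os.fsync(probe.fileno())  # force the data to the filesystem
        return True
    except OSError:
        return False
```

A cron job could call volumes_over(["/", "/var", "/data"]) and page someone if the list comes back non-empty, and can_write("/var/log") is the quick question to ask when the logs go quiet. A probe run as root may still succeed on a nearly full disk where the filesystem reserves blocks for root, so it is best run as the same user as the service.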