Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the outage that was just a disk quietly filling up

A service kept falling over at 3am with cryptic errors, and the cause turned out to be a /var partition slowly filling with logs nobody was rotating.

A terminal full of log output with an error highlighted

A service started dying in the small hours with errors that made no sense. The application logged a database connection failure, but the database was fine. Then it logged "no space left on device", which at least sounded specific, except df -h on / showed plenty of room. So I went back to bed mildly annoyed, and it happened again the next night.

The trick was that the space wasn't gone from /. It was gone from /var, which was its own partition, and which df -h will happily tell you about if you bother to read past the first line.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   18G   32G  36% /
/dev/sdb1        10G   10G     0  100% /var

There it is. /var was wedged at 100%. The application's "database connection failure" was a lie: the real failure was that it couldn't write its own connection pool's temp files, and the error handling, written by an optimist, reported the symptom three layers up instead of the cause.

The culprit was a debug log that someone had enabled months earlier "just temporarily" and never turned off. It was writing a few hundred megabytes a day, nothing alarming, but it wasn't in logrotate's config because it wasn't where the standard logs live. So it grew, slowly, until one night it tipped /var over the edge, and because the failures happened during the nightly batch job that wrote the most, it always struck around 3am.

The fix took five minutes: truncate the log, add it to logrotate, turn the debug level back down. The lesson took longer to sink in. Monitor partition usage, not just root, and alert at 80% so you find out at teatime instead of at 3am. And when an error says one thing, check that the thing underneath it agrees. The application was certain it was a database problem. The application was wrong, the way applications usually are about why they've actually died.