A service started dying in the small hours with errors that made no sense. The application logged a database connection failure, but the database was fine. Then it logged "no space left on device", which at least sounded specific, except df -h on / showed plenty of room. So I went back to bed mildly annoyed, and it happened again the next night.
The trick was that the space wasn't gone from /. It was gone from /var, which was its own partition, and which df -h will happily tell you about if you bother to read past the first line.
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 18G 32G 36% /
/dev/sdb1 10G 10G 0 100% /var
There it is. /var was wedged at 100%. The application's "database connection failure" was a lie: the real failure was that it couldn't write its own connection pool's temp files, and the error handling, written by an optimist, reported the symptom three layers up instead of the cause.
The culprit was a debug log that someone had enabled months earlier "just temporarily" and never turned off. It was writing a few hundred megabytes a day, nothing alarming, but it wasn't in logrotate's config because it wasn't where the standard logs live. So it grew, slowly, until one night it tipped /var over the edge, and because the failures happened during the nightly batch job that wrote the most, it always struck around 3am.
The fix took five minutes: truncate the log, add it to logrotate, turn the debug level back down. The lesson took longer to sink in. Monitor partition usage, not just root, and alert at 80% so you find out at teatime instead of at 3am. And when an error says one thing, check that the thing underneath it agrees. The application was certain it was a database problem. The application was wrong, the way applications usually are about why they've actually died.