Ramblings of an aging IT geek
debugging

the disk wasn't full, /var was

A service that fell over not because the root disk was full but because /var was on its own partition and a runaway log had quietly filled it.

[Image: a terminal showing a disk-full error from a logging daemon]

A service stopped accepting writes one afternoon, and the first thing everyone said was "disk's fine, look", waving df -h / at me: root at about eighty percent used, with gigabytes to spare. And the root filesystem genuinely was fine. The problem was that /var lived on its own partition, a tidy decision someone made years ago, and /var was at 100%.

A logfile had run away. Some library had started emitting a warning on every single request, the request rate was healthy, and a few hours of that was enough to fill a partition that had been comfortable for years. Postgres couldn't write its WAL, the app couldn't write its own logs, and everything ground to a polite halt with errors that all pointed at "disk" without saying which one.

The tell is always df -h with no argument, the whole table, not the one mount you assumed was the culprit. There it was: / relaxed, /var wedged.

$ df -h | sort -k5 -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        40G   31G  6.8G  82% /
/dev/sdb1       9.8G  9.8G     0 100% /var
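
Once df has named the mount, the next question is which file is eating it. Something like the following would do, as a rough sketch rather than what I actually typed that afternoon. The function name largest_files and the /var/log default are made up for illustration, and it assumes GNU find (for -printf).

```shell
# List the largest regular files under a directory, biggest first.
# largest_files is a hypothetical helper name; /var/log is an assumed default.
largest_files() {
    dir="${1:-/var/log}"
    count="${2:-10}"
    # -xdev stops find from wandering onto other mounts, so the sizes
    # belong to the filesystem you are actually worried about.
    # -printf '%s\t%p\n' prints size in bytes, a tab, then the path (GNU find).
    find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null \
        | sort -rn \
        | head -n "$count"
}
```

Run as largest_files /var 5, the runaway log should be sitting on the first line, with its size in bytes in front of it.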

Truncating the runaway file freed it instantly and the service recovered on its own. The real fix was duller: a logrotate rule that actually applied to that file, an alert on per-mount usage rather than just root, and a quiet word with the library that had decided every request was worth shouting about. Separate partitions save you from one runaway filling everything. They also let you stare at the wrong number for twenty minutes. Check every mount.
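
For the per-mount alert, the core of it can be very small. This is a minimal sketch, not the monitoring we ended up with: mounts_over is a hypothetical name, the 90 percent default is an arbitrary assumption, and it reads df -P output on stdin so it can be tried without a full disk to hand. It will misreport mount points that contain spaces.

```shell
# Print every mount at or above a usage threshold, one per line.
# mounts_over is a hypothetical helper; the default threshold of 90 is an assumption.
mounts_over() {
    threshold="${1:-90}"
    # df -P (POSIX format) keeps each filesystem on one line:
    # field 5 is the capacity (e.g. "100%"), field 6 is the mount point.
    # NR > 1 skips the header row.
    awk -v limit="$threshold" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)
        if (pct + 0 >= limit + 0) print $6, pct "%"
    }'
}
```

Fed with df -P | mounts_over 90, it prints nothing on a healthy box and one line per wedged mount otherwise, which is about the right shape for a cron job or a check script to act on.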