Ramblings of an aging IT geek
debugging

the outage nobody saw coming, because /var was full

A service stopped accepting writes one afternoon, and the cause was a full /var partition nobody had thought to monitor.

The service stopped accepting writes at about half three on a Tuesday. No crash, no panic in the logs, just a steady stream of 500s and a request queue that climbed and never came down. The application logs said almost nothing, which is usually a clue in itself: if the app can't even write its own error, the problem is underneath the app.

It was. df -h told the whole story in one line. /var at 100%. The logger had been quietly writing a stack trace per failed request, each failure begetting the next, and somewhere in there the database's WAL had nowhere left to go.

The annoying part is that we monitored disk on /, which had plenty of room. /var was a separate partition, as it should be, and absolutely nobody had added it to the check. So the dashboard stayed reassuringly green whilst the box quietly seized up.
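A check that enumerates the mounts, rather than hard-coding /, is not much code. Here is a minimal Python sketch, assuming a Linux box where /proc/mounts lists what is mounted. The function names, the 90% threshold and the set of pseudo-filesystems to skip are illustrative choices, not what any particular monitoring agent does.

```python
import shutil

# Pseudo-filesystems with no meaningful capacity (illustrative, not exhaustive).
SKIP_FSTYPES = {"proc", "sysfs", "devtmpfs", "devpts", "tmpfs", "cgroup", "cgroup2"}


def mount_points(mounts_file="/proc/mounts"):
    """Return the mount points of non-pseudo filesystems in a Linux mounts table."""
    points = []
    with open(mounts_file) as fh:
        for line in fh:
            # Fields: device, mount point, fstype, options, dump, pass.
            # Note: spaces in mount points appear escaped as \040 and are not decoded here.
            fields = line.split()
            if len(fields) < 3:
                continue
            mount_point, fstype = fields[1], fields[2]
            if fstype not in SKIP_FSTYPES:
                points.append(mount_point)
    return points


def full_filesystems(points, threshold=0.90, usage=shutil.disk_usage):
    """Return (mount_point, fraction_used) for each filesystem at or above threshold."""
    alerts = []
    for point in points:
        try:
            stats = usage(point)
        except OSError:
            # Unreadable or vanished mount: skipped in this sketch,
            # though a real check might prefer to alert on it.
            continue
        if stats.total == 0:
            continue
        fraction_used = stats.used / stats.total
        if fraction_used >= threshold:
            alerts.append((point, fraction_used))
    return alerts
```

Called as full_filesystems(mount_points()) from cron, or from whatever does the alerting, this reports /var alongside / once either crosses the threshold. The fraction is used over total, so it can read slightly lower than the Use% column in df on filesystems that reserve blocks for root.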

Cleared the worst of the logs, writes resumed inside a minute, queue drained. Then the actual work: a df check across every mounted filesystem rather than just the root one, a log rotation policy that doesn't depend on someone remembering, and a note to self that "we monitor disk" is not the same as "we monitor the disk that fills up". They are never the same partition. It's a law.
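On the rotation point, one way to make the cap automatic is to have the application bound its own log files. This is a sketch only, assuming a Python service using the standard logging module, which may not match the logger in this story; the function name, the logger name and the sizes are hypothetical. A system-level tool such as logrotate is the other usual route.

```python
import logging
import logging.handlers


def build_capped_logger(path, max_bytes=10 * 1024 * 1024, backups=5):
    """Return a logger whose files stay within roughly max_bytes * (backups + 1).

    The bound holds as long as no single message is larger than max_bytes.
    """
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger("capped-example")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep records out of the root logger's handlers
    return logger
```

With the defaults that is about 60 MB of logs at most, however many stack traces a bad afternoon produces, because the oldest backup is deleted on each rollover.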