The outage caused by a full /var


The alert said the service was down. The logs said nothing, which is its own kind of clue. When a process that normally logs cheerfully goes completely silent, the question is not "why did it crash" but "why can it no longer write". The answer, as it so often is, was that /var was full.

df -h confirmed it: /var at 100%, zero bytes free. The application could not write its logs, could not write its lock files, and in this case could not write the temp files it needed to do actual work, so it had quietly given up. Nothing had crashed in the dramatic sense. It had simply run out of room to exist and stopped.


The culprit was the usual suspect. A log file that should have been rotating was not, because a logrotate config had a typo in the path and had been silently doing nothing for weeks. So one file had grown to a size that I will not embarrass myself by quoting, and eaten the partition. du -sh /var/log/* | sort -h found it in about three seconds once I thought to look.
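For anyone wanting the same hunt in reusable form, here is a minimal sketch. The function name largest_entries and its two parameters are my own invention for illustration, and it uses du -sk with a numeric sort rather than the -h variants, so it does not depend on GNU sort.

```shell
#!/bin/sh
# largest_entries DIR [N]: print the N largest entries directly under DIR,
# smallest first, with sizes in KiB. The name and parameters are illustrative.
largest_entries() {
    dir=$1
    n=${2:-5}
    # -s gives one total per argument, -k reports sizes in 1 KiB units.
    # Errors (unreadable directories, an empty DIR) are discarded.
    du -sk -- "$dir"/* 2>/dev/null | sort -n | tail -n "$n"
}
```

Calling largest_entries /var/log 5 should put the greedy file on the last line. For the config side, logrotate has a -d flag that runs in debug mode without touching any logs, and its output should make it plain when a pattern matches no files at all; I would treat that as a check worth running after any edit to a rotation config.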

The immediate fix was undignified but effective. I truncated the offending file rather than deleting it, because the process still had it open, and deleting an open file only gives you the space back once the process closes it, which was not the plan:

: > /var/log/the-greedy-one.log

That freed the space, the service noticed it could write again, and recovered on its own without a restart. Total downtime was longer than it should have been only because I spent the first ten minutes looking for a clever cause when the boring one was sat in df.
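The truncate-in-place behaviour is easy to see on a scratch file. This is a sketch under one assumption: that the writer opened its log in append mode, as most loggers do. The scratch path comes from mktemp and file descriptor 3 stands in for the long-running process.

```shell
#!/bin/sh
# Scratch file standing in for the log; nothing here touches /var/log.
f=$(mktemp)

# Hold the file open for appending on descriptor 3, as a logger would.
exec 3>>"$f"
echo "some log output" >&3
before=$(($(wc -c < "$f")))

# Truncate in place: the file keeps its inode, so descriptor 3 stays valid.
: > "$f"
after=$(($(wc -c < "$f")))

# The writer carries on, and its output lands at the start of the emptied file.
echo "still writing" >&3
resumed=$(($(wc -c < "$f")))

# Clean up the descriptor and the scratch file.
exec 3>&-
rm -f "$f"

echo "before=$before after=$after resumed=$resumed"
```

The size drops to zero straight after the truncate and then grows again from the writer's next line. Had the file been opened without append mode, the writer would keep its old offset and the file would come back as a sparse file of roughly its former apparent size, which is worth knowing before relying on this trick.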

The actual lesson is not "disks fill up"; everyone knows disks fill up. It is that I had no alerting on disk usage, so the first I knew of a slow problem was the sudden total failure at the end of it. A full disk is the most predictable outage there is. It announces itself for days if you are watching, and I was not.

So the fixes were dull and correct. Fix the logrotate path so it actually rotates. Add a disk-usage alert that fires at eighty per cent, well before anything breaks, so the next one is a calm afternoon job rather than an outage. And put /var/log on its own filesystem, so that filling it cannot take the rest of the system down with it. None of that is clever. The whole episode is a reminder that most outages are not exotic. They are something obvious that nobody was looking at.
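The eighty per cent alert does not need a monitoring stack to get started. Below is a minimal sketch of the check itself; the function names usage_percent and check_disk are hypothetical, and how the warning gets delivered (cron mail, a pager, a chat hook) is left out because that depends entirely on the setup.

```shell
#!/bin/sh
# usage_percent MOUNT: print the used percentage of the filesystem holding MOUNT.
usage_percent() {
    # -P forces the POSIX one-line-per-filesystem layout; column 5 is capacity.
    df -P -- "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk MOUNT [THRESHOLD]: return 1 and warn on stderr when usage is at
# or above THRESHOLD (default 80), 2 if usage could not be read, 0 otherwise.
check_disk() {
    mount=$1
    threshold=${2:-80}
    used=$(usage_percent "$mount")
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$threshold" ]; then
        echo "WARNING: $mount is at ${used}% (threshold ${threshold}%)" >&2
        return 1
    fi
    return 0
}
```

Something like check_disk /var 80 run every few minutes from cron would have turned this outage into a warning days earlier. The non-zero exit status is there so whatever wraps the script can decide how loudly to complain.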