the outage that was just a full /var

A bug in a terminal

The page said the API was returning 500s. Not all of them, which is always worse than all of them, because "everything is down" points you somewhere and "some things are down sometimes" points you nowhere. By the time I'd got to a laptop the error rate had wandered back down on its own, then climbed again, then settled into a sawtooth that looked almost deliberate. It was not deliberate. It was a disk filling up, draining slightly, and filling again.

I'll spoil the ending because the ending isn't the interesting part. /var was full. The interesting part is how many different and unrelated-looking ways a full /var chooses to tell you, and how long I spent chasing each of those symptoms before I checked the one thing I should have checked first.

the symptoms, none of which said "disk"

The first thing I actually looked at was the application logs, which is to say I tried to. The log file had stopped being written to about forty minutes earlier. No errors, no rotation, it just stopped, mid-line. A logging library that silently gives up when it can't write is a special kind of unhelpful, and I made a note to be cross about that later.

So I went to the database, because the 500s smelled like the database. The application couldn't get a connection, sometimes. Postgres was up. Postgres was also unhappy, because its own logs and its WAL live under /var, and a database that can't write its write-ahead log is a database that stops accepting writes. The errors in the client weren't lying, they were just two steps removed from the cause.

A terminal full of errors

Then the SSH session into a second box hung on login. Took me a moment. pam_systemd and friends want to write to /var/log, sudo wants to write to /var/log too, and a login that can't journal anything can sit there for an awkwardly long time before it decides to let you in anyway. At which point my own keystrokes felt like they were travelling by post.

The cherry on top: the monitoring agent that should have shouted "disk at 100%" had itself stopped reporting, because its spool directory was, you've guessed it, under /var. The one tool whose entire job was to warn me about this had been taken out by the very thing it was meant to warn me about. There's a design lesson in there and it isn't a comfortable one.

actually finding it

Once I stopped following symptoms and ran the boring command, it took five seconds:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G   20G   20K 100% /

One filesystem, no separate mount for /var, and that's its own story. Now, what was eating it. du on a slow, busy box is a good way to make a bad day worse, so I aimed it:

$ sudo du -x -h --max-depth=1 /var | sort -rh | head
14G   /var/lib/docker
3.2G  /var/log
2.1G  /var/lib/postgresql

/var/lib/docker. Of course. Months of docker build on this host had left a graveyard of dangling images, stopped containers, and orphaned volumes that nothing was ever going to reference again. And /var/log at 3.2G was a logrotate config that rotated but never actually compressed or deleted, so it had simply been accumulating since the box was built.

Disk usage broken down by directory

the fix, in the right order

First, buy headroom without deleting anything I might want, so I could think clearly. Truncate the largest, safe-to-truncate offenders:

$ sudo journalctl --vacuum-size=200M
$ sudo find /var/log -name '*.log.[0-9]*' -delete

That bought a couple of gigabytes and brought the box back to life almost immediately. Postgres started writing its WAL again, the application's logger reconnected to a file it could write to, and the 500 rate dropped to zero. The relief is always disproportionate to the cause.

Then, the real culprit, carefully, because docker system prune with the wrong flag deletes things you wanted:

$ docker system prune --volumes
$ docker image prune -a --filter "until=720h"

That reclaimed twelve gigabytes. I watched the volume filter twice before pressing return, because deleting a named volume that turns out to hold real data is how a disk-space incident becomes a data-loss incident.

what I changed afterwards

The incident was an afternoon. The follow-up was the useful bit.

/var now has its own filesystem, or at least a quota, so a runaway log can't take the root partition and everything that depends on it down with it.
Disk alerts fire at 80%, not 95%, and the alerting path does not store its own state under the partition it is watching. That was the genuinely embarrassing one.
A weekly docker system prune on build hosts, scheduled, because I am not to be trusted to do it by hand.
The logging library now fails loudly, or at least to stderr, when it can't write. A logger that goes quiet exactly when something is wrong is worse than no logger at all.

None of this is clever. A full disk is the most boring outage there is. But it taught me, again, that the symptom and the cause can be a long way apart, and that the fastest path through an incident is usually to stop reading the smoke and go and check whether anything is actually on fire. df -h first. Always df -h first.