Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the night /var filled and took everything with it

A full /var partition that took down a host in ways that had nothing obvious to do with disk, the cascade of failures it caused, and how I now keep /var from being a single point of failure.

A terminal showing a disk-full error

The alert that came in was not "disk full". It was a tangle of unrelated-looking failures: the application couldn't accept new connections, the database was logging errors about being unable to write, systemd was refusing to start units, and sudo was taking ten seconds and then complaining. Four different subsystems, four different error messages, no obvious thread connecting them. I spent the first twenty minutes treating it as four problems, which is exactly the wrong instinct and exactly the one a full disk encourages, because the symptom never points at the cause.

The cause was that /var was at 100%. Not the root filesystem, which had plenty of room, just /var on its own partition, which on this box was where the logs, the journal, the database, the package cache, the spool, and roughly everything that writes to disk during normal operation all lived. And when /var is full, an astonishing number of things break in ways that don't say "disk".

why a full /var breaks things that aren't about disk

Once I knew it was /var, the four problems collapsed into one. They were all the same problem wearing different masks.

The database couldn't write because its data files are under /var/lib. Fair enough, that one at least mentioned disk. systemd couldn't start units because it writes to the journal under /var/log/journal and a journal that can't be written can wedge unit startup. sudo was slow and grumpy because it logs through the system logger, which was blocking on a full disk and timing out. And the application couldn't accept connections because it was trying to write a log line per connection and blocking on the write. None of them said "the disk is full" in a way that survived being read at two in the morning. They said "I can't do my job", and the reason was the same for all of them.

$ df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/mapper/vg-var  20G   20G   0G  100% /var

$ journalctl -xe
Failed to write to journal: No space left on device

That second line was the Rosetta Stone. The moment I saw No space left on device from the journal, every other symptom rearranged itself into a single cause.

Code on a monitor showing a df command and disk output

what actually filled it

du -xh --max-depth=1 /var | sort -h pointed straight at /var/log, and within that, one application log that had grown to gigabytes because, you've guessed it, it wasn't being rotated. A debug flag had been left on after an investigation weeks earlier. The app logged a verbose line per request, the volume climbed steadily, and one ordinary Tuesday evening it crossed the line and took the host with it.

Recovery itself was quick once I could see it. Truncate the offending log so I had breathing room, restart the logger and the journal so the rest of the system unstuck itself, then turn the debug flag back off and rotate the log properly. The clearing took five minutes. The diagnosis took forty, which is the wrong ratio, and the whole point of the post is to fix that ratio for next time.

# breathing room first, properly second
truncate -s 0 /var/log/app/debug.log
systemctl restart systemd-journald
df -h /var

truncate rather than rm, because the app still had the file open and deleting it would just have given me the invisible-inode problem instead of fixing anything; the space wouldn't come back until the process let go.

stopping it being a single point of failure

A full /var is one of those failures where the cleanup is trivial and the prevention is the whole job. A few things changed after this.

Monitoring on disk that alerts on the trend, not just the threshold. Alerting at 90% full is too late if the thing fills in an hour, and useless if it fills over a month with nobody looking. What I actually want is "this partition will be full in under N days at the current rate", which catches both the slow creep and the sudden spike. The flat 90% threshold stays as a backstop.

Log rotation as a non-negotiable, with a hard maxsize so no single log can run away regardless of the time-based schedule. The original setup rotated daily, which is no help at all when a log can grow by twenty gigabytes between midnights.

And a small thing that buys a lot of grace: a reserved block percentage, or a placeholder file you can delete to instantly free space when you've already filled the disk. The horrible bit of a full-disk outage is that you sometimes can't even fix it, because the tools you need to fix it also want to write somewhere. Knowing there's a 500MB file you can rm to get yourself moving again turns a panic into a procedure.

The lesson I keep relearning is that disk pressure is a liar. It never tells you it's a disk problem. It tells you the database is broken, the app is broken, sudo is broken, systemd is broken, and lets you waste twenty minutes chasing four ghosts before you think to run df. So now df -h is the first command I run on any host misbehaving in more than one place at once, before I've even formed a theory. It costs nothing, and roughly one time in five it ends the investigation before it starts.