How a single full filesystem took down a perfectly healthy service

The first symptom was not "disk full". It is almost never "disk full", which is exactly why a full filesystem makes such a miserable outage. The first symptom was the application refusing new connections, then the database complaining it could not write, then systemd unable to start a unit it had restarted a thousand times before, then me unable to log in cleanly because even the shell wanted to write something somewhere and could not. A dozen unrelated-looking failures, all radiating out from one cause that none of them named.

$ df -h /var
Filesystem          Size  Used Avail Use% Mounted on
/dev/mapper/vg-var   20G   20G   20K 100% /var

There it is. Twenty kilobytes free on a twenty-gigabyte filesystem, which is to say full. And /var being full is uniquely nasty because of everything that lives there: logs, databases, spool directories, the journal, package caches, container storage, lock files, sockets. When /var cannot take another byte, an enormous number of things that have nothing to do with each other all discover they cannot do their jobs at the same moment.

The cascade

Watching it fail was almost educational, in the way a fire is educational if it is not your house.

The database went read-only because it could not extend its write-ahead log. The application threw connection errors because the database threw write errors. The logging that would normally have told me all this could not be written, because the place it writes is the thing that is full, so the logs about the disk being full did not exist. systemd could not start the failed services because starting them involves writing state under /var, so my reflexive restart did nothing but produce more errors I could not see. Even my SSH session was sluggish because the journal could not flush.

This is the trap of a full disk: it disables the very tools you reach for to diagnose it. You cannot reliably log, you cannot reliably write a temp file, and half your instinctive recovery moves quietly fail because they all assume they can write.

Finding what ate the space

Once I accepted the actual problem, finding the culprit was quick. You want to ask which directories are largest, using a tool that does not itself need to write much:

du -xh --max-depth=2 /var 2>/dev/null | sort -rh | head -20

The answer was a single application log that had grown to several gigabytes because logrotate had silently stopped rotating it. Three months earlier someone had changed the log path in the app config, the logrotate rule still pointed at the old path, and so the new file grew without bound while logrotate dutifully rotated a file that no longer received writes. No error, no warning, just a steadily growing file nobody was looking at, ticking towards the cliff.

Getting back up

The immediate move is to free just enough space to let the machine breathe, then fix things properly. I truncated the runaway log in place rather than deleting it. The application still had the file open, and deleting an open file only removes its directory entry; the space stays allocated until the process closes the handle:

truncate -s 0 /var/log/theapp/current.log

That gave back gigabytes instantly. The database came out of read-only on its own once it could write again, the services restarted cleanly, and the logs reappeared because there was somewhere to put them. The whole recovery, once I had correctly identified the cause, took less than five minutes. Identifying the cause had taken twenty, because of all the misleading symptoms pointing everywhere except at the disk.
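The deleted-but-still-open case is worth being able to spot, because it produces the confusing situation where du and df disagree about how much space is in use. Here is a minimal sketch for finding such files, assuming Linux with /proc mounted. The function name is mine, and without root it will only see processes you own.

```shell
#!/bin/sh
# Sketch: list file descriptors that still point at deleted files.
# Assumes Linux /proc; list_deleted_open_files is an illustrative name.
# Pass a PID to inspect one process, or nothing to scan every readable one.
list_deleted_open_files() {
    # The unquoted expansion is deliberate: with no argument the default
    # glob [0-9]* expands to every numeric directory under /proc.
    for fd in /proc/${1:-[0-9]*}/fd/*; do
        # Each fd entry is a symlink; unreadable or vanished ones are skipped.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            # The kernel appends " (deleted)" once the name has been removed.
            *' (deleted)') echo "$fd -> $target" ;;
        esac
    done
}
```

Running list_deleted_open_files as root shows which process is pinning the space, at which point you can restart it or truncate through the /proc path. If lsof is installed, lsof +L1 reports the same thing.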

Stopping it happening again

Three changes, of which the second matters most.

First, I fixed the logrotate rule and added a check that screams if any rule references a path that does not exist, because a rotation rule pointing at a stale path is worse than no rule at all: it looks like protection while protecting nothing.
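A minimal sketch of that kind of check, assuming the rules live in ordinary logrotate config files and that log paths contain no spaces or quotes. The function names are mine, not part of logrotate, and the parsing is deliberately crude: it only looks at lines outside a rule body that start with an absolute path.

```shell
#!/bin/sh
# Sketch: flag logrotate rules whose path patterns match nothing on disk.
# Function names are illustrative; this is not a logrotate feature.

# Print every path pattern that opens a rule in the given config files.
list_logrotate_patterns() {
    awk '
        /^[ \t]*#/ { next }                    # skip comment lines
        !in_block && $1 ~ /^\// {
            for (i = 1; i <= NF; i++) {
                tok = $i
                sub(/[{]$/, "", tok)           # tolerate "path{" with no space
                if (tok ~ /^\//) print tok
            }
        }
        /[{][ \t]*$/ { in_block = 1 }          # rule body starts
        /^[ \t]*[}][ \t]*$/ { in_block = 0 }   # rule body ends
    ' "$@"
}

# Succeed if the pattern matches at least one existing path.
pattern_matches() {
    # $1 is left unquoted on purpose so the shell expands any glob in it.
    for candidate in $1; do
        [ -e "$candidate" ] && return 0
    done
    return 1
}

# Print one line per pattern that matches nothing.
find_stale_patterns() {
    list_logrotate_patterns "$@" | while read -r pattern; do
        pattern_matches "$pattern" || echo "stale logrotate path: $pattern"
    done
}
```

Something like find_stale_patterns /etc/logrotate.conf /etc/logrotate.d/* can then run from cron or CI and alert on any output. It will also flag rules for logs that are legitimately absent (logrotate's missingok case), so treat the output as a prompt to look rather than proof of a bug.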

Second, and this is the real fix, I added disk-usage alerting that warns at 80 percent and pages at 90, on every filesystem, not just the root one. The outage was entirely preventable with a warning a day or two earlier, when there was time to act and the tools still worked. A full disk should never be a surprise; the disk spent days getting full and told nobody.
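The threshold logic itself is tiny. Here is a rough sketch using the 80 and 90 percent figures above; the function names and output format are my own invention, and in practice you would more likely configure the same thresholds in whatever monitoring system you already run.

```shell
#!/bin/sh
# Sketch: classify filesystems by usage. Thresholds follow the post;
# function names and output format are illustrative assumptions.
WARN_PCT=80
PAGE_PCT=90

# classify_usage PCT: print "ok", "warn" or "page".
classify_usage() {
    if [ "$1" -ge "$PAGE_PCT" ]; then
        echo page
    elif [ "$1" -ge "$WARN_PCT" ]; then
        echo warn
    else
        echo ok
    fi
}

# Read `df -P` output on stdin and print one line per filesystem at or
# above the warning threshold.
check_filesystems() {
    read -r header_line                    # discard the header row
    # df -P columns: filesystem, blocks, used, available, capacity, mount.
    # The last variable absorbs mount points that contain spaces.
    while read -r fs blocks used avail capacity mount; do
        pct=${capacity%"%"}                # strip the trailing percent sign
        case $pct in
            ''|*[!0-9]*) continue ;;       # skip rows with no numeric capacity
        esac
        level=$(classify_usage "$pct")
        [ "$level" = ok ] || echo "$level $mount ${pct}%"
    done
}
```

Feeding it with df -P | check_filesystems covers every mounted filesystem rather than only the root one, which is the point of the exercise.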

Third, I separated the noisiest write-heavy directories onto their own volumes where it made sense, so that a runaway log fills its own filesystem and degrades one thing, rather than filling shared /var and taking down the database, the application and my ability to log in all at once. Blast radius matters. A full disk is survivable; a full disk under everything important is an outage.

The lesson I keep is almost too simple to write down: monitor free space before it runs out, because once it has run out the machine takes away the tools you would use to fix it. Everything else that day was downstream of one number quietly climbing towards 100 percent with nobody watching it.