The page came in at twenty to one in the morning, which is the hour reserved for problems that are entirely your own fault from six weeks ago. Half the platform was throwing errors, the database was refusing writes, and a service that had been rock solid for a year was suddenly unable to do the one thing it existed to do. The cause, once I found it, was the most boring sentence in operations: /var was full.
The blast radius of a full disk
What surprised me was not that a disk filled up. Disks fill up. What surprised me was how much fell over because of it, and how far the symptoms travelled from the cause.
When /var fills, everything that wants to write there starts failing, and an astonishing amount of software wants to write to /var. The database keeps its data there and went read-only rather than risk corruption. The system journal could not write, so the very logs I would use to diagnose the problem stopped recording at the worst possible moment. A couple of services that buffer to a spool directory began throwing errors, then crash-looping, then taking their dependents down with them. One full filesystem, and the failure rippled outward into a dozen unrelated alerts that looked for all the world like a serious incident.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        20G   20G     0 100% /var
That zero in the Avail column is the whole story. Everything downstream of it is noise.
Finding the culprit
The temptation at one in the morning is to start deleting things to make the pain stop. Resist it, or at least aim before you fire. The fastest honest answer to "what ate the disk" is to walk the tree by size:
$ sudo du -x -h -d1 /var | sort -h | tail
1.2G /var/cache
3.4G /var/spool
15G /var/log
Fifteen gigabytes of logs on a twenty gigabyte volume. Down into /var/log and there it was: a single application log, rotated never, fed by a debug flag that somebody, and by somebody I mean me, had flipped on during an investigation in May and never flipped back. It had been growing quietly at a few hundred megabytes a day ever since, a slow tide nobody was watching, until it touched the ceiling and everything that shared the volume drowned at once.
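For the step from "which directory" to "which file", something like the following is a minimal sketch rather than the exact commands I ran that night. It assumes GNU find (for -printf), and largest_files is a hypothetical helper name made up for illustration. It lists the biggest regular files under a directory without wandering onto other filesystems.

```shell
#!/usr/bin/env bash
# Sketch only: largest_files is a hypothetical helper, and -printf assumes GNU find.
# Prints "<size in bytes><TAB><path>" for the N largest regular files under a directory.
largest_files() {
    local dir=$1 count=${2:-10}
    # -xdev keeps find on one filesystem, like du -x.
    # %s is the apparent file size in bytes, so sparse files can look bigger than they are.
    find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null \
        | sort -rn \
        | head -n "$count"
}
```

Run from a root shell (or as any user that can read the tree), `largest_files /var/log 5` would put a runaway log at the top of the list.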
I truncated it rather than deleting it, because deleting a log that a process still holds open gives you the worst of both worlds: the file vanishes from the directory, but the space is not released until the process closes it or restarts, and you have spent your one clever move achieving nothing.
$ sudo truncate -s 0 /var/log/app/debug.log
Space came back instantly, the database came out of read-only on its own, the crash-looping services settled, and the cascade unwound in about the order it had wound up. Total time from page to green was maybe fifteen minutes, of which fourteen were spent finding the file and one was spent fixing it. That ratio is the usual one.
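If you suspect somebody has already fallen into the delete-an-open-log trap, the space is being held by files that no longer have a name. The sketch below is one way to spot them, assuming a Linux /proc filesystem; deleted_open_files is a hypothetical helper name, not a standard tool, and without root it only sees your own processes.

```shell
#!/usr/bin/env bash
# Sketch only, Linux-specific: relies on /proc/<pid>/fd symlinks, whose targets
# end in " (deleted)" when the file has been unlinked but is still open.
# deleted_open_files is a hypothetical helper name.
# Prints "<pid><TAB><fd><TAB><target>" for each match under an optional path prefix.
deleted_open_files() {
    local prefix=${1:-/} link target pid fd
    for link in /proc/[0-9]*/fd/*; do
        # Processes can exit mid-scan and unreadable entries fail; skip both quietly.
        target=$(readlink "$link" 2>/dev/null) || continue
        case "$target" in
            "$prefix"*' (deleted)')
                pid=${link#/proc/}
                pid=${pid%%/*}
                fd=${link##*/}
                printf '%s\t%s\t%s\n' "$pid" "$fd" "$target"
                ;;
        esac
    done
}
```

Called as `deleted_open_files /var`, it would list the processes still pinning deleted files on that path. Where lsof is installed, `lsof +L1` reports the same class of unlinked-but-open files.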
What I changed so future-me sleeps
Three things, none of them clever, all of them things I should have had already.
- Turned the debug flag off, obviously, and then went looking for everywhere else I might have left one on.
- Put the offending log under logrotate with a size cap and a sane retention, so it can never again grow without bound.
- Added a disk-space check that pages at eighty percent, not at a hundred. The whole tragedy here is that the only alert was the outage itself. A filesystem at eighty percent is a quiet word in your ear. A filesystem at a hundred percent is a fire.
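The eighty percent check can be as small as the sketch below. It assumes a POSIX df (the -P flag pins the column layout), and usage_percent and check_disk are hypothetical names for illustration. How a non-zero exit becomes a page depends on whatever monitoring you run, so that part is left out.

```shell
#!/usr/bin/env bash
# Sketch only: usage_percent and check_disk are hypothetical helper names.

# Print the used percentage (digits only) of the filesystem holding a path.
usage_percent() {
    # df -P guarantees one line per filesystem with capacity in column 5.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Exit 0 if below the threshold, 1 if at or above it, 2 if usage could not be read.
check_disk() {
    local mount=$1 limit=${2:-80} used
    used=$(usage_percent "$mount")
    case "$used" in
        '' | *[!0-9]*)
            echo "UNKNOWN: could not read usage for $mount"
            return 2
            ;;
    esac
    if [ "$used" -ge "$limit" ]; then
        echo "WARN: $mount at ${used}% (threshold ${limit}%)"
        return 1
    fi
    echo "OK: $mount at ${used}%"
}
```

Something like `check_disk /var 80` from cron or a monitoring agent, with exit code 1 wired to a page, gives you the quiet word in your ear instead of the fire.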
The real lesson is duller and more useful than any of the fixes. A debug flag is a loan against your future, and I forgot I had borrowed. The thing that took the platform down was not a bug or an attack or a clever cascade; it was a decision I made in good faith one afternoon and then never revisited. So now the rule is simple: anything I switch on for an investigation gets a reminder to switch it off, and anything that writes to disk forever gets a rotation policy before it writes a single byte. Boring insurance against boring disasters, which are the ones that actually get you.