The outage announced itself as a service that had simply stopped doing its job. Requests timed out. The logs, when I could get to them, were a mess of errors that didn't agree with each other: a database connection that couldn't write, a session store that couldn't save, a temp file that couldn't be created. Three different subsystems failing three different ways at the same moment. As with most outages where everything breaks at once, the truth was that one thing broke and dragged the rest down with it.
The one thing was disk. /var was full. Not nearly full, completely full, zero bytes free, and a great many things on a Linux box quietly assume they can always write a few bytes somewhere under /var. When they can't, they don't fail cleanly with "disk full". They fail in whatever creative way their error handling allows, which is how you end up with a database error, a session error and a temp-file error all pointing at a cause that's none of them.
finding it
The diagnosis took about ninety seconds once I thought to look, which is the galling part. df told the whole story:
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G   18G     0 100% /
/dev/sdb1        50G   12G   35G  26% /var/lib/data
Root at a hundred percent, which meant /var along with it because it wasn't on its own partition on this box. The next question is always "full of what", and du answers it, though you have to point it at the right place and let it churn:
$ du -xh /var --max-depth=2 | sort -rh | head
14G /var/log
13G /var/log/oldservice
...
There it was. A service we'd decommissioned months earlier had been removed from the running config but its log directory had stayed, and something, a leftover cron job, was still appending to it. No rotation, no cap, just a file growing by a few hundred megabytes a day, invisibly, until the day it crossed the line and took the box with it.
the actual fault wasn't the log
Truncating the file got the service breathing again within seconds. I used truncate -s 0 on the offending log rather than rm, because something still had it open, and deleting an open file only removes its name: the space stays allocated until the writer closes the file or is restarted, which is its own confusing afternoon. Space freed, services recovered, outage over. But the log was never the real fault.
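The truncate-versus-rm distinction is easy to try on a scratch file. What follows is a minimal sketch, not the commands from the incident: the temp file and file descriptor 3 are stand-ins for the real log and its writer, and it assumes GNU coreutils (truncate, stat -c) and a writer that opens the file in append mode, as loggers usually do.

```shell
#!/usr/bin/env bash
# Sketch: truncating a log that a writer still holds open.
# Runs against a scratch file; fd 3 stands in for the long-running writer.
set -eu

log=$(mktemp)                       # stand-in for the runaway log
exec 3>>"$log"                      # the "writer" opens the file in append mode
head -c 1048576 /dev/zero >&3       # writer appends 1 MiB

size_before=$(stat -c %s "$log")

truncate -s 0 "$log"                # empty the file in place; its blocks are freed now
size_after=$(stat -c %s "$log")

echo "more" >&3                     # writer carries on, appending at the new end of file
size_resumed=$(stat -c %s "$log")

echo "before=$size_before after=$size_after resumed=$size_resumed"

exec 3>&-                           # close the writer's descriptor
rm -f "$log"
```

If someone has already reached for rm, lsof +L1 lists files that are deleted but still held open, which is where the missing space is hiding.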
The real fault was that a disk filling up over weeks is the most predictable failure there is, and we had no warning of it whatsoever. Disk usage doesn't spike. It creeps. It gives you days, sometimes weeks, of advance notice, and we'd built a monitoring setup that watched CPU and memory and request rates and said nothing at all about the one resource that was about to end us. Plotted, the usage would have been a graph that any human glancing at it would have read as "this ends badly on roughly the 28th".
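Reading the end date off a graph is just straight-line extrapolation, and it fits in a few lines. This is a rough sketch: the function name and the sample figures are invented for illustration, and real growth is rarely this tidy.

```shell
#!/usr/bin/env bash
# Sketch: straight-line forecast of when a filesystem fills up.
# Sizes can be in any unit (MiB here) as long as all three agree.

# days_until_full USED_BEFORE USED_NOW DAYS_BETWEEN CAPACITY
# Prints whole days remaining at the observed growth rate,
# or "never" if usage is flat or shrinking.
days_until_full() {
  awk -v prev="$1" -v cur="$2" -v days="$3" -v cap="$4" 'BEGIN {
    if (days <= 0) { print "never"; exit }
    rate = (cur - prev) / days          # growth per day
    if (rate <= 0) { print "never"; exit }
    left = (cap - cur) / rate           # days until usage reaches capacity
    if (left < 0) left = 0
    printf "%d\n", left                 # drop the fraction: err on the early side
  }'
}

# Hypothetical figures: 15000 MiB used a week ago, 16400 MiB today, 20480 MiB disk.
days_until_full 15000 16400 7 20480     # prints 20
```

Two samples a week apart are enough to turn "the disk is at eighty percent" into "the disk has about three weeks", which is the number a human actually acts on.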
what changed
Three things, and only one of them is interesting. The dull two: a disk-usage alert that fires at eighty percent and again at ninety, on every partition, so a slow creep gets caught with days to spare. And logrotate actually configured for everything writing under /var/log, with a size cap and a retention count, so no single log can grow without bound regardless of who's writing to it.
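The alert itself needs nothing clever. As a sketch of the two-stage check described above, assuming POSIX-format df -P output and mount points without spaces (the function name and the message wording are made up; a real setup would hand this to whatever pages you):

```shell
#!/usr/bin/env bash
# Sketch: two-stage disk-usage check over `df -P` output.
# Thresholds follow the text (80 and 90 percent); the output format is invented.

# check_usage [WARN] [CRIT]   reads `df -P` output on stdin
check_usage() {
  awk -v warn="${1:-80}" -v crit="${2:-90}" 'NR > 1 {
    pct = $5; sub(/%/, "", pct); pct += 0     # "100%" becomes the number 100
    mount = $6                                # assumes no spaces in the mount point
    if (pct >= crit)      print "CRITICAL " mount " " pct "%"
    else if (pct >= warn) print "WARNING " mount " " pct "%"
  }'
}

# Typical use: every mounted filesystem, one line per breach, silence otherwise.
df -P | check_usage 80 90
```

Fed the df output from the day of the outage, this would have printed a CRITICAL line for / and nothing for /var/lib/data.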
The interesting one is the decommissioning gap. The service was "gone" in the sense that mattered to whoever turned it off, but it had left bits of itself scattered around: a log directory, a cron entry, presumably other things I haven't found. Turning a thing off is not the same as removing it, and the difference is exactly the kind of debris that fills a disk eight months later. So decommissioning is now a checklist, not a vibe: stop it, remove its config, remove its cron, remove its logs, remove its user. Boring, and it would have meant this outage never happened.
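A checklist is easier to keep honest if something can check it. Here is a sketch of a leftovers check along those lines. The paths are conventional Linux locations rather than ones taken from this incident, it assumes the service's user shares its name, and the optional ROOT argument exists only so the check can be pointed at a scratch tree.

```shell
#!/usr/bin/env bash
# Sketch: look for debris a "removed" service may have left behind.
# Paths are conventional guesses; adjust them to wherever services live on your boxes.

# leftovers SERVICE [ROOT]
# Prints one line per finding; returns 0 if clean, 1 if anything was found.
leftovers() {
  local svc="$1" root="${2:-}" found=0 dir hits

  # Config and log directories named after the service.
  for dir in "$root/etc/$svc" "$root/var/log/$svc"; do
    if [ -e "$dir" ]; then
      echo "exists: $dir"
      found=1
    fi
  done

  # Cron files that still mention the service by name.
  hits=$(grep -rlF -- "$svc" "$root/etc/cron.d" "$root/etc/crontab" 2>/dev/null || true)
  if [ -n "$hits" ]; then
    echo "cron files mentioning $svc:"
    printf '%s\n' "$hits"
    found=1
  fi

  # A dedicated user account, assuming it is named after the service.
  if grep -q "^$svc:" "$root/etc/passwd" 2>/dev/null; then
    echo "user still present: $svc"
    found=1
  fi

  return "$found"
}
```

Running leftovers oldservice as the last step of a decommission, and refusing to call the job done until it returns 0, is one way to turn the checklist into something that fails loudly instead of eight months later.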
The lesson I keep relearning is that the loudest failures often have the quietest, slowest causes. A full disk doesn't arrive suddenly. It arrives on a schedule you could have read off a graph weeks earlier, if only you'd been drawing the graph.