The symptom was that a service started throwing errors that made no sense. Writes failing, then logins failing, then the whole thing falling over in a way that didn't match any code path we'd touched in weeks. Classic misdirection. The cause was that /var was completely full, and everything downstream of that was just the application flailing about with nowhere to write.
Here's the part that stings. We had disk monitoring. It alerted on the root volume crossing 90%. It did not alert on /var as a separate mount, because at some point someone had given /var its own volume and nobody had told the monitoring. So the dashboard sat there smugly green while the partition that actually mattered hit 100%.
the walk through it
First clue was df, which I should have run sooner than I did:
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G  9.1G  9.9G  48% /
/dev/sdb1        10G   10G     0 100% /var
48% on root, which is why the alert never fired. 100% on /var, which is why nothing worked. Then the hunt for what ate it:
$ du -xh /var --max-depth=3 | sort -rh | head
8.7G /var/log/journal
8.6G /var/log/journal/3f2a...
The systemd journal. No size cap, a service that had recently started logging far more verbosely after a deploy, and weeks for it to quietly fill a ten-gigabyte volume. The application had no idea why its writes were failing because, from inside the process, "No space left on device" looks like the world has gone mad.
the fix, and the real fix
The immediate fix was the obvious one, vacuuming the journal to get some breathing room:
$ journalctl --vacuum-size=2G
Service came back within seconds of there being space again. Two hours of incident, thirty seconds of remedy, which is the usual ratio for this kind of thing.
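One caveat about that command: according to the journalctl man page, the vacuum options only delete archived journal files, so an active file that has grown large is left alone. Below is a hedged sketch of a slightly more thorough version, which rotates first so the active files become archived and therefore eligible. The function name free_journal_space is made up for illustration, the 2G default only mirrors the command above, and it assumes you are root on a host running systemd-journald.

```shell
#!/bin/sh
# Sketch only: free_journal_space is a hypothetical helper name.
# Assumes root privileges and a host that uses systemd-journald.
free_journal_space() {
    target="${1:-2G}"   # default mirrors the value used during the incident

    # Rotate first: vacuuming only removes archived journal files, so the
    # active files have to be archived before they can be deleted.
    # On a completely full disk the rotate itself may fail; carry on anyway.
    journalctl --rotate || echo "rotate failed, vacuuming archived files only" >&2

    # Delete the oldest archived files until usage drops below the target.
    journalctl --vacuum-size="$target" || return 1

    # Report how much space the journal occupies now.
    journalctl --disk-usage
}
```

Calling free_journal_space with no argument uses the 2G target; free_journal_space 500M would be more aggressive.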
The real fix was three changes, none of them clever:
- A SystemMaxUse= cap in journald.conf so the journal can never do this again.
- Monitoring that watches every mounted filesystem by discovery, not a hand-maintained list of paths that goes stale the moment someone adds a volume.
- A line in the runbook telling future-me to run df in the first sixty seconds of any "weird errors" incident, before forming any theories.
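To make the "by discovery" idea concrete, here is a minimal sketch rather than what our monitoring actually runs. It takes whatever df reports as mounted and flags anything at or above a threshold, so a newly added volume is covered without anyone editing a list. The function name full_mounts and the default of 90 are illustrative assumptions. It also assumes mount points contain no spaces.

```shell
#!/bin/sh
# Sketch only: full_mounts and the default threshold of 90 are illustrative.
# Reads `df -P` style output on stdin and prints the mount point and usage
# of every filesystem at or above the threshold (a percentage).
# Limitation: mount points containing spaces are not handled.
full_mounts() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR == 1 { next }                  # skip the header row
        {
            pct = $5
            sub(/%/, "", pct)             # "100%" -> "100"
            if (pct + 0 >= limit + 0) print $6, $5
        }
    '
}

# Usage: let df enumerate the mounts instead of naming paths by hand.
#   df -P | full_mounts 90
```

Fed the df output from this incident, it would have printed the /var line and nothing else. In practice you would probably filter out pseudo-filesystems such as tmpfs before alerting on the result.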
The lesson isn't "watch your disks"; everybody knows to watch their disks. The lesson is that monitoring you configured once and never revisited will happily keep watching the wrong thing for years, and stay green the whole time it's lying to you. A full disk is one of the oldest failures there is, and it still got us, because the alert was looking three inches to the left of where the fire was.