The first sign was a service refusing to start for no obvious reason. The second sign, found two minutes later, was df reporting the root filesystem at 100 percent. The culprit was the systemd journal, which had grown to a touch over forty gigabytes on a box where I'd never given the journal a second thought. It hadn't done anything wrong. It had simply done exactly what it was configured to do, which by default is "use up to a sensible fraction of the disk", and on a small disk that fraction is still a lot.
This is one of those things that's invisible until it isn't. The journal grows slowly, in the background, and the day it matters is the day something else is already on fire. So this is the writeup I wish I'd read first.
Find out what you're dealing with
Start by asking the journal how much room it's taking:
journalctl --disk-usage
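On the offending machine that printed something close to 40G and I made a noise I'm not proud of. To see the breakdown over time, and confirm you've got runaway growth rather than one bad day, list the journal files with their sizes and dates. While you're there, --verify checks that none of those files are corrupt:
ls -lh /var/log/journal/*/
journalctl --verify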
If /var/log/journal exists, your logs are persistent and survive reboots. If they live under /run/log/journal instead, they're volatile and vanish on reboot, which is its own kind of problem but not a disk-filling one.
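The directory check above can be wrapped in a small function. This is a minimal sketch: the name journal_storage_mode and its optional root argument are my own inventions, and the argument exists only so you can try it against a scratch directory.

```shell
# Report where journald keeps its files under a given root directory.
# With no argument it inspects the real /var/log/journal and /run/log/journal.
journal_storage_mode() {
    if [ -d "${1:-}/var/log/journal" ]; then
        echo persistent   # survives reboots, can fill the disk
    elif [ -d "${1:-}/run/log/journal" ]; then
        echo volatile     # lives under /run, gone after a reboot
    else
        echo none
    fi
}
```

On a real host, calling journal_storage_mode with no arguments prints one of the three words.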
Reclaim the space right now
There are two immediate levers. Vacuum by size:
sudo journalctl --vacuum-size=500M
or by age:
sudo journalctl --vacuum-time=2weeks
Both are safe to run live. They delete archived journal files, oldest first, until the size or age limit is met, and leave the active journal files alone. On my box --vacuum-size=500M dropped the usage from 40G to under half a gig in a few seconds, and the disk pressure cleared instantly.
Stop it happening again
Vacuuming once is firefighting. The fix is in /etc/systemd/journald.conf, where you put a cap that the journal will respect on its own, forever. The keys I set:
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
SystemMaxFileSize=128M
MaxRetentionSec=1month
What each one actually does, because the names are easy to misread:
SystemMaxUse is the hard ceiling for the persistent journal. This is the one that matters. Set it and the journal will never exceed it, rotating out the oldest entries to stay under.
SystemKeepFree tells journald to always leave this much free on the disk, regardless of SystemMaxUse. It takes whichever constraint is tighter. Useful when the journal shares a disk with everything else, which it usually does.
SystemMaxFileSize caps individual journal files so rotation happens in reasonable chunks rather than one giant file you can't vacuum granularly.
MaxRetentionSec discards entries older than the window no matter how much space is free, which is handy if you genuinely don't care about last quarter's logs.
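To make "whichever constraint is tighter" concrete, here is a rough sketch of the arithmetic. It is a simplification of what journald does, not its actual code: the function name and arguments are hypothetical, everything is in whole MiB, and it ignores any minimum the journal reserves for itself.

```shell
# Arguments (all in MiB): SystemMaxUse, SystemKeepFree,
# current journal size, current free space on the filesystem.
effective_cap_mib() {
    max_use=$1; keep_free=$2; journal_now=$3; disk_free=$4
    # Space the journal may still grow into without eating into KeepFree.
    headroom=$(( disk_free - keep_free ))
    if [ "$headroom" -lt 0 ]; then headroom=0; fi
    by_keep_free=$(( journal_now + headroom ))
    # The tighter of the two limits wins.
    if [ "$by_keep_free" -lt "$max_use" ]; then
        echo "$by_keep_free"
    else
        echo "$max_use"
    fi
}
```

With the config above (1024 and 2048), a 300 MiB journal on a disk with 10000 MiB free is limited by SystemMaxUse. The same journal with only 2248 MiB free is limited to 500 MiB by SystemKeepFree instead.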
Apply it without a reboot:
sudo systemctl restart systemd-journald
journalctl --disk-usage
The usage figure should now sit comfortably under your cap and stay there.
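If you'd rather have a script confirm that than eyeball it, something like the following would do. It is a sketch under assumptions: the helper names are mine, the units are treated as 1024-based, and the exact wording of the --disk-usage line differs between systemd versions, so the parser just grabs the first thing that looks like a size.

```shell
# Pull the first size token (e.g. "40.0G") out of text on stdin.
extract_usage() {
    grep -oE '[0-9]+(\.[0-9]+)?[BKMGT]' | head -n 1
}

# Convert a size such as 500M or 1.5G to bytes (1024-based units).
size_to_bytes() {
    echo "$1" | awk '{
        unit = substr($0, length($0), 1)
        num  = substr($0, 1, length($0) - 1)
        mult["B"] = 1;      mult["K"] = 1024;   mult["M"] = 1024^2
        mult["G"] = 1024^3; mult["T"] = 1024^4
        printf "%.0f\n", num * mult[unit]
    }'
}

# Succeed if the usage (first argument) is at or below the cap (second).
under_cap() {
    [ "$(size_to_bytes "$1")" -le "$(size_to_bytes "$2")" ]
}
```

On a real machine you'd feed it live output, for example: under_cap "$(journalctl --disk-usage | extract_usage)" 1G && echo "within cap"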
A couple of things worth knowing
Three details I learned the slightly harder way.
If you set SystemMaxUse lower than the journal's current size, the restart won't immediately shrink it. The cap applies as new data arrives and old data rotates. Run a --vacuum-size once after changing the config to bring it down straight away, then let the cap hold the line.
The defaults are not malicious, they're just generous. Out of the box SystemMaxUse defaults to 10 percent of the filesystem, capped at 4G. On a 500G server that's fine. On a 40G VM with a fat application sitting next to it, 4G of logs you never read is 4G you'd rather have back, and the percentage maths is exactly how mine crept up: the disk was bigger once, then the volume got resized down, and the journal kept the larger appetite.
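The default rule is easy to work through for your own disk. A small sketch, with a hypothetical function name and sizes in whole MiB:

```shell
# Default SystemMaxUse for a filesystem of the given size in MiB:
# 10 percent of the filesystem, capped at 4 GiB (4096 MiB).
default_max_use_mib() {
    tenth=$(( $1 / 10 ))
    if [ "$tenth" -gt 4096 ]; then
        echo 4096
    else
        echo "$tenth"
    fi
}
```

For a 500G disk (512000 MiB) and a 40G disk (40960 MiB) this gives the same 4096, which is the point: the cap is identical but the share of the disk it represents is very different.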
And finally, if you've decided you don't want persistent journals at all, set Storage=volatile and the journal lives only in RAM, capped by RuntimeMaxUse. I don't do this on servers I want to debug after a crash, but for ephemeral, stateless boxes it's a clean answer.
Why it grew in the first place
It's worth understanding what was actually filling those gigabytes, because a cap treats the symptom and the cause is often a misbehaving service. On the box that triggered all this, the bulk of the journal was a single container logging a noisy health-check at debug level, several lines a second, around the clock. Multiply a few hundred bytes by a few per second by a few months and you arrive at forty gigabytes without anyone doing anything obviously wrong.
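That multiplication is worth doing for your own noisy service. The sketch below uses a hypothetical function name and made-up inputs, and it only counts raw message text. The journal also stores metadata with every entry, so treat the result as a rough lower bound rather than a prediction of on-disk size.

```shell
# Raw message volume in MiB for a steady log stream.
# Arguments: bytes per line, lines per second, number of days.
log_volume_mib() {
    echo $(( $1 * $2 * 86400 * $3 / 1048576 ))
}
```

As an illustration with invented figures, log_volume_mib 300 5 90 comes to 11123, so roughly 11G of message text in three months from a single chatty health-check.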
You can find the loudest offenders with:
journalctl --output=json --since "1 hour ago" \
| jq -r '._SYSTEMD_UNIT // ._COMM' \
| sort | uniq -c | sort -rn | head
That counts log lines per unit over the last hour and sorts them. If one unit is producing ten times what everything else does combined, you've found your problem, and the right fix is to quieten that service rather than to keep buying it more disk. In my case turning the container's log level down from debug to info cut the volume by something like ninety percent, and the journal cap became a safety net rather than a constant fight.
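Line counts can mislead when one service writes few but enormous messages, so it can help to total message sizes per unit as well. A minimal sketch, assuming input of one "unit, tab, size" pair per line; the function name is hypothetical.

```shell
# Sum a size column per unit.
# Input on stdin:  "<unit><TAB><size>" per line.
# Output:          "<total><TAB><unit>", largest total first.
sum_by_unit() {
    awk -F '\t' '
        { total[$1] += $2 }
        END { for (u in total) printf "%d\t%s\n", total[u], u }
    ' | sort -rn
}
```

One way to feed it, using message length in characters as an approximation of size: journalctl --output=json --since "1 hour ago" | jq -r '[(._SYSTEMD_UNIT // ._COMM // "unknown"), (.MESSAGE | tostring | length)] | @tsv' | sum_by_unit | head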
Centralising, briefly
If you run more than a couple of machines, the per-host cap stops being the whole answer, because the logs you most want during an incident are on the host that's currently on fire and possibly unreachable. I won't go deep here, but the shape of the solution is to ship journals off the host as they're written, either with systemd-journal-remote for a pure-systemd setup or, more commonly these days, a lightweight forwarder into something like Loki. The local cap then governs how much you keep on the box itself, while the real history lives somewhere central and durable. Even with central logging you still want the cap, though, because the local journal is your fallback for exactly the network-partition moment when the central store is the thing you can't reach.
None of this is clever. It's four lines of config and one vacuum command. But it's the difference between a journal that watches your system and a journal that eventually becomes the incident, and I'd rather it stayed in the first category. The box has been steady at well under a gigabyte ever since, and I haven't thought about it again, which is the whole point.