The outage started, as the good ones do, with a symptom that had nothing obvious to do with the cause. The application stopped accepting writes. Not slowly, not with a graceful error, just a flat refusal. Users got 500s, the on-call phone went off, and I went looking for a problem in the application that was not, in fact, in the application at all.
I lost a good ten minutes there. The app logs showed database errors, so I looked at the database. The database logs showed it could not write, so I assumed disk, but the data partition had plenty of room. I checked it twice because the first answer did not fit my theory, which is always a sign you are checking the wrong thing.
The data partition was fine. /var was not.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G  4.2G   15G  22% /
/dev/sdb1       500G  180G  295G  38% /data
/dev/sdc1        10G   10G    0G 100% /var
There it was. /var at 100%, zero bytes free, and /var on this box is where Postgres keeps its socket and its write-ahead bits, where journald keeps the journal, and where the application drops its own logs. Once that partition fills, half the things on the machine that assumed they could always write a few bytes discover they cannot, and they fail in creative and unrelated-looking ways. The database could not write its WAL. The app could not log why. The whole thing seized for want of a few hundred megabytes.
The culprit was depressingly mundane. A du -sh /var/* | sort -h pointed straight at /var/log, and inside it a single application log that had grown to several gigabytes because logrotate had quietly stopped rotating it. Why had it stopped? Because months earlier someone, possibly me, had changed the log path in the app config and never updated the matching logrotate rule. So logrotate was diligently visiting a path where no file lived any more, finding nothing to rotate and saying nothing about it, whilst the real log grew without limit, like a sprinkler watering the patio next to a burning shed.
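That drill-down is easy to wrap in a small function so it can be repeated one directory level at a time. What follows is a minimal sketch, assuming a POSIX shell; biggest_under is a name made up for illustration, not something from the box in question.

```shell
#!/bin/sh
# biggest_under DIR [COUNT]
# Hypothetical helper: list the COUNT largest entries directly under DIR,
# sizes in KiB, largest last. Run it again on the winner to go a level deeper.
biggest_under() {
    dir=$1
    count=${2:-5}
    # -s: one total per entry
    # -k: sizes in KiB, so a plain numeric sort is enough
    # -x: stay on this filesystem instead of wandering into other mounts
    du -skx "$dir"/* 2>/dev/null | sort -n | tail -n "$count"
}
```

Something like biggest_under /var followed by biggest_under /var/log would retrace the hunt described here. The glob skips dotfiles, which is usually fine under /var but is worth remembering elsewhere.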
The immediate fix was the boring kind. Truncate the runaway log to get breathing room, restart the database, watch the writes come back:
$ : > /var/log/myapp/app.log
$ systemctl restart postgresql
I used : > rather than rm deliberately. The application still had the file open, so deleting it would only have removed the name; the kernel keeps the blocks allocated until the last open handle is closed, which in practice meant nothing would have been freed until a restart. Truncating in place returns the space immediately. Small thing, but at 3am with the phone ringing it is the difference between a fix and a second incident.
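The truncation behaviour is easy to check in a scratch directory. This is a small sketch, assuming a POSIX shell; file descriptor 3 stands in for the application holding its log open in append mode, and nothing outside the temporary directory is touched.

```shell
#!/bin/sh
# Demo: truncating a log in place while a writer still has it open.
demo_dir=$(mktemp -d)
log="$demo_dir/app.log"

# Open the file for appending on descriptor 3, playing the part of the
# application that never closes its log.
exec 3>>"$log"
printf 'some log line\n' >&3
size_before=$(wc -c < "$log" | tr -d ' ')

# Truncate in place: same file, same open descriptor, zero bytes used.
: > "$log"
size_after=$(wc -c < "$log" | tr -d ' ')

# The writer carries on, and because it opened in append mode its next
# line lands at the start of the now-empty file.
printf 'after truncate\n' >&3
size_resumed=$(wc -c < "$log" | tr -d ' ')

echo "before=$size_before after=$size_after resumed=$size_resumed"
# prints: before=14 after=0 resumed=15

exec 3>&-
rm -rf "$demo_dir"
```

One caveat: a writer that did not open its log in append mode keeps its old offset after a truncate, so its next write leaves a sparse hole at the front of the file. The space is still freed, but the file's apparent size will look alarming. On Linux, lsof +L1 lists files that are still open after being unlinked, which is a quick way to spot the rm variant of this mistake.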
Then the real fix, which was the logrotate rule. I pointed it at the actual path, set a sane maxsize so it rotates on volume as well as on schedule, and added a postrotate that signals the app to reopen its handle. And I added monitoring, because the genuine failure here was not the full disk. It was that a partition went from comfortable to full and nothing told me until the application fell over. A disk-usage alert at 85% would have turned a 2am page into a Tuesday-afternoon ticket.
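For the monitoring half, even a cron-driven shell check covers the case described here. The sketch below is one hedged way to do it, not the alerting actually deployed; the function names are hypothetical, and a real setup would more likely live in whatever monitoring system is already in place.

```shell
#!/bin/sh
# Hypothetical cron-style disk check; all function names are illustrative.

# parse_use_pct: read df output on stdin and print the Use% of the first
# filesystem row as a bare integer.
parse_use_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# usage_pct MOUNT: current Use% of the filesystem holding MOUNT.
usage_pct() {
    # -P keeps each filesystem on one line even with long device names.
    df -P "$1" | parse_use_pct
}

# check_mounts LIMIT MOUNT...: print a line for every mount at or above
# LIMIT percent and return non-zero if any were found.
check_mounts() {
    limit=$1
    shift
    status=0
    for mount in "$@"; do
        pct=$(usage_pct "$mount")
        if [ "$pct" -ge "$limit" ]; then
            echo "WARN $mount is at ${pct}% (limit ${limit}%)"
            status=1
        fi
    done
    return "$status"
}
```

Run as check_mounts 85 / /data /var, it stays silent while everything is comfortable and prints one line per offending mount otherwise, so cron's habit of mailing any output is enough to turn it into an alert.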
The thing I keep relearning is that the most boring infrastructure is the most dangerous, precisely because you set it up once and stop thinking about it. Log rotation is exactly that. You configure it on day one, it works, you forget it exists, and three years later a config drift you do not remember making takes down a service that has nothing to do with logs. Nobody puts "verify logrotate still points at the right file" on a runbook. After this, I did.
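That runbook item can be automated. The sketch below makes some loud assumptions: the rule printed by sample_rule only mirrors the shape of the fix (the 100M cap, the seven kept rotations, the HUP signal and the myapp.service unit name are all invented for illustration), and rotated_paths is a heuristic that expects simple rule files without braces inside their scripts.

```shell
#!/bin/sh
# Hypothetical runbook check; names, sizes and the signal are illustrative.

# sample_rule LOGPATH: print a logrotate rule shaped like the fix.
# missingok is left out on purpose, so a missing file is an error.
sample_rule() {
    cat <<EOF
$1 {
    daily
    maxsize 100M
    rotate 7
    compress
    notifempty
    postrotate
        systemctl kill -s HUP myapp.service
    endscript
}
EOF
}

# rotated_paths RULE_FILE: print the paths a rule file claims to rotate.
# Heuristic: any word starting with "/" on a line outside a { } block.
rotated_paths() {
    awk '
        $1 ~ /^#/ { next }
        depth == 0 {
            for (i = 1; i <= NF; i++)
                if (substr($i, 1, 1) == "/") print $i
        }
        index($0, "{") { depth++ }
        index($0, "}") { depth-- }
    ' "$1"
}

# path_exists PATTERN: succeed if PATTERN (globs allowed) matches a file.
path_exists() {
    # Unquoted on purpose so the shell expands globs such as *.log.
    for candidate in $1; do
        [ -e "$candidate" ] && return 0
    done
    return 1
}

# missing_paths RULE_FILE: print every rotated path with no file behind it.
missing_paths() {
    rotated_paths "$1" | while IFS= read -r pattern; do
        path_exists "$pattern" || echo "$pattern"
    done
}
```

Pointed at the real rule file, for instance missing_paths /etc/logrotate.d/myapp (a file name assumed here), any output at all means the rule and the application have drifted apart. Two related habits help: leaving missingok off rules for logs that should always exist, since logrotate's default is to report a missing file as an error, and running logrotate -d against the rule, which is a dry run that shows what would be rotated without touching anything. Note too that maxsize is only evaluated when logrotate actually runs, so a once-a-day cron job still checks volume only once a day.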