The disk filled up. df said 98% full, du on the log directory said the logs were a sensible few hundred megabytes, and logrotate's own status file swore it had rotated cleanly that morning. Two tools, two confident answers, both wrong in the way that wastes an hour.
The culprit is an old one. The app holds an open file descriptor to its log. logrotate renames app.log to app.log.1 and creates a fresh app.log. But renaming a file does not touch the open handle. The process is still writing happily to the same inode, now called app.log.1, and on the next cycle logrotate compresses or deletes that file, at which point the process is writing to a file with no name at all. The bytes still go somewhere. They just stop being visible to du, because du walks names and this file no longer has one.
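The sequence is easy to reproduce in a scratch directory. The sketch below is a stand-in, not the real app: the shell itself plays the daemon by holding descriptor 3 open, and the paths and variable names are made up for illustration. The /proc check near the end assumes Linux.

```shell
# Stand-in for the app: this shell holds fd 3 open on the log.
dir=$(mktemp -d)
log="$dir/app.log"
exec 3>"$log"
echo "first line" >&3

# What logrotate does on rotation: rename, then create a fresh file.
mv "$log" "$log.1"
: > "$log"
echo "second line" >&3                     # follows the inode, not the name

renamed_contents=$(cat "$log.1")           # both lines ended up here
fresh_size=$(wc -c < "$log" | tr -d ' ')   # the new app.log stays empty

# Next cycle: the rotated file is removed while fd 3 is still open.
rm "$log.1"
echo "third line" >&3                      # the write still succeeds
write_status=$?
names_left=$(ls "$dir")                    # only app.log has a name now

# Linux only: the kernel marks the descriptor's target as deleted.
if [ -d "/proc/$$/fd" ]; then
    ls -l "/proc/$$/fd/3" || true
fi

exec 3>&-                                  # closing the descriptor frees the blocks
```

After the rm, nothing under the directory accounts for the three lines, which is the same blind spot du had on the real box.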
proving it
lsof is the tell. A deleted-but-open file shows up explicitly:
lsof -nP +L1 | grep deleted
The +L1 asks for files with a link count below one, which is precisely the "no name, still open" case. Sure enough, there was the app, holding 14GB of log that existed only as an open descriptor. That is your missing disk space: real, occupied, and invisible to the obvious tools.
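On a Linux box without lsof, much the same information can be read out of /proc. This is a rough sketch rather than a replacement for lsof: the function name is made up, it assumes GNU readlink and stat, and it needs enough privilege to read the target process's fd directory. It lists anything the kernel labels "(deleted)", including files that never lived on the disk you care about.

```shell
# Print "<bytes> <path>" for each deleted-but-open file held by one PID.
# Hypothetical helper; Linux /proc plus GNU readlink and stat assumed.
deleted_open_files() {
    pid=$1
    for fd in "/proc/$pid"/fd/*; do
        # Each entry is a symlink to the open file; skip any we cannot read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)')
                # -L follows the link to the open file itself, name or no name.
                size=$(stat -L -c %s "$fd" 2>/dev/null) || continue
                echo "$size $target"
                ;;
        esac
    done
}
```

With the pid file from the postrotate example, that would be called as deleted_open_files "$(cat /run/app.pid)".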
The whole point of logrotate's postrotate step is to tell the app to let go and reopen, usually with a signal:
postrotate
    kill -HUP $(cat /run/app.pid)
endscript
The trouble is that SIGHUP only does anything if the application actually handles it and reopens its logs. Plenty do. Plenty do not, especially homegrown daemons that someone wrote to log to a path and never thought about rotation. This one ignored HUP entirely. logrotate sent the signal, the app shrugged, and the handle stayed open.
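For reference, "handles it and reopens" amounts to very little code. The sketch below uses a shell script as a stand-in for a daemon: a trap on HUP re-runs the open on the log path, so the descriptor moves to whatever file currently has that name. The function name and paths are hypothetical, and a real service would do the equivalent in its own language's signal handler.

```shell
dir=$(mktemp -d)
LOG="$dir/app.log"

# (Re)open the log by path, in append mode, on fd 3.
reopen_log() {
    exec 3>>"$LOG"
}
trap reopen_log HUP

reopen_log
echo "line 1" >&3

# logrotate's rename: fd 3 still points at the old inode.
mv "$LOG" "$LOG.1"
echo "line 2" >&3                 # lands in app.log.1

# What postrotate does; here the script simply signals itself.
kill -HUP $$
echo "line 3" >&3                 # lands in a fresh app.log

old_contents=$(cat "$LOG.1")
new_contents=$(cat "$LOG")
exec 3>&-
```

Once the handler has run, nothing holds the rotated file open any more, so compressing or deleting it on the next cycle actually frees the space.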
the fixes, least to most invasive
In rough order of how much I like them:
- copytruncate. Tell logrotate to copy the file, then truncate the original in place, leaving the inode (and the app's handle) intact. Simple, no app cooperation needed. The cost is a small race window where lines written mid-copy can be lost, and a brief doubling of disk use during the copy.
- Make the app reopen properly. If you control the code, handle SIGHUP and reopen the log file. This is the correct fix and the one worth doing if you can.
- Restart on rotate. Heavy-handed, but for a service that genuinely cannot reopen and tolerates a blip, a restart in postrotate is honest about what is happening.
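What copytruncate does to the file can be sketched in a few lines. Again the shell stands in for the app, and the names are made up. One assumption matters here: the descriptor is opened in append mode (>>). A writer that does not use append mode keeps its old offset after the truncate, and its next write leaves a sparse file with a hole at the front.

```shell
dir=$(mktemp -d)
log="$dir/app.log"
exec 4>>"$log"                    # the "app", opened in append mode
echo "before rotate" >&4
inode_before=$(ls -i "$log" | awk '{print $1}')

cp "$log" "$log.1"                # copy ...
# anything the app writes at this point is lost by the truncate
: > "$log"                        # ... then truncate in place

echo "after rotate" >&4           # same descriptor, same inode
inode_after=$(ls -i "$log" | awk '{print $1}')
rotated_contents=$(cat "$log.1")
live_contents=$(cat "$log")
exec 4>&-
```

The inode number is unchanged across the rotation, which is why the app never needs to be told anything. The comment between cp and the truncate marks the race window.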
I went with copytruncate for now, because I did not own the binary and the lost-line race is acceptable for this particular log. It is a workaround, not a fix, and I have noted it as such so future-me does not mistake it for design.
The reclaim, by the way, was instant once the handle was released. The kernel only frees the blocks of a deleted file when the last descriptor closes, so the moment the app let go, df snapped back to something sane. No reboot, no fsck, just a process finally admitting the file it was writing to had been gone for days.