The disk kept filling up. logrotate was configured, logrotate was running, the rotated files were appearing with their neat little .1, .2.gz suffixes, and yet df crept upwards every day like nothing was being cleared at all. This is one of those bugs where every individual piece is working and the system as a whole is still wrong, which is the most irritating kind.
What was actually happening
logrotate's default mechanism is to rename the live log file and then create a fresh one. app.log becomes app.log.1, a new empty app.log appears, job done. This works perfectly for any application that opens its log file fresh on each write, or that reopens it when told to.
The application I was dealing with did neither. It opened app.log once at startup, held the file descriptor, and wrote to it forever. When logrotate renamed the file, the application did not notice. A file descriptor points at an inode, not a name, so it cheerfully carried on writing to the renamed app.log.1, and later to the gzipped, deleted inode that no longer had a name at all.
That last part is the killer. Once logrotate deletes the rotated file, the inode has no directory entry but is still open, so its blocks are not freed. The space is gone, invisible to du, present in df, held hostage by a process writing into a file you can no longer see.
$ lsof +L1 | grep deleted
app 1234 app 3w REG 8,1 4831838208 0 /var/log/app/app.log (deleted)
There it is. Four and a half gigabytes, one open descriptor, zero links.
The two real fixes
The clean fix is postrotate: tell logrotate to signal the application after it rotates, and have the application reopen its log.
/var/log/app/*.log {
daily
rotate 7
missingok
postrotate
/bin/kill -HUP $(cat /run/app.pid) 2>/dev/null || true
endscript
}
This is the right answer when it works. Most well-behaved daemons reopen their logs on SIGHUP, and nginx, rsyslog and friends all document exactly this. The trouble is mine did not. SIGHUP to this particular application meant "reload config", which it did, without reopening anything. I read the source to be sure. It simply never closes and reopens the descriptor, full stop.
When you cannot make the application cooperate, the fallback is copytruncate:
/var/log/app/*.log {
daily
rotate 7
missingok
copytruncate
}
This copies the current log to the rotated name and then truncates the original file in place, back to zero length, without ever renaming it. The application's file descriptor still points at the same inode, which is now empty, so it carries on writing to the file it always had. No signal required. No cooperation required.
It is not perfect. There is a small race between the copy and the truncate where a few lines written in that window can be lost. For application logs I can live with that; for anything I needed to be audit-complete I would fix the application instead. But for a noisy app that will not reopen its descriptor and whose source I would rather not maintain a patch against, copytruncate turned a daily disk emergency into a non-event.
The disk has been flat for a week. The moral, as ever, is that the open file descriptor is the truth and the filename is just a label we put on it for our own convenience.