The disk filled up. Not slowly, not with a warning, just a 3am page and a root filesystem at 98%. The culprit was a log file that logrotate swore it had rotated five days ago. ls -la /var/log/theapp/ showed a tidy app.log of 12KB and five gzipped archives, and du -sh agreed that the directory was tiny. df insisted that roughly 40GB more was in use than du could find anywhere on the filesystem. Those two facts cannot both be true, and yet there they were.
The answer is one of those things you only learn by losing an evening to it. logrotate did its job. It renamed app.log to app.log.1, then later deleted the older archives, and its postrotate script sent the app a SIGHUP to tell it to reopen its log file. The app, written by someone who had never read the logrotate man page, ignored SIGHUP entirely. So it kept its original file descriptor open, pointing at an inode that no longer had a name. The file was unlinked but not freed, because a process still held it open, and it kept growing into the void.
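To make the mechanism concrete, here is a minimal Python sketch of what the app was doing to the disk. The file name and sizes are made up for the demo; only the os calls are real. It writes to a file, unlinks it, keeps writing through the still-open descriptor, and then asks the kernel what it sees.

```python
import os
import tempfile


def demo_unlinked_growth(chunk: bytes = b"x" * 4096, writes: int = 8) -> dict:
    """Write to a file, unlink it, keep writing, and report what the kernel sees."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "app.log")  # hypothetical log name
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, chunk)
        os.unlink(path)              # the name is gone, the inode is not
        for _ in range(writes):
            os.write(fd, chunk)      # still succeeds: the fd refers to the inode
        st = os.fstat(fd)
        return {
            "name_exists": os.path.exists(path),
            "listing": os.listdir(workdir),
            "link_count": st.st_nlink,
            "size": st.st_size,
        }
    finally:
        os.close(fd)                 # only now can the kernel free the blocks
        os.rmdir(workdir)


if __name__ == "__main__":
    print(demo_unlinked_growth())
```

On Linux this should report an empty directory listing and a link count of 0 while the size keeps climbing, which is the same picture as a tidy ls next to a full disk.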
$ lsof +L1 | grep theapp
theapp 4123 app 3w REG 259,2 41203847168 0 131074 /var/log/theapp/app.log (deleted)
There it is. (deleted), 41GB, and a link count of 0. df and du disagree because du walks the directory tree and the file has no directory entry left to walk. The space comes back the instant the process closes the descriptor, which in practice means a restart, or a slightly grim trick with gdb and close() that I will not recommend in writing.
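If lsof is not installed on the box, the same check can be approximated by reading /proc directly. This is a Linux-only sketch, not how lsof itself is implemented; the function name is mine, and without root it will mostly see only your own processes.

```python
import os
import stat


def find_deleted_open_files(proc_root: str = "/proc"):
    """Yield (pid, fd, size_bytes, target) for open regular files with link count 0."""
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            link = os.path.join(fd_dir, fd)
            try:
                st = os.stat(link)          # follows the link to the open file itself
                target = os.readlink(link)  # ends in " (deleted)" for unlinked files
            except OSError:
                continue  # descriptor closed while we were looking
            if stat.S_ISREG(st.st_mode) and st.st_nlink == 0:
                yield int(pid), int(fd), st.st_size, target


if __name__ == "__main__":
    for pid, fd, size, target in find_deleted_open_files():
        print(f"pid={pid} fd={fd} size={size} {target}")
```

Summing the reported sizes gives a rough idea of how much space is being held by files du cannot see.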
the fix is not to fix the app
You might think the answer is to make the app handle SIGHUP. It is, eventually, if you own the code. But I did not own this code, and the maintainer's idea of a release cadence was geological. So the pragmatic fix lives entirely in the logrotate config, with copytruncate:
/var/log/theapp/app.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
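copytruncate copies the live file to the rotated name and then truncates the original in place, rather than renaming it. The app's file descriptor still points at the same inode, which is now back to zero bytes, and it carries on writing happily without ever knowing anything happened. This does assume the app opened its log in append mode, as most loggers do; a writer that tracks its own offset would resume at the old position and leave a sparse file behind. No signal required, no cooperation required.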
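The idea behind copytruncate fits in a few lines. This is a sketch of the technique in Python, not logrotate's actual code; the helper name and the paths are hypothetical, and the writer is assumed to have opened the log with O_APPEND.

```python
import os
import shutil
import tempfile


def copytruncate(live_path: str, rotated_path: str) -> None:
    """Copy the live log aside, then cut the original back to zero bytes in place."""
    shutil.copyfile(live_path, rotated_path)
    # Anything the writer appends between these two calls is lost.
    os.truncate(live_path, 0)


def demo() -> dict:
    workdir = tempfile.mkdtemp()
    live = os.path.join(workdir, "app.log")
    rotated = live + ".1"
    # O_APPEND: every write lands at the current end of file, so the
    # writer follows the truncation instead of leaving a hole behind.
    fd = os.open(live, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, b"before rotation\n")
        inode_before = os.fstat(fd).st_ino
        copytruncate(live, rotated)
        os.write(fd, b"after rotation\n")   # same descriptor, never reopened
        with open(live, "rb") as f:
            live_content = f.read()
        with open(rotated, "rb") as f:
            rotated_content = f.read()
        return {
            "same_inode": os.stat(live).st_ino == inode_before,
            "live": live_content,
            "rotated": rotated_content,
        }
    finally:
        os.close(fd)
        shutil.rmtree(workdir)


if __name__ == "__main__":
    print(demo())
```

The writer never reopens anything: the name app.log keeps pointing at the same inode throughout, so there is nothing for the app to notice.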
There is a small honesty tax to pay: a tiny window between the copy and the truncate where a few log lines can be written and then lost. For an audit log that matters. For a chatty application debug log, it does not, and I will take the occasional dropped line over a 3am disk-full page every time.
The wider lesson, and the reason I am writing this down rather than just muttering it, is that df lying to your face is almost always a held-open deleted file, and lsof +L1 is the first thing to reach for. The number of times I have watched someone delete files in a panic, see no change in df, and conclude the disk is haunted… it is not haunted. Something is still holding the door open.