Ramblings of an aging IT geek
← Ramblings of an aging IT geek
linux

when logrotate and a stubborn daemon disagree

A disk filled up because an application kept writing to a log file logrotate had already moved away, and the daemon flatly refused to take the hint.

A terminal on a Linux machine

The page came in at the worst possible time, which is to say any time: a box at 98% disk and climbing. The culprit was obvious from du, a single application log that had grown to several gigabytes. The confusing part was that logrotate was configured for it, ran nightly, and as far as cron was concerned had been doing its job for weeks.

It had. That was the trouble. logrotate was rotating the file every night, dutifully, and the application was completely ignoring it.

what logrotate actually does

The default and sanest rotation method is create: logrotate renames the live file out of the way, then makes a fresh empty one in its place. So app.log becomes app.log.1, and a new app.log appears. Clean and atomic.

Except the application already has the old file open. On Linux, a file descriptor points at an inode, not a path. Renaming the file changes the path but not the inode, and the open descriptor follows the inode. So the daemon happily kept writing into app.log.1, the rotated file, which it could no longer be reached by name but was very much still growing. Meanwhile the shiny new app.log sat there at zero bytes, looking innocent.

$ lsof -p $(pgrep myapp) | grep '\.log'
myapp 2841 myapp 7w REG 253,1 4831238144 1179649 /var/log/myapp/app.log.1 (deleted)

There it is. (deleted) once the next night's rotation removed the rename, but the inode lived on because the process still held it open. The space won't come back until that descriptor closes, which it never does because the daemon runs forever.

the SIGHUP that wasn't

This is exactly the problem postrotate exists to solve. The convention is that logrotate, after rotating, signals the application to reopen its log files, usually with kill -HUP. A well-behaved daemon catches SIGHUP, closes its file descriptors, and reopens them by name, picking up the new empty file.

The logrotate config did send a HUP. The application simply didn't handle it. SIGHUP's default disposition is to terminate the process, but this one had installed a handler that, as far as I could tell from the source, treated HUP as "reload config" and never touched its log descriptors at all. So the signal arrived, the app shrugged, and kept scribbling into the deleted inode.

A server in a rack

the fix, and the better fix

The immediate fix was to make the application let go of the descriptor. A full restart did it, the disk freed instantly the moment the old process exited, and gigabytes came back at once. Not graceful, but it stopped the bleeding.

For the lasting fix I changed the rotation strategy to copytruncate:

/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

copytruncate copies the current file to the rotated name and then truncates the original in place, back to zero bytes, without renaming it. The inode never changes, so the application's open descriptor keeps pointing at the same now-empty file and carries on writing from the top. No signal needed, no cooperation required from the daemon.

It isn't free. There's a small window between the copy and the truncate where lines written in that instant can be lost, and you briefly hold two copies of the file. For an application that won't reopen on HUP and that I couldn't patch, that's a price worth paying. If the app had behaved, create plus a working postrotate HUP would have been cleaner.

The lesson is an old one but it keeps catching people: rotating a log is a conversation between two programs, and it only works if both halves play along. logrotate did its part perfectly. The application just wasn't listening.