Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the outage was a full disk, the disk was full of logs about the outage

A service that fell over because /var filled up, and the feedback loop where the logging about the failure was what kept the disk full.

A terminal showing a bug being chased

The service went down at about two in the morning, came back on its own, went down again, came back, and kept doing that. Flapping. The on-call dashboards showed nothing useful: CPU fine, memory fine, no obvious traffic spike. The thing that should have explained it, the logs, were exactly where the problem was hiding.

The first real clue was an error in the application log that made no sense for what the app does: no space left on device. Not a database error, not a timeout, a write failing. A quick df -h told the whole story in one line.

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G   20G   20M 100% /var

/var was full. Completely. And /var is where the logs live.

Lines of code on screen

the loop

Here's the bit that's almost funny in hindsight. The disk filled up. A write failed. The application caught the failed write and, being a well-behaved application, logged an error about it. That log line went to /var, which was full, so that write failed too, which generated another error, which it tried to log. The harder the app tried to tell us something was wrong, the faster it filled the disk it was complaining about.

It wasn't an infinite spin (the failed log writes mostly went nowhere), but every retry of the real work generated a fresh burst of error logging, and the disk never got a chance to drain. The service would crash, restart, run for a few minutes, fill the last scraps of space again, and fall over. That was the flapping.

what actually filled it

The root cause was dull, which they usually are. A library had been bumped a few weeks earlier and its default log level had changed from warn to debug. Nobody noticed, because at normal volume debug logging is just noise you scroll past. It quietly ate disk for a fortnight until there was none left, and the failure surfaced at 2am because that's when the nightly batch job pushed volume over the edge.

fixing it and stopping it coming back

The immediate fix took thirty seconds: rotate and compress the offending log, free a few gigs, service stable. The actual fix took longer and mattered more.

# put the noisy logging back where it belonged
LOG_LEVEL=warn

# and stop trusting ourselves to notice next time

I added log rotation with a hard size cap so no single log can ever eat the disk, set /var/log on its own volume so a runaway log can't take the rest of the system down with it, and added a disk-usage alert at 80% rather than waiting for 100% to announce itself. Boring, all of it. But the lesson stuck: a full disk doesn't just stop you writing data, it can stop you finding out why anything is broken, because the diagnostics need somewhere to land too. Leave headroom, and never let the logs about the problem become the problem.