The pager went off at half two with the least helpful message it could have chosen: elevated 5xx rate on the main API, no deploy in the last twelve hours, no obvious trigger. The service had been fine all morning. Then, over about ten minutes, the error rate climbed from nothing to roughly a third of all requests, and stuck there.
No deploy meant no easy rollback to reach for, which is the version of this problem I dread, because it means the cause is something that crept rather than something that landed.
the logs that weren't
First instinct: read the logs. There weren't any new ones. The application log stopped abruptly at 14:19 and simply went quiet, even though the service was very much still running and very much still failing. A service that has stopped logging but hasn't stopped serving is a strange and unsettling thing, and it was the first real clue.
When a process can't write its log file, you start to suspect the filesystem. So:
$ df -h /var
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 20G 20G 0 100% /var
There it is. /var completely full, not a byte free. Every write to a log file, every attempt to open a temp file, every spool operation, all of it failing with ENOSPC. The application wasn't crashing, it was failing to do ordinary I/O and turning those failures into 500s for any request that needed to write something.
what ate the disk
du is the obvious next move, and it pointed straight at the answer:
$ du -h --max-depth=1 /var | sort -h | tail -5
412M /var/cache
1.1G /var/lib
18G /var/log
Eighteen gigabytes of logs. A week earlier, someone had bumped a logger to DEBUG to chase an unrelated problem and never turned it back down. On a quiet day that was untidy but harmless. That afternoon a burst of traffic turned the debug firehose all the way up, and the log file grew faster than logrotate's nightly schedule could ever hope to contain. It hit the wall at 14:19, the same minute the logs went silent.
the fix, then the real fix
The immediate fix was dull and effective: truncate the offending log to reclaim space without confusing the process that still had it open.
# truncate -s 0 /var/log/app/debug.log
truncate rather than rm, because deleting a file a running process still holds open frees nothing until the process closes it. Within seconds /var had headroom, the writes succeeded, and the 5xx rate fell off a cliff. Total outage, about forty minutes, most of it spent working out that the problem was the disk and not the application.
Then the actual fixes, the ones that stop it being a 3am repeat:
- The log level went back to
INFO, and theDEBUGswitch became a time-limited thing rather than a permanent footgun. /var/loggot its own separate partition, so a runaway log can never again starve the rest of/varand take spooling and caching down with it.- And the bit I should have had all along: a disk-usage alert at 80% and 90%, well before anything fills, so the next time something starts eating the disk I hear about it as a warning at lunchtime rather than an outage in the afternoon.
The annoying part is that none of this was clever. A full disk is the most ordinary failure there is. It got me because the symptom (an API returning 500s) was three steps removed from the cause (a log level changed a week earlier), and nothing in between bothered to shout. The lesson, again, is that the boring monitoring is the monitoring that saves you. Free disk space is the least glamorous metric on the dashboard, right up until it's the only one that matters.