Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the night /var filled up and took the whole box with it

A production server fell over with no obvious cause until I found /var at 100% from a log nobody was rotating.

A terminal full of error output next to a bug

The page came in just after nine in the evening, which is the time these things always come in, never at half past ten in the morning when you are sat at your desk with coffee. One of our application servers had stopped serving. Not slow. Stopped. Health checks failing, the load balancer had pulled it out, and the on-call dashboard was a wall of red.

I want to write this one up properly because the root cause was so mundane, and the time I lost was entirely down to not checking the mundane thing first.

the symptoms made no sense

The first thing I did, like anyone, was SSH in and look at logs. Except the application had stopped logging anything new. The last lines were from a few minutes before the failure and then nothing, which is a strange thing for a process that is allegedly still running. systemctl status claimed the service was active. ps showed the process. It just wasn't doing anything useful.

I went down the obvious paths. Was it a deadlock? I took a thread dump, or tried to, and the dump failed to write. Was the database unreachable? No, another server on the same box was talking to it fine. Was it out of memory and thrashing? free -m said there was plenty of headroom. The CPU was idle. Everything I checked said the machine was healthy, and yet the application was clearly not.

Lines of source code on a dark screen

the thing I should have checked first

It was a colleague, half asleep and joining the call as a favour, who asked the question that ends most of these stories: "have you checked disk?"

I had not checked disk. I ran df -h and there it was, plain as anything:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G   19G     0 100% /
/dev/sdb1       100G   38G   58G  39% /data

Root filesystem at one hundred percent, zero bytes available. The application data lived on /data which had plenty of room, so nothing there had complained. But /var was on root, and /var was full. When a process cannot write to its log file, and when it cannot create the temp files it needs, it does not always crash cleanly. It can hang, block on a write that never completes, and sit there looking alive while doing absolutely nothing. That was our deadlock that wasn't a deadlock.

what actually filled it

A bit of du got me there:

du -xh /var --max-depth=2 | sort -rh | head

The offender was a single log file. One of our libraries had been bumped to debug level in a config change a fortnight earlier, for an unrelated investigation, and nobody had turned it back down. It had been writing a few hundred megabytes a day ever since. There was no log rotation on that particular file because it was written by the application directly rather than through the system logger, so logrotate never touched it. It had quietly grown until the disk had nothing left to give, and then everything that needed to write to root stopped at once.

More source code, with a configuration block highlighted

the fix, and the real fix

The immediate fix took thirty seconds. Truncate the runaway file, free up the space, restart the service, watch it come back and rejoin the pool. The box was healthy again before I had finished apologising to the colleague who had to spell out "check disk" to me at nine in the evening.

The real fix is the part worth keeping:

  • Turn the log level back down to where it should have been, and add a note that says why, so the next person doesn't repeat the change.
  • Put that application's own log file under logrotate, with a sane size cap, so it can never do this again regardless of level.
  • Add a disk space alert at 85% on every server, firing to the same channel as everything else. We had alerts on CPU and memory and not, somehow, on the one resource that had just taken us down.

That last one is the bit that stings. We monitored the glamorous metrics, the ones that look good on a graph, and skipped the boring one. Disk space is the least interesting number on a server right up until it is the only number that matters. A full disk has caused me more genuine outages than any clever distributed-systems failure, and it always presents as something else: a hang, a silent process, a service that is up but not working.

So the takeaway, written down so I actually remember it next time the pager goes at nine in the evening: before you reach for thread dumps and heap analysis and the exotic explanations, run df -h. It takes one second. If I had run it first I would have been back in front of the telly forty minutes sooner, and with rather more dignity intact.