the night a logfile took the service down

The service stopped accepting writes around 2am, and the error in the logs was nonsense: the database complaining it couldn't write its own log, the application complaining it couldn't write its log, and the whole thing wedged in a way that looked like a database fault until you actually read what the disk was saying. It wasn't the database. It was /var. The partition was full, and a full disk fails in a hundred confusing ways before it tells you the one true thing.

The tell was df, which is always the second command I run on a sick box and should probably be the first.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        20G   20G     0 100% /var

Zero available. Once you see that, every weird error upstream of it collapses into one cause. The database couldn't write its WAL, so it refused writes. The app couldn't append to its log, so its logging layer threw, and because some genius (me, years ago) had wrapped a critical path in a log call that wasn't defensively coded, the failed write took the request down with it. A full disk is rarely the symptom you notice. It's the thing three layers underneath the symptom you notice.

what actually filled it

du pointed straight at it.

$ du -sh /var/log/* | sort -h | tail -3
1.1G    /var/log/nginx
2.3G    /var/log/journal
14G     /var/log/oldapp

Fourteen gigs in a directory for an application we'd decommissioned months ago. Its logrotate config went with the service when we tore it down, but the daemon that wrote the logs hadn't fully stopped, so it kept appending to a file that nothing was rotating any more. Months of an unrotated, unwatched logfile, growing a little every day, until one quiet Tuesday night it crossed the line and took a live service with it.

Recovery was quick once I understood it. First, truncate the orphaned log to reclaim the space immediately: truncate -s 0 rather than rm, because the daemon still held the file open, and rm wouldn't have given the space back until the daemon closed it. Once the database had room again it recovered on its own, and writes resumed. Total downtime was longer than it should have been, and almost all of that was me chasing the database error instead of looking at the disk.
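
The rm trap is easier to see written down. This is a sketch, not the commands from that night: it assumes GNU coreutils for truncate, and reclaim_log is a made-up helper name.

```shell
# reclaim_log FILE: empty a log in place instead of deleting it.
# Hypothetical helper for illustration; assumes GNU coreutils `truncate`.
reclaim_log() {
    file=$1
    if [ ! -f "$file" ]; then
        echo "reclaim_log: not a regular file: $file" >&2
        return 1
    fi
    # Truncating keeps the same file and the same open handles; only the
    # length drops to zero, so the filesystem frees the blocks right away.
    # rm would only unlink the name and leave the blocks allocated until
    # the writer closed its handle.
    truncate -s 0 -- "$file"
}
```

If someone has already reached for rm, lsof +L1 lists files that are deleted but still held open, which is usually where the missing space is hiding. One caveat: a writer that didn't open the file in append mode keeps writing at its old offset, leaving a sparse file that looks large in ls but holds little real data.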

the boring fixes that actually matter

Three things came out of the post-mortem, and none of them are clever.

First, alert on disk usage before it's a problem, at 80% and again at 90%, not when something falls over. We had monitoring. We did not have a disk-space alert on that host, which in hindsight is the kind of gap you only notice the night it bites you.
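
The check itself is only a few lines. A minimal sketch, not our actual monitoring: parse_use_pct and classify_usage are hypothetical names, the 80 and 90 defaults are the thresholds above, and getting the result to a pager is left out entirely.

```shell
# parse_use_pct: read `df -P` output on stdin and print the used
# percentage of the first filesystem listed, as a bare number.
parse_use_pct() {
    # -P gives one line per filesystem; column 5 is the capacity, e.g. "42%".
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# classify_usage PCT [WARN] [CRIT]: map a percentage to ok/warning/critical.
# The thresholds default to 80 and 90.
classify_usage() {
    pct=$1
    warn=${2:-80}
    crit=${3:-90}
    if [ "$pct" -ge "$crit" ]; then
        echo critical
    elif [ "$pct" -ge "$warn" ]; then
        echo warning
    else
        echo ok
    fi
}
```

Run from cron or a monitoring agent, the call would look like classify_usage "$(df -P /var | parse_use_pct)", with anything other than ok raising an alert.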

Second, decommissioning a service means stopping the thing that writes its logs, not just the thing that serves traffic. We removed the front door and left a tap running in the basement. A checklist item now: when you tear something down, grep for every process and cron and timer that mentions it, not just the obvious one.
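
A rough version of that checklist item, as a sketch. find_references is a made-up helper, and the directories mentioned afterwards are assumptions about a typical systemd-based Linux layout, so adjust them for your own hosts.

```shell
# find_references NAME PATH...: list files under the given paths whose
# contents mention NAME. Hypothetical helper for illustration.
find_references() {
    name=$1
    shift
    # -r recurse into directories
    # -l print only the names of matching files
    # -s stay quiet about missing or unreadable paths
    # -F treat NAME as a literal string, not a regex
    grep -rlsF -- "$name" "$@"
}
```

Something like find_references oldapp /etc/cron.d /etc/crontab /etc/systemd/system /etc/logrotate.d covers the config side. For the live side, pgrep -af oldapp shows matching processes with their command lines, systemctl list-timers --all shows timers, and lsof +D /var/log/oldapp shows whatever still has files open in the log directory, which is the check that should have caught a leftover daemon like ours.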

Third, and this is the one I actually care about, don't let logging fail your request path. A log write is a side effect. If /var is full or the disk is read-only, the request should still complete; the log call should swallow its own error and move on, not propagate. The outage wasn't caused by the full disk alone. It was caused by the full disk meeting a code path that treated "I couldn't log this" as "I can't serve this". The disk filling was inevitable eventually. The full outage was a choice I'd made years earlier without realising it.
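
The shape of the fix, sketched in shell rather than in the application's own language. In a script running under set -e, a failed append to the log aborts the whole script, which is a fair stand-in for a logging exception escaping into the request path. LOG_FILE, its default path, and all three function names are hypothetical.

```shell
# Hypothetical log location; override LOG_FILE to experiment.
LOG_FILE=${LOG_FILE:-/var/log/myapp/requests.log}

# Fragile: returns non-zero when the append fails (disk full, read-only
# filesystem, missing directory). Under `set -e` that aborts the caller.
log_fragile() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$LOG_FILE"
}

# Defensive: the log write is best effort. The braces let 2>/dev/null hide
# the shell's own redirection error, and `|| true` swallows the failure so
# it never reaches the caller.
log_safe() {
    { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$LOG_FILE"; } 2>/dev/null || true
}

# The "request path": logging is a side effect, the response is the job.
handle_request() {
    log_safe "handling $1"
    echo "response for $1"
}
```

On Linux, setting LOG_FILE=/dev/full is a handy way to try this, since every write to that device fails with "No space left on device". With log_safe, handle_request still prints its response; swap in log_fragile under set -e and the script dies before it gets there.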

Disks fill. That's not an exotic failure, it's a Tuesday. The job isn't to prevent every full disk, it's to make sure a full disk is an annoyance you get paged about at 80%, not a mystery outage you debug at 2am.