Ramblings of an aging IT geek
debugging

the outage that was just a full /var

A confusing cascade of service failures that turned out to be one thing: a filesystem with no room left to write.

The pager went off with three different alerts at once, which is the kind of thing that makes you assume a network problem or a bad deploy. The database was refusing connections, a queue worker had stopped acknowledging jobs, and an unrelated cron job was emailing failures. Three services, no obvious common cause, all unhappy in the same minute.

The common cause turned up in df -h: /var was at 100%. Once you see that, the whole mess collapses into one explanation: nothing can write. Postgres could not extend its WAL, the worker could not flush its log, cron could not spool. Every service was failing in its own idiosyncratic way for the single boring reason that there was nowhere to put a byte.

The culprit, as it usually is, was logs. A misconfigured service had been writing a stack trace per request into /var/log for a fortnight, and the journal had its own slice of the same partition. A quick du -sh /var/* | sort -h pointed straight at it.
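
For anyone who would rather script that hunt than type it, here is a minimal Python sketch of the same idea as du -sh /var/* | sort -h. The function names tree_size and largest_entries are made up for illustration. It also sums apparent file sizes rather than allocated disk blocks, so its numbers can differ a little from what du reports.

```python
import os


def tree_size(path):
    """Sum the apparent size in bytes of every file under path.

    Unreadable directories and files are skipped rather than raising,
    and symlinks are not followed.
    """
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    return total


def largest_entries(parent, top=5):
    """Return (size_bytes, path) for the direct children of parent, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    size = tree_size(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
            sizes.append((size, entry.path))
    sizes.sort(reverse=True)
    return sizes[:top]


# Example (not run here):
#   for size, path in largest_entries("/var"):
#       print(f"{size / 1024 ** 2:10.1f} MiB  {path}")
```

Run as root, or expect the unreadable directories to be silently undercounted.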

The immediate fix took thirty seconds: truncate the runaway log, free the space, and the three services recovered on their own without a single restart, which is always a faintly magical thing to watch. The real fix came after: a size cap on the offending log and an alert on disk usage at 85%, so the next time it fills, something tells me before everything does.
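
The alert does not need to be clever. Below is a rough Python sketch of an 85% check, assuming it gets run from cron or a monitoring agent; the names percent_used and check_disk are hypothetical, and what happens when the check trips (email, pager, chat message) is left out. It relies on shutil.disk_usage from the standard library, whose figure can sit a few points away from the Use% column in df because of how reserved blocks are counted.

```python
import shutil

THRESHOLD_PERCENT = 85.0  # the alert level mentioned in the post


def percent_used(used, total):
    """Return used space as a percentage of total, or 0.0 for an empty filesystem."""
    if total == 0:
        return 0.0
    return 100.0 * used / total


def check_disk(path="/var", threshold=THRESHOLD_PERCENT):
    """Return (alert, percent) where alert is True once usage reaches the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.used, usage.total)
    return pct >= threshold, pct


# Example (not run here):
#   alert, pct = check_disk("/var")
#   if alert:
#       print(f"/var is at {pct:.1f}%, time to look at the logs")
```

Pointing it at the mount point that actually fills, rather than at /, is the part that matters.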