the outage caused by a full /var

A terminal showing an error trace over a dark background

The pager went at 03:12. The API was returning 500s, the health checks were red, and nothing had been deployed in two days. That last part is the bit that makes you nervous, because if nobody changed anything then something changed on its own, and self-inflicted is always easier to fix than self-organising.

First instinct was the database, because it usually is. The database was fine. Connections were fine, CPU was idle, the load balancer was happily routing to a node that was happily refusing to do anything useful. The app logs were the obvious next stop, except the app could not write logs, which turned out to be the entire story rather than a footnote to it.

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        40G   40G   20K 100% /

There it is. A hundred percent on /. Everything downstream of that was a symptom: the app failing to open a log file and erroring, systemd failing to write its journal, the health endpoint failing because it tried to log the request and could not. A full disk does not announce itself politely. It just makes every component lie to you in a slightly different dialect.

A code window with a stack trace and a tail of log lines

The cause was dull, which is how these always end. A debug log line had been left enabled in a retry path, and a flaky upstream had started timing out around midnight. Each timeout logged a full request body, the retry logged it again, and the backoff logged that it was backing off. One slow dependency, multiplied by an enthusiastic logger, wrote about thirty gigabytes between midnight and three.

du -xh /var | sort -rh | head -5

That command found it in seconds once I thought to ask the right question. The fix at 03:30 was to truncate the offending file, confirm the service recovered, and go back to bed. The fix the next morning was the real one: drop that log line to its proper level, put logrotate in front of the app's own files rather than trusting it to behave, and add an alert on disk usage at eighty percent so the next one pages me at a civilised hour with warning to spare.

The lesson I keep relearning is that monitoring the thing is not the same as monitoring the floor it stands on. We watched latency, error rate, queue depth, all the application-shaped numbers. We did not watch the boring physical limit underneath all of them. Disk space is the kind of metric that is fine, fine, fine, fine, and then it is the only thing that matters.

Alert on the boring stuff. It is always the boring stuff.