Ramblings of an aging IT geek

the outage caused by a full /var

A service quietly stopped accepting writes because /var filled up, and the cause was a debug log nobody had turned off after a deploy three weeks earlier.

The symptom was a service refusing new connections, with a log full of errors that all reduced to "cannot write". The application was healthy, the database was healthy, the network was healthy. The disk was not. df -h on the box showed /var at 100%, and a 100% full filesystem makes everything downstream lie to you about what is actually wrong.

The cause was embarrassing in the way these usually are. Three weeks earlier we had bumped a service's log level to debug to chase a different problem, and nobody had turned it back down. Debug logging on a busy service writes a lot, and logrotate was rotating but keeping enough history that the steady-state size crept up to the partition ceiling. It did not fall over on the day we changed it. It fell over on a quiet Sunday when a batch job tipped it past the line.

$ df -h /var
Filesystem          Size  Used Avail Use% Mounted on
/dev/mapper/vg-var   20G   20G   20K 100% /var
$ du -xh /var --max-depth=2 | sort -rh | head
14G  /var/log/app

The fix took two minutes: truncate -s 0 on the offending file rather than deleting it, because the process still had the file open, and rm would only have removed the name without freeing any space until a restart. Then set the log level back to info and actually commit that change so it survives the next deploy.
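For anyone who has not watched this happen, here is a minimal sketch of the truncate-in-place trick. It uses a throwaway temp file and a file descriptor held open by the shell as a stand-in for the running service; none of the names come from the actual incident. It also assumes the writer opened its log in append mode, which most loggers do. A writer that did not would keep writing at its old offset and leave a sparse file behind.

```shell
#!/bin/sh
# Sketch: empty a log in place while a writer still holds it open.
# The path is a throwaway temp file, not the real log from the incident.
log=$(mktemp)

# Stand-in for the running service: hold the file open on fd 3, append mode.
exec 3>>"$log"
echo "debug: lots of noise" >&3

# Truncate in place. The inode and the writer's descriptor are untouched,
# so the blocks are released now rather than at the next restart.
truncate -s 0 "$log"
size_after_truncate=$(wc -c < "$log")

# The writer carries on logging to the same descriptor.
echo "info: back to normal" >&3
size_after_write=$(wc -c < "$log")

# Clean up the demo: close the descriptor, then remove the file.
exec 3>&-
rm -f "$log"

echo "after truncate: $size_after_truncate bytes"
echo "after next write: $size_after_write bytes"
```

Had the file already been deleted, the space would stay allocated until the process closed it. On systems with lsof installed, lsof +L1 lists open files whose link count has dropped to zero, which is a quick way to spot that case.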

The real fix is the boring one. Monitor free space, not just used percentage, and alert before it matters. And treat "temporarily" turning up logging as something with a ticket attached, because temporary changes are the ones that take down a service three weeks later when everyone has forgotten they exist.
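As a rough illustration of alerting on free space rather than percentage, something like the following could run from cron. It is a sketch, not what we deployed: the function names, the MOUNT and MIN_FREE_KB variables, and the 2 GiB threshold are all made up for the example, and a real setup would send the warning to whatever alerting you already have instead of stderr.

```shell
#!/bin/sh
# Sketch of a free-space check meant to run from cron or a monitoring agent.
# MOUNT and MIN_FREE_KB are hypothetical knobs; tune them per partition.

# Print the available kilobytes on the filesystem holding $1.
# df -P forces the one-line-per-filesystem POSIX format; -k reports in KB.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if at least $2 KB are free on the filesystem holding $1.
has_free_space() {
    [ "$(avail_kb "$1")" -ge "$2" ]
}

MOUNT=${MOUNT:-/var}
MIN_FREE_KB=${MIN_FREE_KB:-2097152}   # 2 GiB, an arbitrary example threshold

if ! has_free_space "$MOUNT" "$MIN_FREE_KB"; then
    echo "WARNING: $MOUNT has only $(avail_kb "$MOUNT") KB free" >&2
fi
```

The check uses an absolute threshold because a percentage means very different things on a 20G partition and a 2T one. If df fails outright, the comparison fails too and the warning still fires, which is the safer direction for a monitor to be wrong in.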