the outage that was just a full /var

A terminal showing an error being traced

The page said the API was returning 500s. The application log said it could not write its session file. The database log said a client had disconnected mid-transaction. PostgreSQL said it could not create a temp file. Three components, three different complaints, and for ten minutes I treated them as three different problems.

They were one problem. df -h showed /var at 100 percent, and every one of those errors was the same disk refusing to accept another byte, just dressed in three different vocabularies. A log file had been quietly growing for weeks because something I had set up months ago was logging at debug level and nobody had told logrotate about it.

The fix took one command once I knew where to look: truncate the offending log, confirm the services recovered, then actually configure rotation so it could not happen again. The embarrassing part is how long I spent reading the application stack trace as if it held the answer, when the answer was a single percentage in df.

I have since added a check that pages me at 85 percent rather than at 100. A full disk is the most boring outage there is, and it announces itself loudly if you are listening on the right channel. The trick is to look at the dull system-level things first, before you go diving into the clever application-level ones. The disk is full long before the database starts complaining about temp files.