the query that was bleeding us a hundred milliseconds at a time

A line graph climbing on a server dashboard

There was no single bad query. That was the problem. The slow query log was empty because nothing crossed the threshold, and the dashboards looked fine until they suddenly didn't. We were falling over at peak and nobody could point at the cause.

The culprit was a join that ran on every page render to fetch a "you have N unread items" badge. On its own it was four milliseconds. Cheap. Except it ran uncached for every authenticated request, and the table it joined against had quietly grown past the point where the index still helped. At low traffic that four milliseconds was invisible. At peak, multiplied by every request, it pinned the database and everything queued behind it.

The fix was embarrassingly small. We added a covering index, then cached the count for thirty seconds in Redis because nobody needs their unread badge to be accurate to the second. Peak database CPU dropped by about half.

The lesson I keep relearning: the dangerous query isn't the slow one you can see in the log, it's the fast one you run ten thousand times. Look at total time spent, not worst case. pg_stat_statements ordered by total_time would have found this in about a minute, and I'll be checking it first next time.