The symptom was a read replica sitting at 90% CPU for no reason we could explain. Traffic was flat. Nothing had deployed. The slow query log was almost empty, because nothing was individually slow. That was the trap: we kept looking for the one bad query, and there wasn't one. There were thousands of mediocre ones.
The replica fed our internal dashboards, and one of those dashboards had a "live activity" panel that polled every five seconds. Per open tab. We had maybe forty people with that dashboard pinned on a second monitor, all day. Forty tabs, every five seconds, each firing a query that took 40ms. That's nothing. Until you multiply it.
SELECT count(*) FROM events
WHERE org_id = $1
AND created_at > now() - interval '24 hours';
No index on (org_id, created_at), so every call was a sequential scan over the last day of events, which on a busy org was a few hundred thousand rows. 40ms of CPU, hundreds of times a minute, never logged because it never crossed the slow threshold. The replica was being eaten by a thousand small bites.
pg_stat_statements told the whole story once I thought to look at it by total time rather than mean time. Sort by mean_exec_time and this query is invisible. Sort by total_exec_time and it's the top row by a mile, with a calls count that made me laugh out loud.
The fix was boring, which is how you know it's the right one.
- Add the composite index on
(org_id, created_at). The count dropped from a 40ms scan to a sub-millisecond index-only scan. - Cache the count for thirty seconds in the app, because nobody needs a per-second event count to be exact.
- Quietly check whether anyone actually watched the live panel. They didn't. We removed it.
Replica CPU went from 90% to about 12% and stayed there. No incident, no pager, just a slow leak we'd been paying for over months because each individual drip was too small to notice.
The lesson I keep relearning: a query's cost is time × frequency, and we instrument time far better than frequency. A 40ms query feels free. A 40ms query running ten thousand times a minute is a full-time job for a CPU core. Now the first thing I do when a database is busy and nothing looks wrong is sort by total time and count, not by how slow any single thing is. The villain is usually something cheap that we're doing far too often.