a database query that was quietly killing us

A database performance graph

It wasn't a slow query. That's what made it sneaky. Run it once and it came back in a few milliseconds, well under any threshold that would page anyone. The problem was that it ran a few thousand times a minute, and a few milliseconds times a few thousand is a chunk of a database doing nothing useful.

We found it the boring way: turned on slow query logging with the threshold dropped low, then sorted by total time rather than per-call time. The worst offender by total cost wasn't the big reporting query everyone suspected. It was a tiny lookup buried in a hot code path, firing on every request.

EXPLAIN ANALYZE told the rest of the story. Sequential scan on a table that had quietly grown past a million rows, filtering on a column with no index. Cheap when the table was small, linear and merciless once it wasn't:

CREATE INDEX CONCURRENTLY idx_events_user_id ON events (user_id);

CONCURRENTLY so it didn't lock the table whilst building. After that the plan flipped to an index scan, per-call time dropped from milliseconds to microseconds, and the aggregate load on the box fell off a cliff. Same query, same code, one index.

The takeaway: rank queries by total time, not per-call time. The thing killing you is rarely the one that's slow. It's the one that's fast and runs constantly.