the query that was quietly killing us

A monitoring graph on a server room screen

The symptom was vague in the way the worst symptoms are. Latency on the API crept up across the day, recovered overnight, and crept up again. No single endpoint was slow. No error rate moved. The database CPU sat at a comfortable sixty per cent and never spiked. Everything was fine, and everything was getting slowly worse, which is the most annoying state a system can be in.

The query that was doing it never showed up in the slow query log, because individually it was not slow. It ran in about forty milliseconds, well under our threshold. The problem was that it ran roughly fifteen thousand times a minute, and each of those runs did a sequential scan over a table that had grown past the point where forty milliseconds was acceptable for something on the hot path. Multiply a small sin by a large number and you get a quiet catastrophe.

I found it with pg_stat_statements, which I should have reached for hours earlier than I did. The slow query log sorts by individual duration. pg_stat_statements lets you sort by total_exec_time, which is duration times call count, and that is the column that actually matters when something is killing you by volume rather than by spikes.

SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

The offender was at the top with a total_exec_time an order of magnitude above anything else, despite a perfectly respectable mean. It was a lookup filtering on a column that, it turned out, had no index. Nobody had added one because when the feature shipped the table had four hundred rows and a sequential scan was instant. Two years later the table had several million rows and the scan was not instant, but the change had been gradual enough that no alarm ever fired.

A SQL query and explain plan on screen

EXPLAIN ANALYZE confirmed it in one line: Seq Scan where there should have been an Index Scan. A single CREATE INDEX CONCURRENTLY on the filtered column, run during a quiet hour so it would not lock anything, and the mean dropped from forty milliseconds to under one. The daily latency creep vanished entirely. The database CPU fell to forty per cent and stopped pretending to be healthy.

The lesson I keep relearning: slow is not the only failure mode, and the slow query log only catches one kind of villain. The other kind is the query that is individually fine and collectively ruinous, and you will only ever see it if you measure total time rather than per-call time. I have now got pg_stat_statements ordered by total_exec_time pinned in a dashboard, because the thing that hurts you most is rarely the thing that looks worst on its own.