The query wasn't slow. That's what made it so annoying to find. Every individual run finished in under two milliseconds, well inside anyone's idea of fine. The problem was that we ran it roughly a million times an hour, and nobody had noticed because no single trace ever looked alarming.
The symptom was vague: database CPU sat around 70% at times of day that didn't match our actual traffic. Load would climb on a quiet Sunday morning and we'd shrug. The dashboards showed plenty of headroom on every individual endpoint. Nothing was on fire, which is exactly the kind of problem that survives for months.
What gave it away was pg_stat_statements, which I should have looked at far sooner. Sorting by total_exec_time rather than mean_exec_time changed the whole picture. The slowest queries by mean were big analytics reports run a few times a day, harmless. But near the top by total time was a tiny lookup nobody had thought about.
SELECT id, status FROM accounts WHERE external_ref = $1;
It was being called once per item inside a loop that processed batches of several hundred items, and those batches ran constantly. Classic N+1, hiding behind an ORM that made each call look like a property access rather than a round trip to the database. The query itself was perfect. There was an index on external_ref. It used it. It returned in microseconds. It just did so a hundred million times more often than anyone realised.
The fix was to fetch the whole batch in one statement:
SELECT id, status, external_ref FROM accounts WHERE external_ref = ANY($1);
Pass the list of refs as an array, get back a map keyed by ref, look each one up in memory. One round trip instead of several hundred. Database CPU dropped from a steady 70% to about 22% within ten minutes of the deploy, and the Sunday-morning climb vanished entirely.
The lesson isn't really about Postgres. It's that mean latency lies to you. A query that's fast and frequent can cost far more than a query that's slow and rare, and most monitoring is biased towards the slow-and-rare because that's what trips alerts. If you only ever look at the p99 of individual requests you will never see this. Total time, grouped by statement, is the view that tells the truth.
We've now got a panel on the database dashboard that's just the top ten by total execution time, refreshed hourly. It's not pretty. But it would have caught this in a day instead of a quarter, and that's worth an ugly panel.