The query was fine for eighteen months and then it wasn't. Nothing changed in the code. What changed was the table grew past some invisible threshold, the planner quietly switched strategy, and one endpoint went from 40ms to 4 seconds, then started timing out under load and taking the connection pool down with it. By the time it was a fire, it had been smouldering for weeks.
This is the most boring and most common kind of performance problem, and I want to write down the actual debugging path rather than just the answer, because the path is the transferable bit.
the symptom
The first signal wasn't the query at all. It was the connection pool exhausted alerts and a pile of HTTP 503s. The app server logs were full of "timeout acquiring connection". That's a misleading symptom: it points you at the pool, the app, the load balancer, anywhere but the database. The pool was exhausted because slow queries held connections for seconds, so connections piled up faster than they drained. The pool was the victim, not the culprit.
The thing that pointed me at the right place was pg_stat_activity. Querying it during an incident shows you what every connection is actually doing right now:
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
Every long-running entry was the same query. Twenty copies of it, each a few seconds deep. That's your culprit: not twenty different slow things, one slow thing run twenty times concurrently.
reproducing it honestly
The query looked innocent. A join across an orders table and a status-history table, filtered by a customer and a date range, ordered by date. In dev it returned instantly, because dev had ten thousand rows and production had forty million. You cannot reason about a plan on the wrong data volume, so the first real step was getting production-scale statistics, which here meant running EXPLAIN (ANALYZE, BUFFERS) against a replica.
EXPLAIN ANALYZE is the whole game. EXPLAIN alone gives you the planner's guess; adding ANALYZE actually runs it and shows you real row counts and timings, and BUFFERS shows how much it had to read. The line that mattered:
Seq Scan on status_history (cost=0.00..1850000 rows=12 width=48)
(actual time=0.3..3800 rows=11 loops=1)
Filter: (order_id = ANY ...)
Rows Removed by Filter: 39000000
There it is. A sequential scan over forty million rows to find eleven. "Rows Removed by Filter: 39000000" is the tell: the database read the entire table and threw nearly all of it away. The planner estimated 12 rows, found 11, so its estimate was fine. It just had no fast way to get those rows.
why it changed
The interesting question is why it was ever fast. Early on, the table was small enough that a sequential scan was genuinely the cheapest option, and the planner correctly chose it. As the table grew, a sequential scan got linearly more expensive while an index lookup would have stayed roughly flat. There was no index on status_history.order_id, so when the table crossed the point where a scan became ruinous, the planner had nothing better to switch to. It didn't pick a bad plan; it picked the only plan, and the only plan got slowly worse with every order we took.
the fix, and checking the fix
The fix was a single index:
CREATE INDEX CONCURRENTLY idx_status_history_order_id
ON status_history (order_id);
CONCURRENTLY matters in production: it builds the index without taking a write lock on the table, so the application keeps serving while it builds. It's slower and can't run in a transaction, but it doesn't stop the world, which is the entire point during an incident.
After the index, the same EXPLAIN ANALYZE showed an Index Scan reading a handful of buffers instead of the whole table, and the query dropped back under 5ms. Connections were released in milliseconds, the pool drained, the 503s stopped.
the lessons I actually keep
A few things from this that I now reach for by reflex:
- The loudest alert is rarely the root cause. Pool exhaustion, timeouts, and 5xxs are downstream of something. Follow them upstream.
pg_stat_activityduring an incident, andpg_stat_statementsfor the slow burn, tell you where the time goes far faster than guessing.- "Rows Removed by Filter" in a plan is a flashing light. It means you read data you didn't need, which is almost always a missing or unused index.
- Test query plans at production data volume, not dev volume. A plan is a function of statistics, and your dev statistics are a lie.
- Performance regressions don't always have a commit. Sometimes the change is just time, and the data crossing a threshold you couldn't see.
The annoying truth is this was findable at any point in those eighteen months. The index would have been free to add when the table was small. It only became urgent when it became an outage, which is the natural state of every problem nobody is paid to look for.