Ramblings of an aging IT geek
← Ramblings of an aging IT geek
performance

the query that was fine until it wasn't

A query that ran in milliseconds for two years and then started taking down the database, and how a missing index plus a planner flip explained the whole thing.

A performance graph on a server monitoring screen

The query had been in production for two years. It was fine. Then one Tuesday it was not fine, and it took the database with it, and the most maddening part was that nobody had touched the query, the code around it, or the schema. It had simply turned from a millisecond lookup into a thirty-second table scan overnight, the way these things do, with no human to blame.

Here is the short version, because the lead matters: the table had grown past the point where the planner's assumptions held, an index that had been quietly carrying the whole thing was no longer being chosen, and the row estimates were wildly wrong because the statistics were stale. None of that is exotic. All of it is invisible until the day it is the only thing that matters.

How it showed up

Not as a slow query. That would have been too kind. It showed up as the database periodically falling over under connection pressure, every connection blocked, the application timing out, and the on-call dashboard a wall of red with no single obvious cause. The query that was actually responsible was buried in the noise because individually it was not always slow. It was slow under one particular combination of parameters that had become more common as the data grew.

That is the trap with this class of problem. The aggregate metrics look like a capacity issue, so you reach for more capacity, and the bigger box buys you a week before you are back where you started, poorer and no wiser.

A screen showing a database query execution plan

Finding it for real

The thing that broke it open was looking at the slow query log properly rather than at the dashboards, and then running EXPLAIN ANALYZE on the offending statement with realistic parameters. Not EXPLAIN, which gives you the plan and the estimates. EXPLAIN ANALYZE, which actually runs it and shows you estimated rows against actual rows, side by side, so you can see exactly where the planner is lying to itself.

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM events
WHERE account_id = $1
  AND created_at >= $2
ORDER BY created_at DESC
LIMIT 50;

The output told the whole story. The planner expected to match a few hundred rows and chose a sequential scan, reasoning that for a few hundred rows out of a table it believed was small, a scan was cheaper than an index lookup. The actual row count was in the millions, and the table was no longer small. Estimated rows: a few hundred. Actual rows: north of two million. When the gap between those two numbers is three or four orders of magnitude, you have found your problem, and it is almost always statistics.

The composite index on (account_id, created_at) that should have served this query existed. The planner just was not choosing it, because based on its stale statistics it genuinely believed the scan was faster. It was optimising correctly for a world that no longer existed.

The fix, in order of how much it helped

First, ANALYZE on the table to refresh the statistics. This alone made the planner start choosing the index again immediately, and the thirty-second query dropped back to single-digit milliseconds. That is the whole incident resolved, in one command, which is both satisfying and a little embarrassing.

Then the real fix, which is making sure it does not silently happen again:

  • Turned up autovacuum's aggressiveness on that table, because the default scale factor means a big table gets analysed proportionally less often as it grows, which is exactly backwards from what you want.
  • Added an index that better matched the access pattern, so the planner had a more obviously correct choice and less room to second-guess itself.
  • Put the slow query log somewhere I would actually see it, with a threshold low enough to catch the next one while it is annoying rather than fatal.
ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.02);
ANALYZE events;

What I took from it

The query did not change. The data did, and the planner's model of the data did not keep up, and that gap sat there as a loaded gun for months before someone pulled the trigger by happening to query a large account. Nothing was wrong with the code. Everything was wrong with the assumptions underneath it.

So now, when something that was fine forever is suddenly not fine, and nobody changed anything, my first question is not "what did we deploy" but "what grew". Tables grow. Cardinalities shift. The plan the optimiser picked when the table had ten thousand rows is not the plan it should pick at ten million, and unless the statistics are kept honest it will go on picking the old one until the day it cannot.

EXPLAIN ANALYZE with real parameters, estimated rows against actual rows. That one habit has found more of these than any dashboard ever has. The dashboard tells you the patient is sick. The plan tells you why.