the query that worked fine until it didn't

A performance graph on a server dashboard

The query had been in production for over a year and nobody had ever looked at it twice. That is the dangerous kind. Not the query that falls over on day one and gets fixed, but the one that works perfectly, builds up a year of trust, and then quietly starts dragging the whole service down so gradually that everyone assumes it must be something else.

The first I knew of it was the p99 latency creeping up over a few weeks. Not a spike, a creep. The sort of thing where you look at the graph and think the axis must have changed. Average response was fine. The median was fine. But the tail was getting fatter and the on-call pages were starting, always at the worst time, always "the dashboard is slow", never anything you could pin down.

finding the culprit

Postgres makes this part easy if you've turned the right things on, and we had, mostly by luck rather than foresight. pg_stat_statements is the first place to look. It aggregates every query by normalised shape and tells you total time, mean time and call count. Sort by total time and the offender is almost always sitting at the top, smug.

SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

There it was. A query that ran constantly, a fairly innocent lookup joining an events table to a users table, with a mean execution time that had quietly grown from single-digit milliseconds to the better part of a second. Multiply by the call count and it was eating more database time than everything else put together.

So I pulled the actual query, ran it with EXPLAIN (ANALYZE, BUFFERS), and the plan told the whole story in one line.

Code and query output on a screen

the plan had flipped

The events table had a perfectly good index on the column we were filtering by. A year ago, the planner used it. Now it wasn't. It had switched to a sequential scan of the entire table, reading millions of rows to find the few hundred it needed, every single time the query ran.

Seq Scan on events  (cost=0.00..184213.00 rows=1 width=64)
  (actual time=0.041..612.882 rows=287 loops=1)
  Filter: (account_id = 4471)
  Rows Removed by Filter: 8420000

Rows Removed by Filter: 8420000. The planner read eight and a half million rows to throw away eight and a half million rows. The index was sitting right there, unused.

Why does a planner abandon a good index? Because the planner doesn't know what's in the table, it knows what the statistics say is in the table, and the statistics had gone stale. Specifically, the planner's estimate of how many distinct account_id values existed, and how the rows were distributed across them, was wildly out of date. The table had grown enormously and the distribution had become very skewed, a handful of huge accounts and a long tail of tiny ones. With bad statistics, the planner estimated the filter would match a large fraction of the table, decided the index wasn't worth the random I/O, and chose the sequential scan. Reasonable logic, wrong inputs.

the fix, in two parts

The immediate fix was almost insultingly simple:

ANALYZE events;

Run a fresh ANALYZE, the planner gets accurate statistics, re-costs the query, and goes straight back to the index scan. The query dropped from ~600ms to under 2ms the instant the statistics updated. The latency graph fell off a cliff in the good direction and the pages stopped.

But "run ANALYZE" is a fix for the symptom, and I have learned to be suspicious of fixes that feel that easy. The real questions are why the statistics went stale in the first place, and why nobody noticed for weeks.

Autovacuum is supposed to keep statistics current, and it runs ANALYZE automatically when a table changes enough. The default threshold scales with table size: it triggers after roughly 10% of the rows have changed, plus a fixed floor. On a table with millions of rows, 10% is hundreds of thousands of changes between automatic analyses, and our skewed distribution meant the stats could drift badly inside that window. The defaults are tuned for a generic table. Ours was not generic.

So the durable fix was to make autovacuum analyse this table more aggressively, by lowering its scale factor for that table specifically:

ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.02);

Two percent instead of ten. The statistics now refresh roughly five times more often on the table that actually needs it, and we leave every other table on the sensible defaults.

what i took away from it

The bug wasn't really in the query. The query was fine. The bug was an assumption baked in a year earlier, that the data would keep looking roughly the way it looked on launch day, and a planner quietly making the best decision it could on out-of-date information.

Two things stuck with me. First, slow creep is so much harder to catch than a sudden break, because a break gets investigated and a creep gets rationalised. "It's probably just more traffic." It wasn't. Get the tail latency on a graph and watch the trend, because the trend is the warning the spike never gives you. Second, EXPLAIN ANALYZE is the single highest-leverage thing you can learn about a relational database, and most of us, myself very much included, treat it as a thing to reach for only once everything is already on fire. Read the plan. The database will tell you exactly what it's doing if you ask it. Ours was reading eight million rows to find two hundred, in plain text, and had been for weeks, just waiting for someone to look.