the query that ate a node every afternoon

A latency graph over a server rack

The shape of this one was familiar and still caught me out. Every weekday at roughly half past two, one of the read replicas would spike CPU, lag would climb, and then it would all settle down again before anyone got round to looking properly. No alert tripped because nothing crossed a threshold for long enough. It was just a daily wobble that everyone had quietly decided was normal. It was not normal.

chasing the wobble

The first job was to stop treating it as a vibe and pin it to a query. Postgres makes this easy if you have pg_stat_statements loaded, which we did, so I snapshotted the stats either side of the spike and diffed by total time.

SELECT query, calls, total_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

One statement was doing almost nothing in call count and an enormous amount in total time. It was a reporting query feeding an internal dashboard that someone had set to auto-refresh. A human opened it after lunch, the refresh kicked in, and the query went off to aggregate a few million rows with a GROUP BY over a column that had no usable index.

EXPLAIN output showing a sort spilling to disk

EXPLAIN (ANALYZE, BUFFERS) told the rest of the story. The aggregate was sorting on disk, external merge, tens of thousands of temp blocks written every run. On a replica with a modest work_mem that sort never fit in memory, so it spilled, hammered the disk, and dragged everything sharing the box down with it. The query itself returned a tidy little table of daily counts. The cost of producing that table was a replica falling over.

the fix, and the better fix

The immediate fix was unglamorous: an index that matched the grouping and a bump to work_mem for that role so the sort had a chance of staying in memory.

CREATE INDEX CONCURRENTLY idx_events_day
ON events (event_date, account_id);

That got the afternoon spike down from a genuine incident to a faint ripple. But the real fix was admitting that recomputing a daily aggregate from raw rows on every dashboard refresh is a silly thing to do. The numbers only change once a day. So the counts now land in a small summary table written by a nightly job, and the dashboard reads from that. The expensive query runs once, at three in the morning, when nobody is looking and nothing else needs the disk.

what I took away

The lesson is the same one I keep relearning. A query that is slow but succeeds is more dangerous than one that fails, because failure gets fixed and slowness gets tolerated. This one had been tolerated for months, hiding inside a daily rhythm that everybody had stopped questioning.

The other half of it is that pg_stat_statements is the cheapest insurance you can buy. It costs almost nothing to leave running and it turns "the database feels slow after lunch" into a query you can name, explain, and fix in an afternoon. I should have looked at it the first week, not the third month.