Ramblings of an aging IT geek
← Ramblings of an aging IT geek
performance

the query that was fine until it wasn't

A slow Postgres query that looked harmless in isolation was quietly burning the database alive under load, and finding it meant trusting the numbers over the obvious suspects.

A performance graph on a server

The symptom was vague in the worst way. Once or twice a day the application would go sluggish for a few minutes, request latency would climb, a handful of timeouts would fire, and then it would recover on its own before anyone could get a useful look. No errors to speak of. No deploy correlating with it. The kind of thing that gets a Slack message reading "anyone else seeing this?" and then nothing, because by the time three people have looked it's gone.

The thing that kept it alive for weeks was that everything looked fine in isolation. CPU on the app servers was unremarkable. The database host wasn't pegged. Connection counts were within normal range. Every individual graph said "nothing to see here", which is exactly how a death by a thousand cuts hides.

following the load, not the suspicion

I had a strong hunch it was a recent feature, because I always have a strong hunch and it's usually wrong. So I made myself ignore it and go to the data. The first genuinely useful move was turning on log_min_duration_statement in Postgres, set to 500ms, so the database itself would tell me which statements were slow rather than me guessing from the application side.

log_min_duration_statement = 500
log_line_prefix = '%m [%p] %u@%d '

Within a day the log had a clear repeat offender. One query, not the one I suspected, showing up slow specifically during the bad windows. On a quiet system it ran in 40ms. During an incident it was taking eight or nine seconds, and there were dozens of them stacked up at once.

That stacking was the real story. The query wasn't slow because it was complex. It was slow because under a particular pattern of concurrent traffic, lots of instances of it piled up at the same time, each one doing a sequential scan over a table that had grown past the point where that was acceptable. Individually cheap, collectively a stampede.

Query plan output on screen

EXPLAIN ANALYZE made it embarrassingly obvious once I was looking at the right query. A sequential scan over a couple of million rows, filtering on a column that had no index because when the table was new it had a few thousand rows and nobody cared. The planner was doing exactly what you'd expect. We'd just quietly crossed the threshold where "exactly what you'd expect" became "scan the whole table, repeatedly, under load".

the fix, and the part that actually mattered

The literal fix was one index.

CREATE INDEX CONCURRENTLY idx_events_account_created
    ON events (account_id, created_at);

CONCURRENTLY so it didn't take a lock that would have caused the exact outage I was trying to prevent. After that the query dropped to single-digit milliseconds and the bad windows simply stopped. The next week's worth of latency graphs were boringly flat, which is the best outcome any of this work can hope for.

A second query plan view

But the index isn't the interesting part. The interesting part is why it took weeks. A few things conspired:

  • The pain was intermittent and self-healing, so it never sat still long enough to investigate live.
  • Every top-level metric looked healthy, because the problem was a brief concurrency stampede, not sustained saturation.
  • My first instinct pointed at the wrong place, and I'd have wasted days "fixing" an innocent feature if I'd followed it.

The thing that broke the deadlock wasn't cleverness, it was making the database narrate its own slow queries and then believing the log over my own theory. log_min_duration_statement is free, it's been there forever, and I'd somehow never turned it on for this service. Now it goes on by default, threshold tuned so it's signal rather than noise.

There's a wider lesson I keep relearning. Systems rarely fail by falling off a cliff. They fail by slowly drifting past a threshold that nobody is watching, a table that grew, a query that used to be fine, a margin that quietly evaporated. The query didn't change. The world around it did. The only defence is measuring the thing itself rather than the proxies around it, and being willing to be wrong about your first, most confident guess.