Ramblings of an aging IT geek
← Ramblings of an aging IT geek
performance

the query that was quietly killing us

How one missing index turned a routine list endpoint into a creeping database outage, and how we found it.

A latency graph over a server rack

Nothing was on fire. That was the problem. The dashboards were green, the error rate was flat, and yet the database had been getting slowly, steadily worse for weeks, in the way that a frog allegedly doesn't notice the water heating. Every morning the p99 on one endpoint was a little higher than the day before, and every morning I told myself it was probably the data growing and we'd deal with it later. Later arrived as a pager at the wrong time.

the symptom

The endpoint was a boring list view. Give it a user, get back their recent items, paged, newest first. Under light load it returned in a few milliseconds. Under real load it didn't, and the connection pool started backing up behind it. Once the pool was saturated, requests that had nothing to do with this endpoint started timing out too, because they couldn't get a connection. That's the nasty thing about a slow query in a shared pool: the blast radius is everything, not just the offender.

finding it

The first useful move was to stop guessing and turn on the slow query log. We set the threshold low for an hour during a quiet window:

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.2;

One query dominated the log. Pulled it out, ran EXPLAIN on it, and there it was in plain text.

EXPLAIN output showing a full table scan

EXPLAIN SELECT id, title, created_at
FROM items
WHERE user_id = ?
ORDER BY created_at DESC
LIMIT 20;

The plan showed a full table scan, type ALL, rows examined in the hundreds of thousands, and Using filesort. There was an index on user_id, which is why it had been fine for ages: when each user had a handful of rows, filtering by user and then sorting twenty rows was nothing. But the table had grown, some users had accumulated thousands of items, and the planner was choosing to scan and sort rather than use the narrow index. The ORDER BY created_at DESC was the part with no index to lean on, so every request did the sort by hand.

the fix

A composite index that matches the access pattern exactly:

CREATE INDEX idx_items_user_created
ON items (user_id, created_at);

The order of the columns matters. user_id first because that's the equality filter, created_at second so the index can also satisfy the sort. With that in place the planner stopped scanning, used the index for both the filter and the ordering, and the LIMIT 20 meant it could stop after reading twenty index entries instead of sorting the lot.

The query plan after adding the composite index

The same query went from examining hundreds of thousands of rows to examining twenty. p99 on the endpoint dropped back under the threshold where it had quietly started years of bad habits, and the connection pool stopped backing up, which fixed the timeouts on every other endpoint as a free side effect.

what I actually learned

The fix was a one-line index. The lesson was not. A few things I'm trying to hold onto:

  • Slow doesn't mean broken, and that's exactly why it's dangerous. A query that fails loudly gets fixed on day one. A query that degrades gracefully gets ignored until it takes the whole pool down with it.
  • EXPLAIN is not advanced tooling. It's the first thing to reach for, not the last, and I should have run it the first morning I noticed the trend rather than the morning of the pager.
  • Indexes need to match how you actually query, including the sort, not just the filter. An index on user_id looked correct and was correct, right up until the ORDER BY became the expensive part.
  • A slow query in a shared connection pool is everyone's problem. Isolate it if you can, but mostly just don't have one.

None of this is clever. The clever bit, if there is one, is noticing the slow creep before it becomes an incident, and I plainly didn't. So now there's a graph of p99 per endpoint with an alert on the trend, not just the absolute value. The next frog gets to climb out of the pot.