The service wasn't down. That was the problem. If it had fallen over we'd have noticed in minutes. Instead it just got slower, a little each week, in the patient way that nobody pages you for, until one afternoon a routine page load took eleven seconds and somebody finally said something out loud.
This is the story of one query that was quietly killing us, and it's a familiar one: a thing that was completely fine at launch and became a liability only because the data underneath it grew while nobody was watching.
Slow is harder than broken
A crash has a timestamp. You bisect, you find the deploy, you roll back, you go home. A gradual slowdown has none of that. There's no single commit to blame, because the code didn't change. The data did. The query that ran in two milliseconds against ten thousand rows runs in two seconds against ten million, and the line in between is so gentle that no alert ever fires.
Our dashboards were averages, which is its own crime. Average response time looked merely "a bit worse". It was the p99 that was on fire, and a small slice of requests, the ones that happened to hit the worst case, were taking long enough to tie up connection pool slots and drag everything else down with them. Averages hide exactly the tail that hurts you.
Finding it
The first useful move was to stop looking at application metrics and ask the database directly. Postgres will happily tell you where the time is going if you turn on pg_stat_statements and sort by total time:
SELECT
substring(query, 1, 60) AS query,
calls,
mean_exec_time,
total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
One row dwarfed the rest. A lookup on an events table, called constantly, with a mean execution time that had crept into the hundreds of milliseconds. Calls times mean time, and it was eating the database alive, one harmless little query at a time.
Then EXPLAIN ANALYZE, which is the moment of truth:
Seq Scan on events (cost=0.00..245831.18 rows=1 width=84)
(actual time=412.331..1180.992 rows=1 loops=1)
Filter: (account_id = 41982 AND status = 'pending')
Rows Removed by Filter: 9883402
There it is. Seq Scan, and "Rows Removed by Filter: 9,883,402". To find a handful of pending events for one account, Postgres was reading the entire table and throwing nearly ten million rows away every single time. At launch the table had a few thousand rows and a sequential scan was instant, so nobody added an index, because why would you. The table grew. The scan didn't get an alert. It just got slower.
The fix, and why it wasn't quite the obvious one
The obvious fix is an index, and that's right, but the naive one isn't the best one. An index on account_id alone helps, but the query also filters on status, and the rows we want are a tiny fraction (status = 'pending') of a column that's mostly other values. A partial composite index targets exactly the working set:
CREATE INDEX CONCURRENTLY idx_events_pending
ON events (account_id)
WHERE status = 'pending';
Two things matter here. CONCURRENTLY, so building the index doesn't take a write lock on a live, busy table, it's slower to build but it doesn't take an outage to deploy. And the partial WHERE status = 'pending' clause, which keeps the index small and means it only contains the rows we ever actually look for. The index is a fraction of the size of a full one on the same columns, and it stays hot in memory.
After it landed the EXPLAIN changed to an Index Scan, the query dropped from over a second to well under a millisecond, and the p99 on the whole service fell off a cliff in the good direction. The eleven-second page load went back to feeling instant.
What I actually took from it
The fix took ten minutes. Noticing took two years, and that's the real lesson.
A few things I've changed since:
- Alert on p99, not averages. The average was lying to us the entire time; the pain was always in the tail.
- Keep
pg_stat_statementson everywhere. It costs almost nothing and it is the first place I look now, before I touch a single line of application code. - Treat "fine at current scale" as a dated statement. Every query that does a sequential scan is a time bomb with a fuse the length of your data growth. It was fine. The word doing the work in that sentence is "was".
None of this is clever. The query was simple, the fix was a one-liner, and any of us would have spotted the missing index in review if the table had been large on day one. It wasn't, and that's the trap. The slow disasters are the ones that were genuinely correct when you wrote them and only became wrong while you weren't looking.