The database wasn't falling over. That was the problem. If it had fallen over we'd have noticed in week one and fixed it in an afternoon. Instead it just got a little worse every day, the way a tyre goes down: nothing dramatic, until one morning you're sitting on the rim.
The symptom, when we finally articulated it, was that page loads on the internal dashboard had crept from "instant" to "go and make tea". Nobody could say when. Everyone assumed it had always been a bit slow. This is the most dangerous category of performance problem: the one that arrived gradually enough that the baseline moved with it.
the first wrong turn
My first instinct, and I'm not proud of it, was to blame the application. We'd shipped a fair bit recently. So I spent the better part of a day reading through the dashboard controller, adding timers, convincing myself that some N+1 in the ORM was the culprit. There were a couple of N+1s, because there always are, but fixing them changed nothing meaningful. The page was still slow.
The lesson I keep relearning: measure before you theorise. I had a theory and went looking for evidence to support it, which is exactly backwards.
actually looking
So I turned on pg_stat_statements, which I should have done first, and waited an hour for it to collect something representative. The output is unglamorous and enormously useful: every normalised query, how many times it ran, total time, mean time. You sort by total time and the truth is usually sitting in the top three rows.
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
There it was. One query, called on every dashboard load, accounting for something absurd like 70% of total database time. Mean time north of two seconds. It was an activity feed: "show me the most recent events for things this user can see", with a join, a filter on a status column, and an ORDER BY created_at DESC LIMIT 50.
I pulled it out and ran EXPLAIN ANALYZE against it directly.
EXPLAIN ANALYZE
SELECT e.*
FROM events e
JOIN memberships m ON m.org_id = e.org_id
WHERE m.user_id = $1
AND e.status = 'visible'
ORDER BY e.created_at DESC
LIMIT 50;
Sequential scan on events. The whole table. Then a sort. Then a limit that threw almost all of it away. The planner was reading every row, sorting the lot, and handing back fifty. Early on, when events had a few thousand rows, this was free. By December it had several million, and "scan everything and sort" is a line that goes up and to the right with the row count. Nothing in the code had changed. The data had.
the fix, and the boring truth about it
There was no index that matched the query's actual access pattern. There was an index on created_at alone, and one on org_id, but nothing that let Postgres walk rows in the order it needed whilst filtering on status. So it gave up and scanned.
The fix was a composite index, ordered to match the query:
CREATE INDEX CONCURRENTLY idx_events_org_status_created
ON events (org_id, status, created_at DESC);
CONCURRENTLY because the table was live and I didn't fancy holding a write lock on several million rows during business hours. It takes longer to build that way and it can fail and leave an invalid index behind, so you check for that afterwards, but it doesn't block writes.
After it built, the same EXPLAIN ANALYZE came back with an index scan and a mean time in single-digit milliseconds. Roughly a three-hundred-fold improvement on the worst query in the system, from one CREATE INDEX. The dashboard was instant again. People noticed within the hour and assumed we'd "added more servers". We had not.
what I took from it
A few things stuck.
The first is that slow-and-gradual is worse than fast-and-loud, because loud failures get fixed and quiet ones become the new normal. We had no alert on query latency, only on availability, so a query that doubled in time every few weeks tripped nothing until it was painful.
The second is that pg_stat_statements should be on from day one. It costs almost nothing and it turns "the site feels slow" into "this exact query, this many times, this much time". You stop guessing.
The third is more uncomfortable. The query was fine when it was written. The index would have been pointless then; the table was tiny and a scan was faster than an index lookup. The code rotted in place whilst the data grew underneath it. There was no bug to catch in review. The only defence is watching the actual behaviour over time, not the code at a single moment.
We've since added a simple check: a weekly job that runs EXPLAIN on our top queries by call count and flags any that have switched to a sequential scan. It's caught two more before they became a Tuesday afternoon. Cheap insurance against the slow puncture.