a database query that was quietly killing us

A graph on a server dashboard

Nothing was on fire, which was the problem. If something falls over you get paged, you fix it, everyone goes home. This was slower and meaner: a steadily climbing database load over a fortnight, the kind that never trips an alarm because no single moment is bad enough. Just a service that felt a bit worse every day until one Monday it felt genuinely awful.

finding it

The CPU graph on the primary had a lovely smooth upward slope and absolutely no spikes to blame. So I went looking for the query that was doing the work, which on Postgres means pg_stat_statements. Order by total_time and the offender is usually obvious. It was.

SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

Top of the list, by a country mile, was a lookup that on its own looked completely innocent. A SELECT against the events table filtered by user_id. Mean time a few milliseconds, nothing alarming. The killer was the calls column next to it: this thing ran tens of thousands of times a minute, because it sat behind a hot code path that fired on every page load. A few milliseconds is fine. A few milliseconds, sixty thousand times a minute, against a table that had grown past the point where the planner gave up, is not.

Lines of code on a screen

the actual cause

EXPLAIN ANALYZE told the whole story in one word: Seq Scan. The events table had quietly grown to several million rows over the year, and the query filtered on user_id, and there was no index on user_id. So every single call was reading the entire table to find a handful of rows. For a long time the table was small enough that nobody noticed. Then it wasn't, and the scan got a little more expensive every day, and that was my smooth upward slope.

This is the most boring performance bug there is, and also one of the most common. A missing index does not announce itself. It just gets imperceptibly worse as the data grows, until one day the curve catches up with you.

the fix

CREATE INDEX CONCURRENTLY idx_events_user_id ON events (user_id);

CONCURRENTLY matters here: it builds the index without taking a lock that blocks writes, which on a live primary is the difference between a fix and a second outage. It takes longer and uses more resources, but it does not stop the world.

The query dropped from a sequential scan to an index lookup, mean time fell through the floor, and the database CPU graph did the nicest thing a graph can do, which is fall off a cliff and stay there. Load back to single digits within the minute.

The annoying part is that nobody did anything wrong. The query was written when the table was small and correct for that table. It just outgrew its assumptions silently, and the only reason it ever became my problem is that nothing alerts on "slightly worse than yesterday". I have since put a check in place for unindexed sequential scans on the big tables, so the next one of these introduces itself before the graph does.