The database wasn't down. That was the problem. If it had fallen over we'd have known in seconds, the pager would have gone off, someone would have been woken up, and the whole thing would have been dealt with by breakfast. Instead it just sat there, slightly too warm, with CPU pegged at a number that was high enough to worry about but low enough that you could talk yourself out of worrying. For about three weeks I talked myself out of it.
The symptom, when I finally stopped ignoring it, was latency. Not on any one thing in particular. Everything was a bit slow. Page loads that should have been 80ms were 200ms, sometimes 600ms, occasionally a second and a half. Nothing timed out. Nothing errored. The graphs were a fuzzy band of "fine, mostly" that drifted upward over the weeks in a way you only notice if you put two months on the same axis. Which I eventually did, and then I couldn't unsee it.
finding the actual offender
The instinct is to go straight to the slowest query. That's usually wrong, because the slowest query is often something that runs once a day and nobody cares about. What you want is total time, which is the slow-ish query that runs constantly. pg_stat_statements is the right tool and I should have reached for it on day one instead of day twenty.
SELECT
substring(query, 1, 80) AS query,
calls,
round(total_exec_time::numeric, 0) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Top of the list, by a country mile, was not some gnarly reporting monster. It was a small, innocent-looking query behind the internal ops dashboard. Mean execution time about 40ms, which is nothing. But it was being called roughly every thirty seconds by every open dashboard tab, and there were a lot of open dashboard tabs, because the dashboard was the thing everyone left open all day. The total time it had burned dwarfed everything else combined.
40ms times a couple of hundred thousand calls a day is a lot of database that isn't doing anything useful.
the plan tells you everything
The query itself was the kind of thing you write without thinking. Count of events in the last hour, grouped by status, for a particular tenant. Three columns, one WHERE clause, a GROUP BY. The sort of query you'd nod at in a code review and move on.
EXPLAIN (ANALYZE, BUFFERS) told the real story.
Seq Scan on events (cost=0.00..284103.12 rows=1842 width=12)
(actual time=0.034..38.992 rows=1773 loops=1)
Filter: ((tenant_id = 4012) AND (created_at > (now() - '01:00:00'::interval)))
Rows Removed by Filter: 9214550
Buffers: shared hit=58201 read=12044
Rows Removed by Filter: nine million, two hundred and fourteen thousand. To return seventeen hundred. Every thirty seconds. The table had grown past the point where a sequential scan was cheap, quietly, over months, and nobody had been watching the one number that mattered. When the table was small the scan was free. It stopped being small in about January and nobody noticed because, again, nothing broke.
There was no index on created_at. There was no composite index on (tenant_id, created_at), which is what this query actually wanted. The table had grown an events worth of history and the planner had no choice but to read all of it, throw away 99.98% of it, and hand back the dregs.
the fix, and the boring lesson
The fix was one line.
CREATE INDEX CONCURRENTLY idx_events_tenant_created
ON events (tenant_id, created_at);
CONCURRENTLY because the table was live and I wasn't about to take a lock on the busiest table we had to fix a problem I'd let slide for three weeks. It took a few minutes to build. The same EXPLAIN afterwards:
Index Scan using idx_events_tenant_created on events
(actual time=0.021..0.388 rows=1773 loops=1)
Index Cond: ((tenant_id = 4012) AND (created_at > (now() - '01:00:00'::interval)))
Buffers: shared hit=204
38ms to 0.4ms. Seventy thousand buffers to two hundred. The CPU graph for the whole instance dropped by about a third within the hour and stayed there. The fuzzy band of page-load latency snapped back down to where it had been in December. Nobody noticed that either, which is rather the point of fixing things before they become incidents.
A few things I took away from it, mostly aimed at my past self.
The dangerous queries are the cheap ones that run constantly, not the expensive ones that run rarely. Mean latency is a liar; total time is the truth. Sort pg_stat_statements by total_exec_time and the actual problem is usually sitting right at the top looking embarrassingly obvious in hindsight.
A sequential scan that's fine today is a time bomb with a fuse the length of your table's growth curve. The query didn't change. The data did. Nothing alerted on it because "slightly slower than last month" is not a condition anyone writes an alert for, and maybe we should.
And "the database is at 60% CPU" is not a steady state, it's a warning that you've decided to read as a status. I'd looked at that number a dozen times and filed it under "busy system, busy database, fine". It wasn't fine. It was a missing index wearing the costume of a healthy load average, and it had been quietly killing us for the better part of a quarter.
The whole thing took an afternoon once I actually looked. The annoying part is that it could have taken an afternoon three weeks earlier, if I'd treated "a bit slow" as a bug rather than a mood.