The worst outages aren't the ones that page you at 3am. They're the ones that arrive over six weeks, so gradually that everyone adjusts to them and nobody thinks to complain. Ours was a query that got 8ms slower every day. Nobody noticed until a customer politely asked why a page that "used to be instant" now took the better part of four seconds.
the symptom and the lie it told
Response times on one endpoint had been climbing. The dashboards showed the API p99 creeping up, but the application metrics blamed "database time" in the vague way application metrics do, and database CPU looked fine. That last bit is the lie. The database wasn't busy. It was waiting, doing an enormous amount of pointless work very efficiently. Low CPU and high latency together is a strong tell for I/O or a bad plan, not for load.
The temptation at this point is to add hardware or cache the endpoint. Both would have hidden the problem for another month and made it worse to eventually fix. So I went to the slow query log instead, which we'd had switched on the whole time and, to my mild shame, had never once looked at.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
One query dominated the log. It was unremarkable to read: fetch a user's recent activity, filtered by a customer id and ordered by a timestamp. The sort of thing you write on autopilot.
EXPLAIN tells you the truth
EXPLAIN is the single most useful thing in this whole story, and I should have run it first. The output had the word every performance problem eventually shows you:
EXPLAIN SELECT * FROM events
WHERE customer_id = 4172
ORDER BY created_at DESC
LIMIT 50;
type: ALL
possible_keys: NULL
key: NULL
rows: 8400000
Extra: Using where; Using filesort
type: ALL is a full table scan. The database was reading every one of 8.4 million rows, throwing away all but a handful that matched the customer, then sorting the survivors on disk to honour the ORDER BY. There was no index on customer_id at all. There had never been one. It hadn't mattered when the table had ten thousand rows, because scanning ten thousand rows is free. It mattered enormously at eight million, and the cost had grown silently with the data, which is exactly why nobody caught it: the code never changed, only the table did.
That's the trap with this class of bug. A full scan on a small table is a perfectly good plan. It only becomes an outage when the table grows past some threshold, and tables grow when you aren't looking. The query you wrote in year one is a time bomb that the optimiser is happy to keep detonating a little harder every day.
the fix, and resisting the urge to over-fix
The fix was one line:
CREATE INDEX idx_events_customer_created
ON events (customer_id, created_at);
A composite index, customer first because that's the equality filter, then created_at so the index also satisfies the sort. With that in place, EXPLAIN changed to a range lookup over a few dozen rows, Using filesort vanished because the index already provided the order, and the query went from 3,800ms to under 4ms. The graph fell off a cliff in the good direction.
I had to stop myself there. The next instinct is to go index everything, which is how you trade read latency for write latency and a bloated working set. Indexes aren't free: every one of them has to be maintained on insert, update and delete. The discipline is to index for the queries you actually run and can prove are slow, not the ones you imagine you might.
A couple of operational notes from doing this on a live table. Building an index on millions of rows locks things up if you're not careful, so I did it in a maintenance window and used the online DDL path the engine offers so writes weren't blocked. And I added the index on a replica first to time it, rather than discovering the duration on production.
the actual lesson
The database was never the problem. The absence of anyone looking at it was. We had the slow query log on for over a year and treated it as a box ticked rather than a tool used. Now there's a weekly job that surfaces the top few slow queries into a channel we read, so the next creeping query announces itself while it's still merely annoying rather than after a customer has to.
Slow problems are the dangerous ones precisely because they give you time to get used to them. The cure is cheap and boring: turn on the slow query log, actually read it, and run EXPLAIN on anything that touches a table you expect to grow. Do that and the four-second page never happens, because you fixed it when it was forty milliseconds and nobody had noticed yet.