the N+1 that hid behind a fast endpoint

A latency graph beside a server

This one was quiet because it was fast. Or it looked fast: the endpoint that rendered a team's dashboard came back in around 90ms, the p99 graph was flat, and nobody had complained. The only reason I looked at all was that the database CPU graph had been drifting upward for a fortnight with no corresponding rise in request volume. More work, same number of requests, means each request is doing more, and I wanted to know what.

The endpoint loaded a list of projects, and for each project a count of open tasks, the last activity, and the owner's name. Classic dashboard. The ORM made it pleasant to write and a nightmare to read at runtime. One query to fetch the projects, then, for each project in the loop, three more queries to fetch its bits. Twenty projects on a busy team's dashboard, and that's one plus sixty queries per page load. The textbook N+1, hiding in plain sight because each of those sixty queries was a primary-key lookup returning in under a millisecond.

I turned on query logging for the endpoint and watched a single request scroll past for an uncomfortably long time.

A wall of near-identical SELECT statements in the query log

SELECT * FROM tasks WHERE project_id = 41 AND status = 'open';
SELECT * FROM users WHERE id = 7;
SELECT * FROM tasks WHERE project_id = 42 AND status = 'open';
SELECT * FROM users WHERE id = 7;
SELECT * FROM tasks WHERE project_id = 43 AND status = 'open';
SELECT * FROM users WHERE id = 7;

Note the repeated users WHERE id = 7. The same owner, fetched fresh for every project they owned, because nothing was caching within the request. So it wasn't even sixty distinct lookups, it was sixty lookups with a lot of pointless repetition. Each one cheap, the sum of them not cheap at all, and crucially the cost lived on the database rather than in the response time the dashboards were watching. The connection was held just long enough, sixty round trips of network and parse and plan, to keep a database core busy that didn't need to be.

The fix was to stop looping and let the database do the joining, which is the thing it is actually good at. Eager-load the relations, fold the per-project counts into one grouped query, and fetch the owners in a single IN rather than one at a time. One request went from sixty-one queries to three. Database CPU dropped back to where it had been a fortnight earlier, and the response time barely moved, because the response time had never been the symptom. It just got a touch faster as a bonus.

The thing I keep relearning: response time is not the same as load, and a fast endpoint can still be an expensive one. The graph that caught this was database CPU per request, not latency, and I only had that graph by luck. So now I watch query count per endpoint as a first-class metric, because an endpoint quietly firing sixty queries is going to be a problem eventually, even while it's still answering in 90ms and keeping everyone happy.