Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

When the Code Was Right and I Was the Bug

A query that ran fine in staging and crawled in production sent me hunting through the code for an hour, when the real difference was an assumption I'd made about the data.

A terminal showing a bug

The symptom was a dashboard endpoint that took forty milliseconds in staging and eleven seconds in production. Same code, same deploy, same query. When something is that much slower in one environment and fine in another, every instinct says it's an environment problem: a missing index, a config difference, a connection pool starved under real load. I spent a good hour treating it as one.

I diffed the database configs. Identical, near enough. I checked the indexes existed in production. They did. I ran EXPLAIN in staging and got a clean index scan, exactly what you'd want. I was building a case that production's planner was somehow making a worse decision, which does happen, and I was halfway to blaming statistics being out of date.

Then I ran the same EXPLAIN ANALYZE in production and actually read it instead of skimming for the word "index". The plan was the same. The index was used. The difference was the row count flowing through it: staging had a few thousand rows behind that join, production had several million. The query wasn't doing anything different. It was doing the same correct thing to a hundred times more data, and the bit of the plan that was linear in row count was the bit that had quietly become the whole cost.

Source code on a screen

My assumption, never stated, was that staging was representative. That because the schema matched and the query matched and the indexes matched, a result that was fast in staging would be fast in production by the same margins. Staging had been seeded with a tidy, small, well-behaved dataset years ago and nobody had ever made it lopsided the way real data is lopsided. So the query that filtered down to a handful of rows in staging filtered down to a small city's worth in production, and the part of the plan I'd glanced past, an in-memory sort that fit comfortably on a few thousand rows, was spilling to disk on millions.

The actual fix was boring once I understood it: a composite index that let the database filter and order in one pass, so the sort never had to materialise the whole set. Forty milliseconds in both environments after that. But the fix isn't the point. The point is that for an hour I was debugging a problem that didn't exist, a phantom production-only misconfiguration, because I'd assumed the two environments were the same in the one dimension that actually differed. The code was right the entire time. The bug was that I'd let "it works in staging" mean more than it was ever entitled to mean. Now I check the row counts before I check anything clever.