Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the code was fine, i was the bug

A two-day chase through a "broken" timeout that turned out to be working exactly as written, just not as I had assumed.

A terminal full of stack traces

The service was timing out. Intermittently, under no obvious load, on one specific endpoint, and only in staging. The kind of bug that makes you distrust the universe before you distrust yourself, which is precisely the wrong order.

I spent the first day reading the request handler. It made an outbound call to an internal pricing API, applied a small transform, and returned. There was a context with a five second deadline. The pricing call usually took 40ms. So a five second deadline firing made no sense, and I burned hours convincing myself the HTTP client was leaking connections, or the deadline was being inherited wrong, or some middleware was wrapping the context and resetting it. I added logging at every layer. The numbers all looked fine. The call took 40ms. The deadline was five seconds. And still, every so often, a timeout.

Logs and code on screen

The assumption I never questioned was the word "the". The outbound call. I had read the handler and decided it made one request. It made one request in the code I was looking at. But the transform step called a helper, and that helper, three commits earlier and by someone else, had grown a second call to enrich the response with currency data. That second service was the slow one. It sat behind a connection pool sized at ten, and when ten requests piled up, the eleventh waited. Not for the pricing API. For a free slot in a pool I did not know existed.

// what I thought the handler did
price := pricing.Get(ctx, id)

// what it actually did, two functions deep
price := pricing.Get(ctx, id)
fx := currency.Enrich(ctx, price) // pool size 10, no timeout of its own

The five second deadline was real. The 40ms call was real. The pool starvation under bursty traffic was real too, and it lived in the gap between what the code did and the simplified model I had built in my head on day one. Once I instrumented currency.Enrich directly, the queue wait time lit up like a bonfire.

The fix was boring: a tighter per-call deadline, a larger pool, and a metric on pool saturation so it screams next time. The lesson was less boring. I had treated my reading of the code as the source of truth and the behaviour as the anomaly. It is always the other way round. The machine is doing exactly what you wrote, with perfect fidelity, including the parts you forgot you wrote.

Now when something "makes no sense", I try to catch myself saying that word, because it almost always means I have a confident mental model that has quietly diverged from the running system. The bug was not in the code. It was in the version of the code I was carrying around in my head.