Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the code was fine, my mental model was the bug

A timeout that wasn't a timeout, a retry that made it worse, and the half-day I spent debugging a service when the real fault was a wrong assumption I'd carried for years.

A terminal full of stack traces

The alert said the payments API was timing out. Not all the time. Maybe one request in fifty, climbing slowly through the afternoon, never quite bad enough to page someone but persistent enough that the on-call channel had that low background hum of irritation. I picked it up because I'd written half the service and I was arrogant enough to think I'd find it quickly.

I did not find it quickly. I found it about six hours later, and the thing I found wasn't a bug in the code at all. It was a bug in my head, a thing I'd believed since roughly 2019 that simply wasn't true, and the code was faithfully doing exactly what I'd told it to.

what I assumed

The service calls an upstream provider. We wrap that call in a context with a deadline, retry on failure with a bit of backoff, and surface a clean error if everything fails. Standard stuff. The assumption I'd carried, without ever writing it down or checking it, was: the deadline applies to the whole operation. One request comes in, it gets five seconds total, and within that budget we might retry once or twice, but five seconds is five seconds.

That is not what the code did. That is not, on reflection, what the code could ever have done, because I'd built it out of two pieces that each looked correct in isolation.

ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()

for attempt := 0; attempt < 3; attempt++ {
    resp, err := callUpstream(ctx, req)
    if err == nil {
        return resp, nil
    }
    time.Sleep(backoff(attempt))
}

Read it slowly. The deadline is shared across attempts, which is what I wanted. But the time.Sleep between attempts is not. It sleeps regardless of how much of the budget is left. And callUpstream had its own per-call dial timeout, set elsewhere, by a different version of me, to four seconds.

what actually happened

So the real sequence, on a slow upstream, was this. Attempt zero dials, takes nearly four seconds, fails. We then sleep for the backoff, say a couple of hundred milliseconds. By now we're past four seconds of a five-second budget. Attempt one starts, the context is nearly dead, the dial fails almost immediately because the parent deadline fires mid-handshake, and we report a timeout that has nothing to do with the upstream being slow and everything to do with us having spent the budget on a single slow attempt plus a pointless sleep.

The retries weren't helping. They were actively converting "one slow upstream call" into "a timeout error", and the sleep in the middle guaranteed there'd never be room for the second attempt to do anything useful. On a healthy day the first call returned in 80ms and none of this mattered, which is exactly why it had survived in production for years.

Code on a screen, mid-debug

The reason it took six hours is that I spent five of them looking in the wrong place. I added logging around the upstream client, convinced its connection pool was exhausted. I read the provider's status page twice. I wrote a small load test to reproduce it and couldn't, because my load test used a fast mock, so it never hit the slow path that broke the assumption. I was debugging the code. The code was fine. I should have been debugging the sentence in my head that said "five seconds total, retries included, that's all sorted".

how I finally saw it

What broke the spell was nothing clever. I gave up trying to reproduce it cleanly and just added a log line recording, on every failed request, how long each individual attempt had taken and how much budget was left when the next one started. Crude. Almost insultingly crude, given the hours I'd spent on theories. But the very first slow request that came through told the whole story in three lines:

attempt=0 took=3.94s remaining=1.06s
sleep=200ms remaining=860ms
attempt=1 took=0.001s err="context deadline exceeded"

There it was. The second attempt didn't fail because the upstream was slow. It failed instantly, because there was nothing left of the deadline for it to spend, and the sleep had eaten a chunk of what little remained. The "timeout" we were reporting was self-inflicted. The upstream had been a bit slow, once, and our own retry machinery had taken that and manufactured an error out of it.

The thing I want to flag here is that the data was always available. I could have added that log line in the first ten minutes. I didn't, because I was sure the problem was somewhere else, and being sure is what stops you from looking at the cheap, obvious evidence sitting right under the failure. Certainty is expensive precisely when it's wrong.

the fix, and the lesson

The fix was small and slightly embarrassing. Derive a per-attempt deadline from the remaining budget, don't sleep if there's nothing left to sleep into, and stop setting a separate dial timeout that fights the parent context.

for attempt := 0; attempt < 3; attempt++ {
    if ctx.Err() != nil {
        break
    }
    resp, err := callUpstream(ctx, req)
    if err == nil {
        return resp, nil
    }
    if d := backoff(attempt); time.Until(deadline) > d {
        time.Sleep(d)
    }
}

Better still would be to let the context carry the whole story and drop the bolted-on timeouts entirely, which is roughly where it ended up after I'd calmed down.

The lesson isn't "be careful with timeouts", though, yes, be careful with timeouts. The lesson is that the most expensive bugs I've ever chased weren't in the code. They were in the things I was so sure about that I never thought to check. The code is honest. It does what you wrote. When it does something baffling, the baffling part is usually the gap between what you wrote and what you believed you wrote, and you can't grep for that.

I've started keeping a running note now, when I'm stuck: not "what is the code doing", but "what am I assuming, and have I actually seen it be true today". It feels a bit daft writing down "I assume retries fit inside the deadline". It would have saved me five hours.