The Bug Was in My Assumptions, Not the Code

The report was that a handful of customers had been charged twice. Not many, maybe one in a few thousand, which is somehow worse than all of them, because it means the bug is conditional and you have to find the condition. The code in question had a retry loop around a payment call, and retry loops around payments are exactly the kind of thing that double-charges people, so I went in confident I knew the shape of the answer.

I was wrong about the shape for most of a day. I spent hours convinced the bug was that the retry fired when it shouldn't, that we were retrying on a success we'd misread as a timeout. I added logging around the response parsing, I stared at the timeout configuration, I wrote a test that hammered the path with flaky responses and could not get it to double anything. The code did exactly what I'd have predicted. Retry on timeout, don't retry on a clear success, the obvious correct thing.

The assumption I never questioned was the one underneath all of it: that the downstream call was idempotent. We passed an idempotency key with every request, so a retry of the same payment would be deduplicated by the provider and charged once. That was the whole reason the retry loop was considered safe. I'd read that line of code a dozen times and nodded at it every time.

It was not safe. The key was being generated inside the retry loop instead of outside it. Each attempt got a fresh idempotency key, so from the provider's point of view the two attempts weren't the same payment at all. They were two different payments that happened to be for the same amount. The dedup never had a chance. The safety mechanism we were relying on was being defeated by where one variable was declared.
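Here is a minimal sketch of the shape of that bug, not the real code. `FakeProvider`, `pay_buggy`, and the simulated lost response are all invented for illustration, and I'm assuming the failure mode was a charge that succeeded at the provider while the response timed out on our side.

```python
import uuid


class FakeProvider:
    """Hypothetical stand-in for a provider that dedups by idempotency key."""

    def __init__(self, lost_responses=1):
        self.charges = {}  # idempotency key -> amount actually charged
        self._lost_responses = lost_responses

    def charge(self, amount, idempotency_key):
        # A repeated key is a no-op; a new key is a new charge.
        self.charges.setdefault(idempotency_key, amount)
        if self._lost_responses > 0:
            # The charge went through, but the caller never hears back.
            self._lost_responses -= 1
            raise TimeoutError("response lost")
        return "ok"


def pay_buggy(provider, amount, max_attempts=3):
    for _ in range(max_attempts):
        key = str(uuid.uuid4())  # BUG: a fresh key on every attempt
        try:
            return provider.charge(amount, idempotency_key=key)
        except TimeoutError:
            continue
    raise RuntimeError("payment failed after retries")


provider = FakeProvider()
result = pay_buggy(provider, 500)
print(len(provider.charges))  # 2: one logical payment, two charges
```

The retry logic itself is fine here, which is roughly why staring at it got me nowhere. The only thing wrong is which side of the `for` the key sits on.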

The fix was three lines: hoist the key generation above the loop so every attempt of the same logical payment carries the same key. The provider's idempotency then did exactly what we'd always assumed it was doing. The double charges stopped.
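As a sketch of what hoisting the key looks like, under the same assumptions as before: `FakeProvider` and `pay_fixed` are hypothetical names, and the fake provider only imitates dedup-by-key, not any real provider's API.

```python
import uuid


class FakeProvider:
    """Hypothetical stand-in for a provider that dedups by idempotency key."""

    def __init__(self, lost_responses=1):
        self.charges = {}  # idempotency key -> amount actually charged
        self._lost_responses = lost_responses

    def charge(self, amount, idempotency_key):
        # A repeated key is a no-op; a new key is a new charge.
        self.charges.setdefault(idempotency_key, amount)
        if self._lost_responses > 0:
            # The charge went through, but the caller never hears back.
            self._lost_responses -= 1
            raise TimeoutError("response lost")
        return "ok"


def pay_fixed(provider, amount, max_attempts=3):
    # One key per logical payment, generated once, shared by every attempt.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return provider.charge(amount, idempotency_key=key)
        except TimeoutError:
            continue
    raise RuntimeError("payment failed after retries")


provider = FakeProvider()
result = pay_fixed(provider, 500)
print(len(provider.charges))  # 1: the retry was deduplicated
```

In a real system the key would more likely be derived from or stored with the payment record, so it also survives a process restart between attempts. That goes beyond what the three-line fix did, and I mention it only as a design note.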

What stuck with me is that I never debugged the actual bug, because I never doubted the actual assumption. I'd treated "this call is idempotent" as a fact about the world rather than a property the code had to maintain, and so it lived in a blind spot. Every test I wrote tested the part I suspected. None of them tested the part I trusted. The faster path through that whole day would have been to write down, on purpose, the things I was assuming to be true and then go verify each one, starting with the one I was most sure of. The bug is rarely where you're looking. It's usually sitting underneath the thing you didn't think was worth checking.
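For what it's worth, the test I never wrote is short. This is a sketch, assuming a payment function that takes the provider call as a parameter. `pay` and `keys_sent_during_retries` are hypothetical names, and the provider is replaced with a `unittest.mock.Mock` that times out twice before succeeding.

```python
import uuid
from unittest.mock import Mock


def pay(charge, amount, max_attempts=3):
    # Hypothetical payment function under test; the key is created once.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return charge(amount, idempotency_key=key)
        except TimeoutError:
            continue
    raise RuntimeError("payment failed after retries")


def keys_sent_during_retries():
    # Two timeouts, then a success: forces three attempts at one payment.
    charge = Mock(side_effect=[TimeoutError(), TimeoutError(), "ok"])
    result = pay(charge, 500)
    keys = [call.kwargs["idempotency_key"] for call in charge.call_args_list]
    return result, keys


result, keys = keys_sent_during_retries()
assert result == "ok"
assert len(keys) == 3       # the retry path really ran
assert len(set(keys)) == 1  # the trusted assumption, now checked
```

The last assertion is the one that tests the part I trusted. Move the key generation back inside the loop and it fails immediately.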