Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the code was fine, i wasn't

A retry mechanism that occasionally double-charged customers, traced not to a bug in the code but to an assumption that a 5xx response meant the operation had not happened.

A whiteboard covered in arrows that no longer make sense

The worst bugs I have shipped were not typos. A typo is honest. It is wrong in a way the compiler or a test or a careful read will catch, and once you see it you cannot unsee it. The bugs that have cost me whole days, and once an embarrassing amount of money, are the ones where every line of code was correct against a belief about the world that was quietly false. The code did exactly what I told it. I had just told it the wrong thing, very precisely.

This one started with a support ticket: a customer charged twice for a single order. Then another. Not many, perhaps one in a few thousand, but every single one a real person looking at their bank statement and reasonably furious. Double charges are the kind of bug that turns into a refund queue, an apology email, and a meeting where someone asks how this got through review. It got through review because the code was, line by line, fine.

the loop that looked correct

The payment flow had a retry. Of course it did. Networks fail, the payment provider occasionally times out, and a transient failure on a charge should not leave a customer unable to buy the thing they wanted. So the logic was the obvious one:

for attempt in range(MAX_RETRIES):
    response = provider.charge(order)
    if response.ok:
        record_success(order, response)
        break
    if attempt == MAX_RETRIES - 1:
        record_failure(order)

Read that on its own and it is reasonable. Try to charge. If it works, record it and stop. If it never works, record a failure. Every reviewer, me included, nodded at it. The assumption hiding inside it, the one nobody wrote down because nobody knew they believed it, was this: if charge() does not return ok, the charge did not happen.

That is false. It was always false. I just had not been forced to confront it yet.

what actually happened on the wire

When the provider timed out, our HTTP client raised, response.ok was false, and we looped and charged again. Usually fine, because usually a timeout meant the request never landed. But sometimes the request did land. The provider received it, took the money, started to send a response, and the connection dropped before the response reached us. From our side that is indistinguishable from a request that never arrived. From the customer's side it is a completed charge. Then we retried, the provider happily took the money a second time, and that one returned ok, so we recorded a single success and went home.

A sequence diagram showing a response lost after the charge succeeded

I spent the first half of the investigation reading the retry code over and over, because that is where the symptom lived and that is where my instinct said the bug must be. It was not there. The retry code was a faithful implementation of a wrong model. A timeout is not a "no". A timeout is "I don't know", and treating "I don't know" as "it didn't happen" is exactly how you charge someone twice.

The provider, to their credit, had given us the tool to do this correctly all along, and I had skimmed past it. Their API accepted an idempotency key: a value you attach to a charge so that if the provider sees the same key twice, it returns the original result instead of charging again. We had been generating a fresh key per attempt. The fix was to generate one key per order, before the loop, and reuse it across every retry:

key = idempotency_key(order)
for attempt in range(MAX_RETRIES):
    response = provider.charge(order, idempotency_key=key)
    if response.ok:
        record_success(order, response)
        break

Now a retry of a charge that already succeeded returns the first charge's result. The money moves once. The "I don't know" gets resolved by the provider, who actually knows, instead of by us guessing.

There is a subtlety worth dwelling on, because it is where I lost the most time. Even with the idempotency key, the loop had a second latent problem: if a charge succeeded but recording the success failed, say the database write timed out after the provider returned ok, we would treat the order as uncharged and the customer would, on a later attempt, be charged once more. The key saved us there too, because the second attempt presented the same key and the provider returned the original charge rather than making a new one. But it taught me to draw the boundary carefully. The idempotency key has to cover the whole "charge this order" intent, start to finish, not just the network call. Scope it too narrowly and you have moved the double-charge one step down the road rather than removed it.

I also added a reconciliation job, the unglamorous safety net. Once a day it walks our orders, asks the provider what it actually charged, and flags any mismatch. It has caught nothing serious since the fix, which is exactly what a good safety net does: sit there earning its keep by being quiet. The point is that idempotency keys prevent the duplicate at the source, and reconciliation catches the one that somehow slips through anyway, and you want both because money is the one place "almost always correct" is not a sentence you get to say.

the part I keep relearning

The code change was four lines. The real change was in my head, and it is the one I keep having to make again in slightly different clothes. When a debugging session is going nowhere, when I have read the same function fifteen times and it is stubbornly, infuriatingly correct, the bug is almost never in the code by then. It is in something I have assumed so deeply that I am not even examining it. That a timeout means failure. That a unique row can only be inserted once. That a clock only moves forwards. That a function I call returns before the next line runs.

The technique that helps is boring and it works: write the assumption down as a flat sentence, then go and disprove it on purpose. Not "is the retry logic right" but "a timeout means the charge did not happen, true or false". Said out loud like that, this one falls over in about ten seconds. The trouble is you have to first notice you are making the assumption, and the whole nature of an assumption is that you don't.

So no clever ending. Just the cost of admission, paid again. The code was fine. I wasn't. And the next time I am sure the bug is in the loop in front of me, I will try to remember to ask what I have quietly decided is true before I have checked.