Three days hunting a race condition that only existed under load

A terminal full of log output with a bug lurking somewhere in it

The bug report was four words: "orders processed twice sometimes." Sometimes. It's the worst word in the English language when it's attached to a defect, because it means there's a pattern and you don't know it yet. It's also precisely the kind of thing that doesn't reproduce on your laptop, in staging, or anywhere you can attach a debugger and feel in control. It only happened in production, under real traffic, perhaps a few times a day out of tens of thousands of orders. Just frequent enough to matter, just rare enough to be untraceable. I lost three days to it. Here's the whole sorry trail, because the shape of it is more useful than the fix.

Day one: trying to make it happen

The first instinct with an intermittent bug is always wrong, and mine was no exception: I tried to reproduce it by hammering the thing locally. I wrote a little script to fire a few hundred concurrent order submissions at a local instance and waited for the duplicates. Nothing. Ran it again with more concurrency. Still nothing. The local box was too fast, too clean, too unloaded, and the conditions that triggered the bug simply weren't present.

What I should have done on day one, and what I eventually did on day two, was stop trying to reproduce it and start trying to observe it. But on day one I was still convinced I could trap it in a test. I couldn't. The most useful thing that came out of the day was a slightly better understanding of the order flow, which on paper looked perfectly safe. A request comes in, we check whether we've already seen this order ID, and if not we process it and record that we've seen it. Check-then-act. There it is, in hindsight, the classic shape of every race condition ever written. But the check and the act were both hitting the database, and the database had a unique constraint, so surely it was safe. Surely.
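For anyone who wants to see that shape rather than take my word for it, here is a minimal sketch. It is not our order code: `run_race`, the `seen` set, and the `side_effects` list are stand-ins I've made up for illustration, and the barrier exists only to force the unlucky interleaving that production load produces on its own.

```python
import threading


def run_race(order_id="order-123"):
    """Run two workers through check-then-act and return what happened."""
    seen = set()         # stands in for the table with the unique constraint
    side_effects = []    # stands in for emails, charges, published events
    # The barrier forces the unlucky interleaving: both workers finish the
    # check before either one acts.
    both_checked = threading.Barrier(2)

    def handle_order(worker):
        already_seen = order_id in seen            # check
        both_checked.wait(timeout=5)
        if already_seen:
            return
        side_effects.append((worker, order_id))    # act: cannot be undone
        seen.add(order_id)                         # record: deduplicated, like the row

    workers = [
        threading.Thread(target=handle_order, args=(name,))
        for name in ("worker-a", "worker-b")
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return side_effects, seen
```

Calling `run_race()` leaves one entry in `seen` and two entries in `side_effects`. The record is protected; the actions are not.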

A wall of code on screen, the relevant lines not yet obvious

Day two: stop guessing, start logging

Day two I gave up on reproduction and instrumented the live path. Not a debugger (you can't attach one to production traffic), but structured logging at every decision point: order ID, the result of the "have we seen this" check, a timestamp to the microsecond, and crucially the thread or worker ID handling each request. Then I waited for the duplicates to happen on their own and went looking for them in the logs afterwards.
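A rough sketch of what one of those log points can look like, using only the standard library. The logger name, the field names, and the `decision_record` and `log_decision` helpers are assumptions for illustration, not the names from our codebase.

```python
import json
import logging
import os
import threading
from datetime import datetime, timezone

logger = logging.getLogger("orders.decisions")  # hypothetical logger name


def decision_record(order_id, decision, **context):
    """Build one structured record for a decision point in the order path."""
    return {
        # Microsecond precision, so near-simultaneous requests can be ordered.
        "ts": datetime.now(timezone.utc).isoformat(timespec="microseconds"),
        "order_id": order_id,
        "decision": decision,               # e.g. "seen_check"
        "pid": os.getpid(),                 # which worker process
        "thread": threading.get_ident(),    # which thread inside it
        **context,                          # e.g. already_seen=False
    }


def log_decision(order_id, decision, **context):
    """Emit the record as a single JSON line so it can be grepped and parsed."""
    record = decision_record(order_id, decision, **context)
    logger.info(json.dumps(record, sort_keys=True))
    return record
```

A call such as `log_decision("order-123", "seen_check", already_seen=False)` at the check is enough to reconstruct afterwards who asked, when, and what they were told.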

When the next double-processing landed, the logs told a story I should have predicted. Two requests for the same order ID, arriving roughly forty milliseconds apart, handled by two different workers. Both ran the "have we seen this order" check. Both got back "no." Both proceeded to process. The unique constraint in the database did eventually fire and reject the second insert, but by then the side effects had already happened twice: an email sent, an external payment-capture call made, a downstream event published. The constraint protected the row. It did absolutely nothing for everything we'd already done before we tried to write the row.

Two log lines for the same order ID, milliseconds apart, on different workers

That's the thing about check-then-act that bites you. The window between the check and the act is small, but under load with retries and a client that resubmits on a slow response, "small" is plenty. The forty-millisecond gap was the retry: the client had timed out waiting for the first request, given up, and fired a second. Two requests, same order, in flight at once, each blind to the other.
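Once the logs are structured, finding these pairs is a small grouping job. The sketch below assumes each record carries `order_id`, `decision`, `already_seen`, and `worker` fields; those names and the sample records are made up to show the shape of the input, not copied from real logs.

```python
from collections import defaultdict


def find_racing_orders(records):
    """Return order IDs where more than one worker was told "not seen"."""
    workers_told_no = defaultdict(set)
    for rec in records:
        if rec["decision"] == "seen_check" and not rec["already_seen"]:
            workers_told_no[rec["order_id"]].add(rec["worker"])
    return sorted(
        order_id
        for order_id, workers in workers_told_no.items()
        if len(workers) > 1
    )


# Made-up sample records, only to show the shape of the input.
SAMPLE = [
    {"order_id": "order-123", "decision": "seen_check", "already_seen": False, "worker": "w1"},
    {"order_id": "order-123", "decision": "seen_check", "already_seen": False, "worker": "w2"},
    {"order_id": "order-456", "decision": "seen_check", "already_seen": False, "worker": "w1"},
    {"order_id": "order-456", "decision": "seen_check", "already_seen": True, "worker": "w3"},
]
```

In the sample, only `order-123` is flagged: two workers were both told "no". `order-456` is the healthy case, where the second worker was correctly told the order had been seen.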

Day three: why was it only sometimes?

So I had the mechanism. But the question that still nagged was the "sometimes." Retries happen all the time. Why did only a handful turn into duplicates? This is where it got genuinely interesting, and where I learned something I'll carry for years.

The slow responses that triggered the client retries weren't random. They clustered. And when I correlated the duplicate events against our outbound HTTP metrics, they lined up with moments when our connection pool to the payment provider was exhausted. We shared a single HTTP client across the whole service, with a connection pool capped lower than it should have been. When traffic spiked, requests queued waiting for a free connection. That queuing made those particular requests slow. Slow requests triggered client-side retries. Retries created the concurrent pair. The pair raced the check. The race produced the duplicate.
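The first two links of that chain are plain arithmetic, and a toy model makes them visible. This sketch assumes a burst of requests arriving at the same instant and every outbound call holding its connection for the same fixed time; the function names and all the numbers are illustrative, not our real pool size or timeouts.

```python
def burst_latencies_ms(n_requests, pool_size, hold_ms):
    """Latency of each request in a burst that arrives all at once.

    Request i waits for i // pool_size earlier "rounds" to release a
    connection, then holds its own connection for hold_ms.
    """
    return [(i // pool_size + 1) * hold_ms for i in range(n_requests)]


def retried(latencies_ms, client_timeout_ms):
    """Indices of requests slow enough that the client gives up and resubmits."""
    return [i for i, ms in enumerate(latencies_ms) if ms > client_timeout_ms]
```

With six requests, a pool of two, and a 100 ms hold, `burst_latencies_ms(6, 2, 100)` gives `[100, 100, 200, 200, 300, 300]`. Against a 250 ms client timeout, the last two requests get retried. With a pool of six, every request finishes in 100 ms and nothing is retried at all.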

Every layer was individually defensible. The connection pool limit was there for a sensible reason. The client retry was there for a sensible reason. The unique constraint was there for a sensible reason. None of them was wrong on its own. They only became a bug when they lined up, which is why it was "sometimes": it needed a load spike, an exhausted pool, a retry, and a race, all inside the same short window. Take any one away and nothing happens.

The fix, and the better fix

The immediate fix was to make the operation idempotent at the point that mattered: not at the database row but at the side effects. We took a short-lived distributed lock keyed on the order ID before doing any of the irreversible work, so the second request blocks until the first is done, re-runs the "have we seen this" check while holding the lock, sees that the first has completed, and bows out cleanly. Belt and braces, we also made the downstream calls idempotent where the providers supported an idempotency key, which the payment provider thankfully did.
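The shape of that fix, sketched under heavy assumptions: a per-order `threading.Lock` stands in for the distributed lock (a real one lives outside the process and needs an expiry), and `FakePaymentProvider`, `OrderProcessor`, and `submit_twice` are hypothetical names, not our service or the provider's API. The part that matters is that the check happens again inside the lock, and that the capture call carries an idempotency key.

```python
import threading


class FakePaymentProvider:
    """Hypothetical provider that dedupes captures by idempotency key."""

    def __init__(self):
        self._lock = threading.Lock()
        self.captures = {}

    def capture(self, order_id, idempotency_key):
        with self._lock:
            # A repeated key returns the original result instead of charging again.
            return self.captures.setdefault(idempotency_key, f"charge-for-{order_id}")


class OrderProcessor:
    def __init__(self, provider):
        self._provider = provider
        self._guard = threading.Lock()   # protects the lock table itself
        self._locks = {}                 # in-process stand-in for a distributed lock
        self._completed = set()
        self.emails_sent = []

    def _lock_for(self, order_id):
        with self._guard:
            return self._locks.setdefault(order_id, threading.Lock())

    def handle(self, order_id):
        """Return True if this call did the work, False if it bowed out."""
        with self._lock_for(order_id):
            if order_id in self._completed:   # re-check while holding the lock
                return False
            self._provider.capture(order_id, idempotency_key=order_id)
            self.emails_sent.append(order_id)
            self._completed.add(order_id)
            return True


def submit_twice(order_id="order-123"):
    """Simulate the original request and its retry arriving together."""
    provider = FakePaymentProvider()
    processor = OrderProcessor(provider)
    results = []
    start = threading.Barrier(2)

    def worker():
        start.wait(timeout=5)
        results.append(processor.handle(order_id))

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results), processor.emails_sent, provider.captures
```

Running `submit_twice()` ends with one request doing the work and the other returning `False`, one email, and one capture. The lock table in this sketch grows without bound, which is one of several reasons it is a teaching aid and not something to deploy.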

The better fix, the structural one, was to stop pretending check-then-act against a database is safe under concurrency just because there's a constraint somewhere. The constraint guards your data. It does not guard your actions. If an action has side effects you can't take back (an email, a charge, an event), the protection has to wrap the action, not the write that happens to come after it.

I also bumped the connection pool and added an alert on pool saturation, because the saturation was the upstream cause of the whole chain and we'd been completely blind to it. Funny how a bug three layers deep in business logic turned out to be, at root, a pool that was two sizes too small.
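The saturation signal itself does not need to be clever. A minimal sketch, assuming you can hook the points where a connection is checked out and returned: `PoolGauge` and the 0.9 threshold are made up for illustration, and a real alert would more likely fire on saturation sustained over some interval than on a single reading.

```python
import threading


class PoolGauge:
    """Tracks how many connections of a fixed-size pool are checked out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._in_use = 0
        self._lock = threading.Lock()

    def acquired(self):
        """Call when a connection is checked out of the pool."""
        with self._lock:
            self._in_use += 1

    def released(self):
        """Call when a connection is returned to the pool."""
        with self._lock:
            self._in_use = max(0, self._in_use - 1)

    def saturation(self):
        """Fraction of the pool currently in use, from 0.0 to 1.0."""
        with self._lock:
            return self._in_use / self.capacity

    def should_alert(self, threshold=0.9):
        """True when the in-use fraction is at or above the threshold."""
        return self.saturation() >= threshold
```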

What I'd tell myself on day one

Stop trying to reproduce intermittent concurrency bugs locally. Your laptop is too fast and too quiet to surface them, and every hour spent trying is an hour not spent observing the place the bug actually lives. Instrument the real path, log enough context to reconstruct who did what and when, and then be patient. The bug will happen again on its own, and when it does, the logs will hand you the answer that no local test ever could.

And write down the timeline of an incident like this while it's fresh, because three months from now someone will hit "orders processed twice sometimes" again, and the kindest thing you can do for them is a clear account of the last time it had nothing to do with the database constraint they're about to stare at for two days.