three days chasing a bug that only existed sometimes

It failed about one run in a thousand. A background job would occasionally process the same record twice, which downstream meant a duplicate charge, which meant the kind of bug you cannot shrug at. The maddening part was that it never failed when I was watching. Every time I ran it by hand it behaved perfectly, which is the calling card of a race condition: the bug lives in the timing, and the act of observing changes the timing.

I lost the first day to denial. I assumed it was a logic error somewhere, some path where the deduplication check was being skipped, and I read the same forty lines of code until they stopped meaning anything. There was no bug in the logic. The logic was correct. It was correct in a way that assumed only one worker would ever look at a given record at a time, and that assumption was the bug.

The second day I gave up on reading and started instrumenting. I added structured logging around the claim-and-process sequence with high-resolution timestamps, and I ran the job under load in a loop overnight. By morning I had three failures captured, and the logs told the story in a way the code never could: two workers had both read the record as "unclaimed" within the same handful of microseconds, both decided it was theirs, and both processed it. Classic check-then-act, with nothing between the check and the act to stop a second worker slipping through.
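
For a sense of what that instrumentation can look like, here is a minimal sketch using only Python's standard library. The helper names log_event and find_double_claims are hypothetical, not the code from the actual job, and time.time_ns() is an assumed choice of clock: it is comparable across processes on one host, but its real resolution depends on the platform.

```python
import json
import logging
import os
import threading
import time

logger = logging.getLogger("claim_trace")


def log_event(event: str, record_id: int, **fields) -> dict:
    """Emit one JSON line per step so interleavings can be reconstructed later."""
    entry = {
        "ts_ns": time.time_ns(),           # wall-clock nanoseconds since the epoch
        "pid": os.getpid(),                # which worker process logged this
        "thread": threading.get_ident(),   # which thread inside that process
        "event": event,
        "record_id": record_id,
        **fields,
    }
    logger.info(json.dumps(entry, sort_keys=True))
    return entry


def find_double_claims(entries: list[dict]) -> dict[int, list[dict]]:
    """Group 'claimed' events by record and keep records claimed more than once."""
    by_record: dict[int, list[dict]] = {}
    for e in entries:
        if e["event"] == "claimed":
            by_record.setdefault(e["record_id"], []).append(e)
    return {
        rid: sorted(events, key=lambda e: e["ts_ns"])
        for rid, events in by_record.items()
        if len(events) > 1
    }
```

Calling log_event("read_unclaimed", record_id) before the check and log_event("claimed", record_id) after the write gives each worker's view of the sequence, and find_double_claims over the parsed lines surfaces any record that two workers both believed they owned.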

The shape of the broken code was depressingly familiar:

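record = db.fetch_unclaimed(record_id)   # check: read the record's current state
if record and not record.claimed:
    # gap here. another worker can run the same fetch before the save below lands.
    record.claimed = True
    db.save(record)                      # act: by now a second worker may have passed the check too
    process(record)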

The window between reading claimed as false and writing it as true is tiny, but tiny is not zero, and across thousands of runs with several workers it eventually got hit. The fix was not clever. It was to make the claim atomic at the database level, so the check and the act happen as one indivisible operation:

UPDATE jobs SET claimed = true
WHERE id = :id AND claimed = false
RETURNING id;

If that returns a row, this worker won the claim and may process. If it returns nothing, someone else got there first and this worker walks away. The database serializes writes to that row, so at most one UPDATE can flip it from false to true, and there is no window for two workers to both believe they won. The whole class of bug evaporates because the decision and the action are now the same statement.
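
From the application side, the claim reduces to "run the UPDATE, then ask whether it changed anything". A minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for whatever database the job actually uses; the try_claim helper and the two-column jobs table are hypothetical, and the sketch checks cursor.rowcount instead of RETURNING so it does not depend on a recent SQLite build.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, job_id: int) -> bool:
    """Return True only if this call flipped claimed from 0 to 1."""
    with conn:  # commits on success, rolls back if the statement raises
        cur = conn.execute(
            "UPDATE jobs SET claimed = 1 WHERE id = ? AND claimed = 0",
            (job_id,),
        )
        # rowcount is the number of rows the UPDATE actually modified
        return cur.rowcount == 1


# Throwaway in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, claimed INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO jobs (id) VALUES (1)")
conn.commit()

first = try_claim(conn, 1)    # True: this caller flipped the flag
second = try_claim(conn, 1)   # False: the WHERE clause no longer matches
```

The caller then processes the record only when try_claim returns True, so there is no separate read whose result can go stale.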

The third day was spent proving it. I ran the same overnight load loop with the fix in place and got zero duplicates across several hundred thousand runs, which is the only kind of evidence I trust for a race: not a reproduction, because I could barely reproduce the failure, but a sustained absence of it under exactly the conditions that used to trigger it.
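
A scaled-down version of that kind of load loop can be sketched as follows. This is not the harness from the actual job: it assumes SQLite in a temporary file, threads standing in for workers, and hypothetical WORKERS and RECORDS values chosen so it finishes quickly. Every worker races to claim every record, and a counter stands in for process(record).

```python
import os
import sqlite3
import tempfile
import threading
from collections import Counter

WORKERS = 8      # hypothetical sizes, small enough for a quick local run
RECORDS = 100


def worker(db_path: str, wins: Counter, lock: threading.Lock) -> None:
    # Each worker gets its own connection; timeout makes it wait on the
    # database write lock instead of failing immediately.
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        for job_id in range(1, RECORDS + 1):
            with conn:  # one transaction per claim attempt
                cur = conn.execute(
                    "UPDATE jobs SET claimed = 1 WHERE id = ? AND claimed = 0",
                    (job_id,),
                )
            if cur.rowcount == 1:
                with lock:
                    wins[job_id] += 1   # stands in for process(record)
    finally:
        conn.close()


def run_once() -> Counter:
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        setup = sqlite3.connect(db_path)
        setup.execute(
            "CREATE TABLE jobs (id INTEGER PRIMARY KEY, claimed INTEGER NOT NULL DEFAULT 0)"
        )
        setup.executemany(
            "INSERT INTO jobs (id) VALUES (?)", [(i,) for i in range(1, RECORDS + 1)]
        )
        setup.commit()
        setup.close()

        wins: Counter = Counter()
        lock = threading.Lock()
        threads = [
            threading.Thread(target=worker, args=(db_path, wins, lock))
            for _ in range(WORKERS)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return wins
    finally:
        os.remove(db_path)


wins = run_once()
# Any record processed more than once is a duplicate charge in miniature.
duplicates = {job_id: n for job_id, n in wins.items() if n > 1}
```

With the atomic claim in place, duplicates should come back empty on every run. Wrapping run_once in a loop and letting it go overnight is the small-scale analogue of the evidence described above.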

The thing I want to remember is that I wasted the first day looking for a wrong answer when the code was giving the right answer to the wrong question. The logic was not broken. The concurrency assumption underneath it was. Whenever I see check-then-act on shared state now, I assume there is a gap until I have proven otherwise, because the gap is always smaller than feels dangerous and always large enough to matter eventually.