Three days. That is what a single race condition cost me this week, and the worst part is the fix was four lines once I finally understood it. The symptom was a job that occasionally processed the same record twice, perhaps one run in two hundred, with no pattern I could pin to load, time of day, or anything else respectable.
The cruelty of a race is that observing it changes it. Every time I added a logging line, the bug went away, because the cost of formatting and writing that line shifted the timing just enough to close the window. Under a debugger it never reproduced once, naturally, because single-stepping is the ultimate timing change. This is the classic heisenbug, and it teaches you a hard lesson: you cannot debug a race with tools that perturb its timing. You have to reason about it.
So I stopped poking and started reading. Two workers pulled from the same queue. The "is this already claimed" check and the "claim it" write were two separate statements with a gap between them, and under the right interleaving both workers passed the check before either did the write. A textbook check-then-act, the oldest race in the book, and I had walked past it a dozen times because in my head it was obviously atomic. It was obviously not.
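For the record, here is roughly what that interleaving looks like as a minimal Python sketch. It is not the real job: the in-memory `status` dict, the `racy_worker` function, and the record id are all made up for illustration, and the `threading.Barrier` is there only to force the unlucky ordering every time instead of one run in two hundred.

```python
import threading

# Hypothetical in-memory stand-in for the job table: record id -> status.
status = {"rec-1": "pending"}
processed = []                   # names of workers that went on to process the record
barrier = threading.Barrier(2)   # illustration only: forces the bad interleaving


def racy_worker(name, record_id):
    # Step 1: the "is this already claimed" check.
    if status[record_id] == "pending":
        # The gap between check and write. Each worker waits here until the
        # other one has also passed the check, which is exactly the bad case.
        barrier.wait()
        # Step 2: the "claim it" write, too late to stop the other worker.
        status[record_id] = "claimed"
        processed.append(name)


threads = [
    threading.Thread(target=racy_worker, args=(f"worker-{i}", "rec-1"))
    for i in (1, 2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(processed))  # both workers appear: the record was processed twice
```

Without the barrier the same code usually behaves, which is the whole problem: the window is only as wide as the gap between the two statements.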
The fix was an atomic claim: a single UPDATE ... WHERE status = 'pending' whose affected row count tells each worker whether it won. Exactly one worker sees one row; the loser sees zero and moves on. No separate check at all. The check was the bug. Three days to delete a line of code, which is roughly the going rate for these things, and I have made my peace with it.
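A sketch of the shape of that claim, under assumptions: I am using Python's built-in `sqlite3` purely because it runs anywhere, and the `jobs` table, its `owner` column, and the `claim` helper are hypothetical names, not the real schema. The single connection here only demonstrates the row-count contract; in a real deployment the guarantee comes from the database applying the check and the write as one statement.

```python
import sqlite3


def claim(conn, record_id, worker):
    """Try to claim a record. Returns True only for the worker that wins."""
    # One statement: the WHERE clause is the check, the SET is the claim.
    cur = conn.execute(
        "UPDATE jobs SET status = 'claimed', owner = ? "
        "WHERE id = ? AND status = 'pending'",
        (worker, record_id),
    )
    conn.commit()
    # rowcount is the number of rows the UPDATE actually changed.
    return cur.rowcount == 1


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL, owner TEXT)"
)
conn.execute("INSERT INTO jobs (id, status) VALUES (1, 'pending')")
conn.commit()

first = claim(conn, 1, "worker-1")   # row is pending, so this claim succeeds
second = claim(conn, 1, "worker-2")  # row is already claimed, zero rows affected
owner = conn.execute("SELECT owner FROM jobs WHERE id = 1").fetchone()[0]
print(first, second, owner)
```

The caller never asks "is it claimed?" as a separate question. It attempts the claim and lets the row count answer.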