It took three days, and the worst part is that I knew on day one what kind of bug it was. A race condition. They announce themselves: it happens in production and not on my machine, it happens under load and not in a test, and the very act of trying to observe it closely makes it go away. The moment a bug behaves better when you watch it, you are no longer debugging logic, you are debugging timing, and timing does not care about your breakpoints.
The symptom was duplicate records. Occasionally, under real traffic, an operation that should have created exactly one row created two. Not often. Maybe once in a few thousand. Often enough to matter, rare enough to be invisible in any test I could write.
The code looked obviously correct, which is the tell. It checked whether a record existed, and if it didn't, created it. Check, then act.
existing = cache.get(key)
if existing is None:
record = create_record(key)
cache.set(key, record)
Read that again the way two threads read it at once. Request A checks the cache, finds nothing. Before A writes, request B checks the same cache, also finds nothing. Now both of them believe they are the first, and both create. The window between the check and the act is tiny, microseconds, but production runs an enormous number of microseconds per day and eventually two requests for the same key land in exactly that gap. The bug was not in the code I was staring at. It was in the space between two lines of it.
The reason it took three days and not three hours was that I spent the first two trying to reproduce it locally, which is precisely the wrong instinct for this class of bug. A debugger serialises everything. Adding logging slows the hot path just enough to slam the window shut. Every tool I reached for to see the bug was a tool that prevented the bug. I was, in effect, looking for something that only exists when nobody is looking. I finally made progress by stopping trying to watch it and starting to reason about it: I drew the two interleavings on a whiteboard and the gap was suddenly obvious, embarrassingly so.
The fix was not to be cleverer about the check. There is no amount of cleverness in application code that closes a check-then-act window, because the window is fundamental to doing two things that aren't one thing. The fix was to make the database enforce what I actually meant, which was at most one row for this key: a unique constraint, and code that treats the resulting conflict on insert as "fine, someone else got there first" rather than an error.
ALTER TABLE records ADD CONSTRAINT uq_records_key UNIQUE (key);
Now both requests can race all they like. One wins the insert, the other gets a constraint violation, catches it, and reads back the row the winner created. The race still happens. It just no longer has a wrong outcome to reach.
The lesson I relearn every time: a race condition is not a bug you fix by looking harder, because looking harder is what hides it. You fix it by finding the place where the two operations actually conflict and making that place atomic, usually further down the stack than you'd like, usually the database. And you stop trusting any check that is followed, however closely, by a separate act.