Chasing a race condition for three days

The worst bugs are the ones that disappear when you look at them. This one took three days, and for most of that time I was convinced the problem was somewhere it absolutely was not. It was a race condition, of course it was, but it wore a very good disguise.

The symptom was a service that occasionally returned the wrong data. Not crashed, not errored, just quietly handed a request the response that belonged to a different request. Rare. Maybe one in fifty thousand under production load, and never, ever when I was watching.

Day one: denial

My first assumption was that it wasn't my code. It's always a comforting first assumption and it's almost always wrong. I went looking at the load balancer, the connection pool, the cache, anything with shared state that I didn't write. I added logging around the request path and waited for the bad response to show up in the logs.

It didn't show up. The logs all looked correct. Each request, traced end to end, did the right thing. Which made no sense, because the client was definitely receiving wrong answers. I spent most of the day convinced I had two separate bugs, when in fact the logging itself was the clue I ignored: adding it had changed the timing.

Day two: the heisenbug

The tell, in retrospect, was glaring. When I ran the service under a debugger, the bug never reproduced. When I added enough logging to slow the hot path down, the bug got rarer. Anything that changed the timing made it harder to hit. That is the signature of a race condition, written in neon, and I still spent half a day chasing memory corruption instead.

The breakthrough was load. I stopped trying to catch it gently and instead hammered the service with concurrent requests in a tight loop, with logging stripped back to almost nothing so I wasn't perturbing the timing.

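# crude but effective: 200 concurrent clients, 500 requests each, one ID per client
# if every ID maps to one stable response, each distinct line should show a count of 500
seq 1 200 | xargs -P 200 -I{} \
  sh -c 'for i in $(seq 1 500); do curl -s localhost:8080/lookup/{} ; echo; done' \
  | sort | uniq -c | sort -rn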

With that running, the mismatched responses appeared within seconds. I finally had a reliable reproduction, and a reliable reproduction is ninety per cent of the fight.

Day three: the actual bug

With reproduction in hand the cause fell out fast. There was a shared map used as a per-request scratch cache, and somewhere in a refactor months earlier the lock around it had been quietly dropped on one code path. Under low concurrency the window was so small it effectively never happened. Under real load, two requests would write and read the same map concurrently, one would see the other's entry, and you'd get a response built from someone else's data.

The fix was three lines. It nearly always is.

// before: shared map, no protection on this path
results[req.ID] = lookup(req.Key)

// after: guard the shared state
mu.Lock()
results[req.ID] = lookup(req.Key)
mu.Unlock()

The better fix, which I did afterwards, was to stop sharing the map at all. The scratch cache had no business being shared between requests in the first place; making it request-local removed the need for the lock and removed the entire class of bug. Shared mutable state you don't need is just a race condition you haven't hit yet.
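A minimal sketch of what request-local can look like, assuming a handler shaped roughly like the snippets above. Request, lookup, and handle are hypothetical stand-ins, not the service's real code; the point is only that the map is created inside the call, so no other goroutine can ever hold a reference to it.

```go
package main

import (
	"fmt"
	"sync"
)

// Request and lookup are hypothetical stand-ins for the real service types.
type Request struct {
	ID  int
	Key string
}

func lookup(key string) string { return "value:" + key }

// handle builds a response using a scratch cache that exists only for the
// duration of this call. Nothing outside handle can see the map, so there
// is nothing to lock.
func handle(req Request) string {
	scratch := make(map[string]string) // request-local, never shared
	if _, ok := scratch[req.Key]; !ok {
		scratch[req.Key] = lookup(req.Key)
	}
	return fmt.Sprintf("req %d -> %s", req.ID, scratch[req.Key])
}

func main() {
	const n = 100
	out := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes only its own slice element.
			out[i] = handle(Request{ID: i, Key: fmt.Sprint(i)})
		}(i)
	}
	wg.Wait()
	fmt.Println(out[0])
	fmt.Println(out[n-1])
}
```

If the scratch data has to travel through several functions, passing the map (or a small struct wrapping it) as an argument keeps the same property without reaching for a package-level variable.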

What I should have done on day one

The tooling existed the whole time. Go's race detector would have flagged this in minutes, yet I only ran the test suite under -race on day three, out of desperation, after I'd already found it by hand.

go test -race ./...

It lit up immediately, pointing at the exact line. Three days of work that a flag would have done before lunch on day one.

The lesson I keep relearning: when a bug vanishes under observation, stop adding observation. It's telling you the bug is about timing, and your job is to make it more concurrent, not more visible. Reach for the race detector early, distrust shared state, and don't assume it isn't your code. It's usually your code.