Ramblings of an aging IT geek
debugging

chasing a race condition for three days

A flaky test failed once in every few hundred runs, and the bug was a shared map written from two goroutines without a lock.

A terminal showing a stack trace late at night

The symptom was a test that failed roughly once in every three hundred runs. On my laptop it never failed. In CI it failed just often enough to be infuriating and just rarely enough that everyone had quietly started hitting "re-run" and moving on. That is the worst kind of bug, the one the team has already learned to tolerate.

I gave it three days. Day one was spent not believing it was a real race. I assumed it was a slow CI runner, a timeout that was too tight, some flakiness in the test harness itself. I added retries, I bumped the timeout, and of course it failed again, because the problem was never timing in that sense.

On day two I stopped guessing and reached for the right tool. The service is written in Go, so I ran the offending package under the race detector:

go test -race -run TestFanout -count=2000 ./internal/dispatch

It took ages and it caught it. The detector printed two stacks: one goroutine writing to a map during result aggregation, another goroutine writing to the same map from a callback fired by a different code path. No lock between them. Most of the time the two writes were far enough apart in wall-clock time that nothing collided. Occasionally they weren't, the map's internal state got corrupted, and a downstream read saw a value that should not have existed.
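For illustration, the shape of the bug looks roughly like the sketch below. The names (results, byID, aggregate, onCallback) are hypothetical stand-ins, not the service's real identifiers. As written the program is safe, because main calls both write sites on one goroutine; the race only appears when they run on different goroutines, as noted in the closing comment.

```go
package main

import "fmt"

// results is a hypothetical stand-in for the struct holding the shared map.
// There is no mutex, so nothing orders writes made from different goroutines.
type results struct {
	byID map[string]int
}

func newResults() *results { return &results{byID: make(map[string]int)} }

// aggregate writes to the map during result aggregation.
func (r *results) aggregate(id string, n int) { r.byID[id] += n }

// onCallback writes to the same map from a different code path.
func (r *results) onCallback(id string, n int) { r.byID[id] += n }

func main() {
	r := newResults()

	// Called one after the other on a single goroutine, this is fine.
	r.aggregate("job", 1)
	r.onCallback("job", 2)
	fmt.Println(r.byID["job"]) // 3

	// The bug is the same two calls made from different goroutines, e.g.
	//   go r.aggregate("job", 1)
	//   go r.onCallback("job", 2)
	// That pair of unsynchronized map writes is what -race reports.
}
```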

A close-up of code with a highlighted line

The fix was four lines: a sync.Mutex on the struct, and Lock() and Unlock() around both write sites. The map is small and contention is negligible, so a plain mutex was the right call rather than anything clever with channels or sync.Map.
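A minimal sketch of that kind of fix, with made-up names (results, byID, aggregate, onCallback, get and hammer are all hypothetical, not the real code). The get method is an extra assumption: the fix described above only touched the two write sites, but a read that can overlap a write would need the same lock.

```go
package main

import (
	"fmt"
	"sync"
)

// results is a hypothetical stand-in for the struct that owns the shared map.
type results struct {
	mu   sync.Mutex // guards byID
	byID map[string]int
}

func newResults() *results {
	return &results{byID: make(map[string]int)}
}

// aggregate is the first write site (result aggregation).
func (r *results) aggregate(id string, n int) {
	r.mu.Lock()
	r.byID[id] += n
	r.mu.Unlock()
}

// onCallback is the second write site (the callback from another code path).
func (r *results) onCallback(id string, n int) {
	r.mu.Lock()
	r.byID[id] += n
	r.mu.Unlock()
}

// get reads under the same lock, so a reader never sees the map mid-write.
func (r *results) get(id string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.byID[id]
}

// hammer calls both write sites from separate goroutines, n times each,
// and waits for all of them to finish.
func hammer(r *results, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); r.aggregate("job", 1) }()
		go func() { defer wg.Done(); r.onCallback("job", 1) }()
	}
	wg.Wait()
}

func main() {
	r := newResults()
	hammer(r, 1000)
	fmt.Println(r.get("job")) // 2000
}
```

Running something like this with the -race flag is a cheap way to check that both write sites really are covered; take the lock away from either method and the detector should report the pair of writes.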

What I want to remember from this is the order I should have done things in. The race detector exists precisely for this. I should have reached for it on day one instead of day two, before I'd convinced myself it was infrastructure. The "it only fails in CI" framing sent me chasing the environment when the bug was sitting in my own aggregation code the whole time. CI was not flaky. CI was simply running the test enough times to find the bug I'd written. The lesson, again, is that flaky almost always means racy, and the tool that proves it is one flag away.