Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

three days for one missing lock

A flaky integration test that failed roughly one run in forty turned out to be a missing lock around a shared counter.

A terminal full of a debugging session

The bug was a single failing assertion in an integration test, and it failed about one run in forty. Which is the worst possible frequency. Often enough to block the pipeline now and then, rare enough that I couldn't make it happen on demand. I lost three days to it.

The first day went on disbelief. I re-ran the suite in a loop, watched it pass forty times, then fail, then pass another sixty. I added logging, which of course changed the timing just enough that it stopped failing entirely for a whole afternoon. Heisenbugs love an audience.

The second day I finally accepted it was a race. Two goroutines were updating a shared counter, and one of them did the read-modify-write without holding the lock the other one held. Most of the time the scheduler interleaved them harmlessly. Occasionally it didn't, the counter went backwards, and the assertion downstream tripped. Obvious in hindsight, invisible until you stop looking at the symptom and start looking at the access pattern.

The fix was one line: wrap the increment in the mutex that was already sitting right there, unused on that path. Three days of work to add a defer mu.Unlock(). The real lesson is that I should have reached for the race detector on day one instead of day three. It found it in a single run.