Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

chasing a race condition for three days

A debugging story about an intermittent failure in a Go service that turned out to be an unsynchronised map write hiding behind a flaky test.

A terminal showing a stack trace

It started, as these always do, with a test that failed about one run in fifty. Everyone on the team had seen it. Everyone had hit re-run and watched it go green. "Flaky test" had become a sentence we said to absolve ourselves, and I'd said it as much as anyone. Then it failed in production, in a way that took a service down for four minutes at three in the afternoon, and "flaky" stopped being an acceptable answer.

Three days. I'm going to write down what those three days actually contained, because the tidy version, where the clever engineer spots the bug on page one, is a lie that makes everyone else feel worse at debugging than they are.

day one: reproducing it

The first job was making it fail on demand. An intermittent bug you can't reproduce is just folklore. The test failed roughly 2% of the time on CI, which is useless for investigation: you can't bisect a thing you have to run fifty times to see once.

So I did the unglamorous thing and ran it in a loop with the race detector on:

go test -race -run TestQueueDrain -count=500 ./internal/worker

The -race flag was the whole game, and I should have reached for it on day one of the year, not day one of this. Within about forty iterations it spat out a DATA RACE report. Two goroutines, one writing to a map, one reading it, no lock between them. There it was. I felt the brief, dangerous euphoria of having "found it".

I had not found it. I had found a race. Whether it was the race that took prod down was a separate question, and conflating the two is how you spend day three undoing the confident fix you made on day one.

day two: the wrong fix, and why

The race the detector flagged was in a metrics collector. A background goroutine was incrementing counters in a plain map[string]int64 while the HTTP handler read them for the /metrics endpoint. Real bug. Genuinely needed fixing. I wrapped it in a mutex, the detector went quiet, and the flaky test... still flaked.

Code on a screen with a debugger attached

This is the part nobody tells you. You can fix a real, race-detector-confirmed bug and have it be the wrong bug. The metrics map had nothing to do with the test that was failing. It just happened to be the loudest thing in the room, so the detector pointed at it first. I'd spent half a day patting myself on the back for fixing a problem that wasn't the problem.

The tell was that the failure mode didn't match. The flaky test failed with a nil result where it expected a value, not with corrupted counters. A data race in a counter gives you a wrong number, not a missing one. I'd ignored the symptom because I was so pleased to have a lead. Lesson, underlined: let the failure mode tell you where to look, not the first tool that shouts.

day three: the actual bug

Back to first principles. The test drained a work queue and asserted every job produced a result. Sometimes one result was missing. Not wrong, missing.

The queue used a channel to hand jobs to a pool of workers, and a sync.WaitGroup to know when they were done. The collector gathered results into a slice. I read the worker setup again, slowly, out loud, which is a thing I do when I've stopped trusting my own eyes:

var wg sync.WaitGroup
for i := 0; i < workers; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for job := range jobs {
            results <- process(job)
        }
    }()
}
close(jobs)
wg.Wait()
close(results)

The bug isn't in there, exactly. The bug was in the consumer of results. A separate goroutine ranged over results and appended to a slice. But the main test body started reading the length of that slice the instant wg.Wait() returned, racing against the consumer goroutine that was still draining the last item out of the channel after close(results). The WaitGroup told me the workers were done. It told me nothing about the goroutine reading their output.

So one run in fifty, the scheduler let the assertion run before the final append landed. One missing result. No panic, no corruption, just a slice that was occasionally one element short.

Two terminals side by side comparing output

The fix was four lines: a second WaitGroup, or more honestly, a single channel that signalled the collector had finished. I went with the explicit done-channel because it reads as what it is.

done := make(chan struct{})
go func() {
    for r := range results {
        collected = append(collected, r)
    }
    close(done)
}()
wg.Wait()
close(results)
<-done // wait for the collector, not just the workers

500 iterations under -race, all green. Then 5000, to be sure I wasn't fooling myself again.

what i actually learned

The race detector finds data races. It does not find your bug. It finds a race, and you have to do the work to decide whether it's the one ruining your afternoon. Mine wasn't, the first time.

Synchronising the workers is not the same as synchronising everything that depends on the workers. A WaitGroup over the producers said nothing about the consumer, and that gap is exactly where the missing result lived.

And "flaky test" is almost always a real bug wearing a disguise we find convenient. A test that fails 2% of the time is telling you that 2% of the time your code is wrong. We just don't like hearing it, so we hit re-run. The prod outage was the bill for two months of hitting re-run.

Three days. Worth it, in the end, because the thing is genuinely fixed now rather than statistically hidden. But I'd give a fair bit to have run go test -race -count=500 on the very first morning.