three days for a bug that lived in a missing word

A terminal showing an intermittent test failure

The test failed about one run in thirty. Locally it passed every time, which is the universal opening line of a bad week. On CI it failed often enough to be embarrassing and rarely enough that "just rerun it" had quietly become policy. I hate that policy. A test you rerun until it's green is not a test, it's a coin you keep flipping until you like the answer.

So I sat down to actually kill it. Day one was spent disbelieving the problem: adding logging, watching it pass two hundred times in a row, convincing myself it was a CI fluke, then watching CI fail again with the smug timing of a thing that knows you are watching. Day two I built a loop to run the single test five thousand times under the race detector, because if it's intermittent and it touches concurrency, you reach for the detector first.

go test -run TestSnapshotMerge -race -count=5000 ./internal/state

That found it in about forty runs. Two goroutines, one map, and no lock between them. A background refresh wrote into a shared map of cached entries while a request handler read from it, and most of the time the read and the write didn't overlap. Most of the time. The detector doesn't care about most of the time, which is exactly why it's worth the slowdown.

The fix was a word: a mutex round the two accesses, or honestly a sync.RWMutex since reads dominate. Three days of work, one Lock(), one Unlock(), and a comment explaining why so the next person doesn't "tidy it away". The humbling part is that the race had been there for over a year, sitting just under the threshold of how often anyone ran that test. It wasn't new. I had simply finally run it enough times to make the rare thing common.
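
For the curious, here is roughly the shape of that fix. This is a minimal sketch, not the real code: entryCache, Get, Set and the "snapshot" key are names I made up for illustration, and the actual map held more than strings. The only thing it is meant to show is the read path taking the shared lock and the refresh path taking the exclusive one.

```go
package main

import (
	"fmt"
	"sync"
)

// entryCache is a hypothetical stand-in for the shared map of cached entries.
type entryCache struct {
	// mu guards entries. The background refresh writes to the map while
	// request handlers read from it on other goroutines. Do not remove.
	mu      sync.RWMutex
	entries map[string]string
}

func newEntryCache() *entryCache {
	return &entryCache{entries: make(map[string]string)}
}

// Get is the request-handler side. It takes the shared read lock, so
// concurrent readers do not block each other.
func (c *entryCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.entries[key]
	return v, ok
}

// Set is the background-refresh side. It takes the exclusive write lock.
func (c *entryCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = value
}

func main() {
	c := newEntryCache()

	var wg sync.WaitGroup
	wg.Add(2)

	// Writer: stands in for the background refresh.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			c.Set("snapshot", fmt.Sprint(i))
		}
	}()

	// Reader: stands in for a request handler.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			c.Get("snapshot")
		}
	}()

	wg.Wait()
	v, _ := c.Get("snapshot")
	fmt.Println(v) // prints 999, the last value the writer stored
}
```

Run it with go run -race and the detector should stay quiet. Delete the four lock lines and it should start complaining, which is the whole point.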