The symptom was the worst kind: a test that passed locally every single time, then failed in CI roughly once every fifty runs. Not often enough to block a merge, just often enough to poison your trust in the whole suite. Somebody re-ran the job, it went green, and we all carried on. That worked right up until it didn't, and the same intermittent failure started showing up in production logs.
The first day I spent doing the thing you always do and shouldn't: staring. I read the failing test, read the code under it, read it again, and convinced myself it was correct. It mostly was. That's the trap with a race condition. The code is right almost all of the time, which is precisely why it's so hard to see the moment it isn't.
Making it fail on purpose
I got nowhere until I stopped trying to read the bug and started trying to provoke it. Two things broke it open.
First, I stopped running the test once and started running it a thousand times in a row, in parallel, on a machine with more cores than my laptop:
go test -run TestReconcile -count=1000 -race -parallel 8 ./internal/reconciler/
The -race flag is the part that mattered. The detector only reports races it observes at runtime, so on a single happy run it says nothing. Under repetition, with the detector armed, it finally caught the access that every green run had been quietly getting away with.
The report pointed at a shared map[string]state that I'd convinced myself only the reconcile loop touched. It wasn't the only one. A metrics goroutine I'd added months earlier, in a different PR, for an entirely unrelated reason, was reading that map to count entries. One reader, one writer, no lock, and a comment three lines up that cheerfully said "single-threaded, no locking needed". It had been true when I wrote it.
The fix and the lesson
The fix was dull, which is how you know you've found the real cause. A sync.RWMutex around the map, a read lock in the metrics path, a write lock in the reconciler. The thousand-run loop went green and stayed green. I left it churning overnight just to be sure, and it was still green in the morning.
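For anyone who wants the shape of that fix spelled out, here is a minimal sketch. It is not the real reconciler code: the names stateStore, Set and Len and the fields of state are made up for illustration, and the two goroutines in main only stand in for the reconcile loop and the metrics path.

```go
package main

import (
	"fmt"
	"sync"
)

// state is a hypothetical stand-in for whatever the reconciler tracks per key.
type state struct {
	Ready bool
}

// stateStore (hypothetical name) wraps the shared map so that every access
// has to go through the lock.
type stateStore struct {
	mu sync.RWMutex
	m  map[string]state
}

func newStateStore() *stateStore {
	return &stateStore{m: make(map[string]state)}
}

// Set plays the reconciler's write path and takes the exclusive lock.
func (s *stateStore) Set(key string, st state) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = st
}

// Len plays the metrics read path and takes the shared lock, so concurrent
// readers do not block each other.
func (s *stateStore) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.m)
}

func main() {
	store := newStateStore()
	var wg sync.WaitGroup

	// Writer: stands in for the reconcile loop.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			store.Set(fmt.Sprintf("key-%d", i), state{Ready: true})
		}
	}()

	// Reader: stands in for the metrics goroutine counting entries.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = store.Len()
		}
	}()

	wg.Wait()
	fmt.Println(store.Len()) // prints 1000
}
```

Hiding the map behind a small type is a choice of the sketch rather than something the fix requires. Its one advantage is that the next goroutine someone adds cannot reach the map without taking the lock.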
The lesson wasn't "use a mutex". I already knew that. It was that a comment describing an invariant is a liability the moment someone adds code that breaks it, and nothing checks. The -race detector does check, but only if you actually exercise the contended path. We now run the race detector on a nightly job that loops the integration suite a few hundred times, precisely so the next one-in-fifty failure shows up on a schedule we choose, rather than at three in the afternoon on a release day.
Three days. For a missing lock. I'd like to say I'm above being annoyed by that, but I'm not.