Ramblings of an aging IT geek
debugging

three days lost to a bug that only happened when i wasn't looking

A race condition that vanished under every debugger and logging statement, until a missing lock around a lazily-initialised cache finally gave itself away.

Three days. A service that fell over roughly once every few thousand requests with a nil dereference, and never, ever when I was watching it. The classic shape of a race: every diagnostic I added changed the timing enough to make it disappear. Add a log line, the bug sulks off for an hour. Attach a debugger, it goes on holiday entirely.

The thing that finally cracked it was giving up on reproducing it live and reading the code as if I trusted nobody. There was a cache that initialised itself lazily on first use, guarded by a check-then-set with no lock around the pair. Under load, two requests would both see the cache empty, both start building it, and one would hand a half-built map to the other. Single-threaded, impossible. Under concurrency, just rare enough to look like haunting.
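Not the real code, but a minimal sketch of the shape of the problem, assuming the cache was a map of pointers. Settings, lookup and the keys are all invented for illustration.

```go
package main

import "fmt"

// Settings is a hypothetical cached value; what the real cache
// held is not the point here.
type Settings struct{ Timeout int }

// cache stays nil until the first lookup builds it.
var cache map[string]*Settings

// lookup shows the broken pattern: a check followed by a set,
// with no lock around the pair.
func lookup(key string) *Settings {
	if cache == nil { // check
		// set: the map becomes visible to everyone while still empty
		cache = make(map[string]*Settings)
		for _, k := range []string{"api", "db", "queue"} {
			// A second goroutine can arrive mid-loop, see a non-nil
			// cache, and read a key that has not been inserted yet.
			cache[k] = &Settings{Timeout: 30}
		}
	}
	// A missing key returns a nil *Settings, and the caller
	// dereferences it.
	return cache[key]
}

func main() {
	// With one goroutine the window never opens, so this always works.
	fmt.Println(lookup("db").Timeout)
}
```

Called from a single goroutine this is fine every time. Called from many, it is a data race: the window is only as wide as that loop, which is probably why anything that shifts the timing makes it so hard to hit.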

The fix was a sync.Once so the initialisation happened exactly once no matter how many goroutines arrived together. Five lines. Three days.
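Roughly what that looks like, as a sketch rather than the actual patch. Settings, buildCache, lookup and the keys are hypothetical names. The only real API here is sync.Once and its Do method.

```go
package main

import (
	"fmt"
	"sync"
)

// Settings is a hypothetical cached value, invented for this sketch.
type Settings struct{ Timeout int }

var (
	cache     map[string]*Settings
	cacheOnce sync.Once
)

// buildCache fills a local map completely before publishing it.
func buildCache() {
	m := make(map[string]*Settings)
	for _, k := range []string{"api", "db", "queue"} {
		m[k] = &Settings{Timeout: 30}
	}
	cache = m
}

// lookup is safe to call from any number of goroutines.
func lookup(key string) *Settings {
	// Do runs buildCache exactly once. Callers that arrive while it
	// is running block until it has returned.
	cacheOnce.Do(buildCache)
	return cache[key]
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if lookup("queue") == nil {
				panic("saw a half-built cache")
			}
		}()
	}
	wg.Wait()
	fmt.Println("all lookups saw a complete cache")
}
```

Nobody gets past Do until the build has finished, so every caller sees the whole map or waits. After that the map is only ever read, which is safe to do concurrently.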

What I take from it: stop trying to catch an intermittent bug in the act when adding observers changes the timing. Find the shared state, find the window where two things touch it without a lock, and the reproduction stops mattering. I knew that already. I'll know it harder now.