Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

three days lost to a gap between two lines

A flaky test that failed once in fifty turned out to be a check-then-act race in cache initialisation, and the fix was a single lock.

A terminal with a stack trace and a bug

It started as one flaky test. Pass, pass, pass, pass, fail, pass. Roughly one run in fifty, and never the same assertion. That's the signature of a race condition, and races are where good days go to die.

The first day went on disbelief. I assumed it was the test being badly written: a missing await, a shared fixture, the usual suspects. I rewrote the test, isolated it, ran it ten thousand times in a loop overnight. It still failed, just rarely, which killed my comfortable theory and left me with the uncomfortable one. The bug was real and it was in the code under test.

The second day went on instrumentation. You cannot reason a race into the open; you have to widen the window until it's obvious. I added timestamped logging around the suspect path and, crucially, dropped a few artificial delays in to make the timing worse on purpose. That's the trick people skip. If a race needs two threads to interleave in a one-microsecond window, you'll never see it. Slow one side down with a sleep and suddenly it fails every time.

Lines of source code on screen

With the window forced open, the shape appeared. Two requests arriving at once would both find the cache uninitialised, both decide they were the one to build it, and both go off and build it. Classic check-then-act: the check and the act weren't atomic, so two threads could pass the check before either did the act. Most of the time one beat the other comfortably and you never noticed. Occasionally they tied, and the second build clobbered something the first had already handed out, and a downstream assertion fell over with a value that made no sense.

The third day was the easy bit, which is always how it goes. Once you can see a race, fixing it is usually a few lines. Here it was a single lock around the initialise-if-empty section so that only one thread could be in the check-and-build at a time, with the second thread waiting and then finding the work already done. Double-checked locking, done properly this time. The artificial delays now proved the fix instead of the bug: with the sleeps still in and failing reliably before, the locked version ran clean every time.

Then I took the sleeps back out, ran the original flaky test a hundred thousand times in CI, and watched it stay green. Three days for what amounts to one with lock:. The maddening part of races is exactly that ratio. The fix is trivial; finding out which two lines needed it is the whole job.