Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the bug that only existed when nobody was watching

A race condition in a worker that crashed reliably in production but went perfectly behaved the moment strace was attached, and how the slowdown itself was the clue.

A tangle of terminal output with a single bug highlighted

The worst bugs are the ones that change their behaviour when you look at them, and this one was a textbook example. A background worker, written in Python, would crash about once an hour under load with a KeyError deep inside a cache layer. Not often enough to be obvious, often enough to page someone overnight. The stack trace pointed at a dictionary lookup that, by every reading of the code, should never have been able to miss.

So I did the obvious thing. I attached strace to a live worker and waited for it to fall over.

It didn't. It ran for hours, perfectly happy, then I detached and it crashed within ten minutes. I tried again, same result. The instant I was watching the syscalls, the bug ceased to exist. That is the canonical shape of a heisenbug: the act of observation perturbs the timing, and the timing is the bug.

My first instinct, like everyone's, was to distrust the tool. Maybe strace was buffering something, maybe attaching changed the signal handling, maybe I'd grabbed the wrong PID. I checked all of that. The PID was right, the worker was doing real work under the trace, the syscalls scrolling past were exactly the ones I expected. The tool was behaving perfectly. It just happened to be making the bug impossible to reproduce, and that is a very different thing from the tool being broken.

the thing about slowing a program down

strace is not free. Every syscall the process makes gets intercepted, the tracer is woken, the arguments are decoded, and only then does the call proceed. For a process doing a lot of small I/O, that is a brutal tax, easily a 10–100x slowdown on the syscall-heavy paths. Normally that tax is an annoyance. Here it was the entire diagnosis, I just hadn't realised it yet.

If a bug appears at full speed and disappears when you slow one part of the program down, the bug is almost certainly a race. Something is happening in two places at once, and the win goes to whoever gets there first. Slow one side down and the race stops being a race. The two threads stop arriving at the same moment, the contended window stops getting hit, and the symptom evaporates whilst the underlying mistake sits there untouched.

This is also why "add more logging and see if it still happens" so often makes a race go away. A print is a syscall, a write to a file descriptor, possibly a flush. Each one nudges the timing. People conclude the logging "fixed" it, ship the extra logging, and feel uneasy for the next six months when it comes back under different load. It was never fixed. It was just hidden behind a bit of extra latency in exactly the wrong place.

A close-up of source code on a dark editor

I stopped trying to catch it with strace and started reading the code as if I already believed it was a race. The cache was a plain dict shared between the main work loop and a refresh thread. The refresh thread would, periodically, rebuild the cache. And it rebuilt it like this, give or take:

def refresh(self):
    new = self._load_from_db()
    self._cache.clear()        # <-- the gap
    self._cache.update(new)

There it is. Between clear() and update() the cache is empty. If the work loop does a lookup in that window, the key genuinely isn't there, and you get a KeyError for a key that exists everywhere except in that one fraction of a millisecond. Under strace the refresh thread was slowed down enough that the window shifted and, more to the point, the work loop was slowed down too, and the odds of landing in that exact gap dropped to nothing in practice.

why I half-believed the GIL would save me

The honest part of this story is that I had assumed, lazily, that Python's global interpreter lock would protect me here. It does protect individual bytecode operations. It does not protect a sequence of them. clear() and update() are two separate operations with a yield point between them, and the GIL will happily hand control to the other thread right in the middle. The GIL stops you corrupting the dict's internal structure. It does nothing for your logic being non-atomic across two calls.

The fix was to build the new dict fully and then swap the reference in a single assignment:

def refresh(self):
    new = self._load_from_db()
    self._cache = new          # atomic rebind, no empty window

Rebinding a name is a single operation as far as the running threads are concerned. There is never a moment where the cache is half-built. Readers either see the old dict or the new one, never an empty one. No lock, no window, no KeyError.

To convince myself rather than just hope, I wanted to actually reproduce the race on demand before claiming a fix. So I added a deliberate time.sleep(0.05) between the old clear() and update() on a test instance and pointed a tight loop of lookups at it. The KeyError rate went from "once an hour" to "constantly", which is exactly what you want from a repro: make the rare thing common by widening the window on purpose. With the bug now firing on every run, I applied the swap-the-reference fix and the error rate dropped straight to zero, sleep and all. That's the difference between believing you've fixed a race and knowing it: you make it worse first, then watch the fix beat it even with the odds stacked against you.

It held. The overnight pages stopped. The worker has run for weeks without a single one of those crashes since.

what I took away from it

A few things stuck.

  • A bug that vanishes under strace, ltrace, gdb, or even just heavier logging is telling you something specific: it's timing-dependent. Treat the disappearance as data, not as the tool being useless.
  • "The GIL makes this safe" is a sentence I will now always finish with "...for a single bytecode, and not for the two calls I actually wrote."
  • Build-then-swap beats mutate-in-place whenever a structure has readers. The empty window in the middle of a mutation is exactly the kind of thing that only ever bites you at 3am.

The irony is that strace did eventually solve it, just not by catching the crash. It solved it by refusing to reproduce, and that refusal told me precisely what kind of bug I was chasing. Sometimes the tool that won't show you the problem has already shown you the answer.