Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the bug that only existed when nobody was watching

A race condition that disappeared the moment I attached strace, and how slowing the program down was the whole point.

A terminal showing a debugging session

The worst bugs are the ones that go shy when you look at them. We had a worker process that would, perhaps once every few thousand runs, write a corrupted file: half the new content, half the old, spliced at no sensible boundary. Reproducing it took a tight loop and a lot of patience. And then the moment I attached strace to watch the syscalls, it stopped happening. Of course it did.

A bug that vanishes under observation is not magic. It is telling you something about timing. strace does not run alongside your process politely. It intercepts every syscall, which adds latency to each one, which changes the scheduling, which is exactly the variable your bug depends on. By measuring, I had changed the thing I was measuring.

A close-up of source code on screen

So I stopped trying to catch it in the act and reasoned about it instead. The corruption pattern, old content and new content interleaved, only makes sense if two things are writing the same file at once. The code "knew" that could not happen because it wrote to a temporary file and renamed it into place. The classic safe pattern. Except the temporary filename was built from a timestamp with second resolution, and under load two workers could start within the same second and pick the same temp name. They then trampled each other's writes before either renamed.

I confirmed it without strace by adding a single log line that recorded the temp filename and the process id every time a worker started a write, then leaving the loop running and grepping the output afterwards. No live observation, no scheduling perturbation, just a record I could read at leisure. Sure enough, every corruption in the log had two workers claiming the same temporary name within the same second. The evidence was there the whole time, I had just been trying to watch it happen instead of letting it leave a trail.

The fix was boring, which is how you know it was right:

# before: collides when two workers start in the same second
tmp = f"{target}.{int(time.time())}.tmp"

# after: unique per process and call
tmp = f"{target}.{os.getpid()}.{uuid4().hex}.tmp"

What I take from this is less about the bug and more about the tool. strace did not fail me by hiding the bug. It told me, very loudly, that timing was the variable, because changing the timing changed the outcome. The disappearing act was the diagnosis. Once I read it that way the rest was just careful thinking, no tooling required.

The general rule I now keep: when a bug runs away from your instrumentation, ask what your instrumentation changed. Latency, scheduling, buffering, ordering. The Heisenbug is not refusing to be caught. It is pointing at the answer and you are looking at your hands.