Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the bug that only existed when nobody was looking

A flaky failure that disappeared the moment strace was attached, which was itself the clue: the syscall tracing slowed the process just enough to hide a race condition.

A terminal mid-strace with syscalls scrolling past

A service was failing roughly one start in ten with a missing file it had written a moment earlier. Annoying, intermittent, the worst kind. So I did the obvious thing and ran it under strace to watch the syscalls.

It worked. Every time. Twenty clean starts in a row. The instant I detached strace, the failures came back. I had, briefly, the dispiriting sense of debugging a poltergeist.

But the disappearance is the diagnosis. strace intercepts every syscall, which makes the traced process dramatically slower and changes its timing. If a bug vanishes under that, it is almost never the tracing fixing anything. It is a race, and the tracing has simply slowed one party down enough to lose it on purpose.

In this case a writer and a reader were racing on the same path: the code wrote the file in one step and a second process read it in the next, with no fsync, no rename-into-place, no signal that the write had landed. Most of the time the write won. Under strace it always won, because strace put a thumb on the scales. The fix had nothing to do with the trace and everything to do with making the ordering explicit: write to a temp file, fsync, atomic rename, and only then tell the reader to look. The poltergeist was a missing happens-before.