A service was failing roughly one start in ten with a missing file it had written a moment earlier. Annoying, intermittent, the worst kind. So I did the obvious thing and ran it under strace to watch the syscalls.
It worked. Every time. Twenty clean starts in a row. The instant I detached strace, the failures came back. I had, briefly, the dispiriting sense of debugging a poltergeist.
But the disappearance is the diagnosis. strace intercepts every syscall, which makes the traced process dramatically slower and changes its timing. If a bug vanishes under that, it is almost never the tracing fixing anything. It is a race, and the tracing has simply slowed one party down enough to lose it on purpose.
In this case a writer and a reader were racing on the same path: the code wrote the file in one step and a second process read it in the next, with no fsync, no rename-into-place, no signal that the write had landed. Most of the time the write won. Under strace it always won, because strace put a thumb on the scales. The fix had nothing to do with the trace and everything to do with making the ordering explicit: write to a temp file, fsync, atomic rename, and only then tell the reader to look. The poltergeist was a missing happens-before.