There is a particular flavour of dread that arrives when a bug stops reproducing the moment you try to observe it. Not "I can't reproduce it", which is annoying but normal. This was worse: it reproduced reliably, several times an hour, right up until I attached a debugger or ran it under strace, at which point it vanished completely and stayed vanished. The act of looking made it stop. Physicists named this after Heisenberg and it is every bit as smug as it sounds.
the symptom
The process was a data importer, a long-running daemon that pulled batches off a queue and wrote them to disk. Every so often, no clean pattern, it would crash with a corrupted read: the bytes it got back from a file were not the bytes it had written. Not garbage exactly, but a chunk from a different batch, as though two pieces of work had got their wires crossed.
I did the obvious thing and reached for strace, because if you don't know what a process is doing, watching its system calls is a reasonable first move:
strace -f -e trace=open,read,write,lseek -p 4412
And the crash stopped. Not "happened less". Stopped, completely, for the entire afternoon I left it traced. The moment I detached, within twenty minutes, it crashed again. I tried gdb instead, same result: under the debugger, perfect; on its own, broken.
why observation changes things
This is the bit that turns a Heisenbug from infuriating into a clue, once you stop taking it personally. A bug that disappears under observation is almost always a race condition, and the reason is mechanical, not mystical.
strace and gdb both slow the traced process down enormously. Every system call under strace traps into the tracer, gets logged, and returns; that's orders of magnitude slower than a bare syscall. A debugger does similar violence to your timing. So when a bug vanishes under tracing, the tracing hasn't fixed anything. It's changed the timing so the race almost never loses. The bug is still there. You've just made the unlucky interleaving astronomically less likely by making everything slow and serial-feeling.
Which immediately tells you the real cause is timing-dependent, and that two things are racing. That's not a fix, but it's a direction, and it stops you hunting for a logic bug in code that's actually correct in isolation.
finding it without scaring it off
The trick is to observe in a way that doesn't perturb the timing. strace is a sledgehammer. What I wanted was a quiet observer.
So I added logging inside the process itself, to a ring buffer in memory rather than to disk or stderr, recording for each batch the file descriptor it opened, the offset it wrote at, and the thread that did it. Writing a few integers into a pre-allocated in-memory buffer is cheap enough that it doesn't meaningfully change the timing, so the bug would still fire. Then, on the corrupted read, I dumped the buffer.
The dump was damning. Two worker threads, processing two different batches, had been handed the same file descriptor number. One thread opened a temp file, got fd 7, and started writing. The other thread, a hair later, also opened a temp file. The first thread closed its fd 7 when it finished. The second thread's open reused 7, the kernel being entirely within its rights to recycle a just-freed descriptor. And a third piece of code, holding a cached copy of "the fd for batch A is 7" from before, happily read from 7 after it had been reassigned to batch B.
So it was never file corruption at all. The bytes on disk were always fine. The process was reading the right bytes from the wrong file, because a file descriptor is just a small integer and we'd cached one across the boundary where it was no longer valid. Under strace, the threads were slowed and spread out enough that the close-and-reuse window practically never lined up with a stale read. On its own, at full speed, it lined up several times an hour.
the fix, and the lesson
The fix was to stop passing raw file descriptors around between the layers and to stop caching them at all. Each piece of work carried its file object for its whole lifetime and closed it exactly once at the end, with no other code allowed to hold a copy of the number. A descriptor is only meaningful inside the bit of code that owns it, and the bug was entirely about that ownership being shared when it shouldn't have been.
The lesson I took, and keep taking, is that a bug which hides from your tools is telling you something specific: it's about timing. Don't fight to make strace reproduce it. Find a way to watch that doesn't change the clock, an in-process ring buffer, a counter, a cheap atomic log, and let the bug happen while you're quietly taking notes. The Heisenbug isn't being clever. It's just sensitive to exactly the variable your debugger stamps all over, and once you know that, you know roughly what you're looking for.