There is a particular kind of dread that arrives when a bug behaves differently because you are observing it. A normal bug, however nasty, sits still while you poke it. A heisenbug moves. You reach for it and it steps aside, and the act of looking changes the thing you are looking at. This one cost me the best part of two days, and the most maddening part was that my single most reliable debugging tool was the thing making it disappear.
The symptom: a batch worker crashed roughly once every few hundred runs. Not a clean exit, a genuine crash, occasionally a corrupted output file left behind. Rare enough to be infuriating, common enough to matter. So I did what I always do, which is reach for strace to watch the syscalls and see exactly where it falls over.
strace -f -o /tmp/worker.trace ./worker --input batch.json
And it ran perfectly. Of course it did. I ran it again. Fine. A hundred times in a loop under strace, not a single crash. Detach strace, run it bare, crash within thirty runs. The tool that was supposed to show me the bug was the one thing guaranteed to suppress it.
what strace actually changes
If you have not hit this before, the important thing to understand is that strace is not free. It is built on ptrace, so the kernel stops the process at the entry and exit of every syscall it makes, which means every open, read, write, stat, all of it, takes a detour through the tracer. The process runs, but it runs slower, and crucially the timing between operations stretches out. For most bugs that does not matter. For a race condition it matters enormously, because a race is a bug about timing, and strace changes the timing.
So the fact that strace hid the crash was not a dead end. It was the biggest clue I had. It told me this was almost certainly a race, because the thing that made it vanish was slowing the process down. A logic error does not care how fast you run. A race cares about almost nothing else.
finding the window without scaring it off
The trick with an observer-sensitive bug is to observe it as lightly as possible. strace was too heavy. I needed something that left the timing more or less intact. So I went lighter: a handful of log lines written with timestamps, buffering off, and ltrace ruled out early because it was nearly as intrusive as strace. Mostly I sat and read the code with the working hypothesis "this is a race" instead of "this is a crash", which is a completely different way of reading.
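For illustration, the kind of log line I mean can be sketched as below. This is a minimal sketch, not the worker's actual code: the `trace` helper and its event strings are made up for the example. It uses wall-clock nanoseconds so lines from two overlapping processes can be lined up against each other, and it flushes on every call so buffering cannot delay or reorder the evidence.

```python
import os
import sys
import time


def trace(event, stream=sys.stderr):
    """Write one timestamped line and flush it immediately.

    `trace` is a hypothetical helper for this example. The pid is included
    so lines from overlapping runs can be told apart.
    """
    line = f"{time.time_ns()} pid={os.getpid()} {event}\n"
    stream.write(line)
    stream.flush()  # push the line out now rather than when a buffer fills
    return line
```

Calling `trace("before write_chunk")` and `trace("before rename")` around the suspect lines costs a couple of small writes per run, far less than stopping the process on every syscall.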
That reframing is the whole game. When you believe you are looking for a crash, you look for the line that is wrong. When you believe you are looking for a race, you look for two things that touch the same resource without agreeing on order. And there it was, almost immediately, once I knew what shape I was hunting.
The worker wrote intermediate results to a temp file with a predictable name:
import os

tmp = f"/tmp/worker-{os.getpid()}.tmp"  # predictable: derived only from the pid
write_chunk(tmp, data)
os.rename(tmp, final_path)
Looks fine. The bug was that under our batch runner the same job could, in a rare scheduling case, be kicked off twice with overlapping lifetimes, and os.getpid() was not as unique as I had leaned on it being once pids started getting reused on a long-lived host. Two runs, same temp name, both writing, one renaming out from under the other. The corruption was two writers interleaving. The crash was one process trying to rename a file the other had already moved.
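The collision can be replayed deterministically in a single process by performing the two runs' steps by hand in one unlucky order. This is a sketch under assumptions: `write_chunk` here is a stand-in that appends bytes to a path (the real helper is not shown in this post), and the file names are hypothetical.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
shared_tmp = os.path.join(workdir, "worker-1234.tmp")  # hypothetical colliding name
final_path = os.path.join(workdir, "result.out")


def write_chunk(path, data):
    # Stand-in for the worker's helper: append bytes to the file at `path`.
    with open(path, "ab") as f:
        f.write(data)


# Replay one unlucky interleaving of run A and run B by hand.
write_chunk(shared_tmp, b"AAAA")       # run A writes
write_chunk(shared_tmp, b"BBBB")       # run B writes into the same file
os.rename(shared_tmp, final_path)      # run A renames first

try:
    os.rename(shared_tmp, final_path)  # run B: the source is already gone
    crashed = False
except FileNotFoundError:
    crashed = True

with open(final_path, "rb") as f:
    contents = f.read()                # both runs' bytes in one file
```

Both symptoms show up at once: the final file holds the two runs' data mixed together, and the second rename raises because its source no longer exists.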
Under bare execution, the two writers overlapped often enough to collide every few hundred runs. Under strace, everything slowed down so much that the first run finished its rename long before the second got near the file. The overlap window closed. The bug "went away", not because it was fixed but because I had stretched the two runs so far apart that they no longer overlapped.
the fix and the lesson
The fix was to make the temp name genuinely unique, with tempfile.mkstemp rather than a pid-based guess, so two runs can never pick the same path:
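import os
import tempfile

fd, tmp = tempfile.mkstemp(dir="/tmp", suffix=".tmp")
os.close(fd)  # write_chunk takes a path, as before, so release the descriptor mkstemp opened
write_chunk(tmp, data)
os.rename(tmp, final_path)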
mkstemp creates the file atomically with a name nothing else will collide with, which is the entire point of it, and which I should have reached for at the start instead of hand-rolling a name out of the pid.
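A slightly fuller version of the same idea, as a sketch rather than what the worker actually runs: `atomic_write` is a hypothetical helper name, and it makes two choices the snippet above does not. It creates the temp file in the same directory as the final path, because a rename only works within one filesystem, and it removes the temp file if anything fails along the way.

```python
import os
import tempfile


def atomic_write(final_path, data):
    """Write `data` to a uniquely named temp file, then move it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # mkstemp opens the file with O_CREAT | O_EXCL, so it never reuses an existing name.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:  # fdopen takes ownership of fd and closes it
            f.write(data)
        os.replace(tmp, final_path)     # atomic on POSIX when both paths share a filesystem
    except BaseException:
        # Do not leave a stray temp file behind if the write or rename fails.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

Readers of `final_path` then see either the old contents or the new ones, never a half-written file.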
I also went back and dealt with the deeper cause, which was the batch runner kicking off the same job twice in the first place. The unique temp name meant the two runs no longer corrupted each other's output, but two runs still meant the work was being done twice, and one of them would overwrite the other's final file. So I added a lock keyed on the job identity, taken before the work starts, so a second invocation of the same job either waits or bows out rather than racing. The temp-file fix stopped the crash; the lock stopped the duplication. Two separate bugs wearing one symptom, which is its own small lesson: once you have fixed the thing that made it crash, ask whether the thing that made it crash was itself a symptom of something further up.
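One way to build a lock like that on a single Unix host is an advisory file lock. The sketch below is an assumption about shape, not the runner's real code: `try_job_lock`, the lock directory, and the idea that a job is identified by a string are all invented for the example. It implements the "bows out" behaviour; dropping `LOCK_NB` would make the second invocation wait instead.

```python
import fcntl
import hashlib
import os
import tempfile


def try_job_lock(job_id, lock_dir=None):
    """Return an open lock file if this process now owns the job, else None."""
    lock_dir = lock_dir or tempfile.gettempdir()
    # Hash the job identity so any string is safe to use as a file name.
    name = hashlib.sha256(job_id.encode()).hexdigest() + ".lock"
    handle = open(os.path.join(lock_dir, name), "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle  # keep it open for the whole job; closing it releases the lock
```

The worker would call `try_job_lock(job_id)` before doing anything else and exit quietly on `None`. Because the kernel drops the lock when the holder's file is closed, a crashed run cannot leave the job locked forever.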
To confirm the race theory before I trusted the fix, I did the thing that should have occurred to me far sooner: I reproduced it deliberately. A short script firing the worker many times in parallel turned "once every few hundred runs" into "almost every run", which is the difference between a bug you hope you fixed and a bug you can prove you fixed. With the parallel harness in place I could run the broken version and watch it fail in seconds, apply the fix, and watch the same harness come up clean. A heisenbug stops being a heisenbug the moment you can make it appear on demand.
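A harness of that kind does not need to be clever. The following is a minimal sketch, with `count_failures` and its default numbers chosen for the example rather than taken from my actual script: it launches the command many times with several copies in flight at once and counts the runs that exit non-zero, which includes runs killed by a signal.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def count_failures(cmd, runs=200, parallel=16):
    """Run `cmd` `runs` times, `parallel` at a time; return how many exited non-zero."""

    def one_run(_):
        # Discard output; only the exit status matters here.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        return result.returncode

    # Threads are enough: each one just blocks waiting on a child process.
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        codes = list(pool.map(one_run, range(runs)))
    return sum(1 for code in codes if code != 0)
```

Something like `count_failures(["./worker", "--input", "batch.json"])` run against the broken and the fixed build gives a before-and-after number instead of a hunch.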
What I took away was not really about temp files. It was about what to do when a bug changes under observation. The instinct is to feel cheated, as though the bug is cheating. It is not. It is telling you something specific: that timing is part of the cause. When strace makes a bug vanish, that is not strace failing to find it. That is strace pointing straight at a race and saying "here, it's about speed". Read the disappearance as evidence, not as a wall, and the observer effect stops being a curse and starts being the clue.