A service was crashing roughly once an hour, no pattern I could find, no useful stack. So I did the obvious thing and ran it under strace to see what it was up to at the moment it died. It then ran perfectly for the rest of the day. The instant I detached strace, it started crashing again within the hour. A textbook heisenbug: observing it changed it.
The reason is the cruel kind of obvious in hindsight. strace intercepts every system call and that adds latency to each one, which slows the process down considerably. The crash was a race condition, two threads touching the same bit of state without proper locking, and the race only lost when the timing lined up just so. strace's overhead widened the gap enough that the timing never lined up. I wasn't observing the bug, I was suppressing it.
The fix wasn't more tracing, it was reading the code with the race in mind and finding the shared structure being mutated from two threads with nothing guarding it. A mutex around the offending block and the hourly crashes stopped, with strace nowhere in sight.
The lesson I keep relearning: if a problem disappears the moment you instrument it, that's not a dead end, that's a clue. It's telling you the bug is about timing, and your instrument has changed the timing. Sometimes the tool that won't show you the bug has just told you exactly what kind of bug it is.