A worker process was crashing maybe once an hour, no pattern I could see, no useful stack. The obvious move was to run it under strace -f and watch. So I did, and it ran clean for the rest of the afternoon. Stopped strace, crash came back within twenty minutes. Classic heisenbug: the act of observing made it go away.
The reason it went away is the whole point. strace slows the traced process down enormously, ptrace stops on every syscall, and that slowdown was just enough to change the timing of a race. Two threads were touching a shared structure without a lock, and under normal speed they collided often enough to corrupt it. Under strace, one thread was always far enough behind the other that they never overlapped. The bug wasn't gone, it was hiding behind the latency I'd added.
Once I knew it was a timing race rather than a logic error, the search narrowed fast. I stopped looking for a bad value and started looking for unsynchronised shared state, found a counter being incremented from two goroutines without any protection, and the data race detector confirmed it in one run.
The lesson I keep relearning: if a bug vanishes the instant you instrument it, that's not bad luck, that's evidence. The tool changed the timing, so the timing was the bug. Reach for a race detector before you reach for printf.