A worker process crashed roughly once an hour with a segfault. Annoying, but reproducible, which is half the battle. So I did the obvious thing and ran it under strace -f to see the last syscall before it died.
It stopped crashing. Completely. I left it overnight under strace and it ran clean for fourteen hours. Killed the trace, and within the hour it fell over again. The classic Heisenbug: observing it changed it. strace works through ptrace, which stops the traced process at every syscall entry and exit so the tracer can inspect it, and that added latency was enough to paper over the actual problem.
The problem was a race. Two goroutines (well, two threads under the C library this thing wrapped) touched the same buffer during teardown, and normally one won the race comfortably. About once an hour it didn't, and that was the segfault. strace's per-syscall overhead stretched the gap between the two threads just enough that the loser always arrived late, after the buffer was already safely freed and rebuilt. The trace did not hide the bug so much as accidentally fix it, by making the fast path slow.
The tell was that strace "fixing" it pointed straight at timing rather than logic. I dropped the tracer, built with ThreadSanitizer instead, and it named the exact two stacks racing on the free. Five minutes after that the lock was in the right place. Sometimes the most useful thing a tool tells you is that it had no business helping at all.
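For concreteness, here is a minimal sketch of the shape of that fix. It is not the real code: `struct shared`, `teardown`, `late_writer`, and `run_teardown_once` are hypothetical names, and I am assuming plain pthreads with one mutex guarding the buffer. The point is only that the free-and-rebuild and the late write both happen under the same lock.

```c
/* Hypothetical sketch: struct shared, teardown, late_writer and
 * run_teardown_once are illustrative names, not the real library's. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 64

struct shared {
    pthread_mutex_t lock;
    char *buf; /* may be NULL if the rebuild allocation fails */
};

/* Teardown path: free the old buffer and rebuild it under the lock. */
static void *teardown(void *arg)
{
    struct shared *s = arg;
    pthread_mutex_lock(&s->lock);
    free(s->buf);
    s->buf = calloc(1, BUF_SIZE);
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Second thread: writes into the buffer while teardown may be running. */
static void *late_writer(void *arg)
{
    struct shared *s = arg;
    pthread_mutex_lock(&s->lock);
    if (s->buf != NULL)
        memset(s->buf, 'x', BUF_SIZE - 1); /* leave the final NUL alone */
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Run both threads once. Returns 0 on success, -1 if setup failed. */
int run_teardown_once(void)
{
    struct shared s;
    pthread_t a, b;

    if (pthread_mutex_init(&s.lock, NULL) != 0)
        return -1;
    s.buf = calloc(1, BUF_SIZE);
    if (s.buf == NULL) {
        pthread_mutex_destroy(&s.lock);
        return -1;
    }
    if (pthread_create(&a, NULL, teardown, &s) != 0) {
        free(s.buf);
        pthread_mutex_destroy(&s.lock);
        return -1;
    }
    if (pthread_create(&b, NULL, late_writer, &s) != 0) {
        pthread_join(a, NULL);
        free(s.buf);
        pthread_mutex_destroy(&s.lock);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* Whichever thread ran first, the buffer is whole and terminated. */
    assert(s.buf == NULL || s.buf[BUF_SIZE - 1] == '\0');

    free(s.buf);
    pthread_mutex_destroy(&s.lock);
    return 0;
}
```

Deleting the lock/unlock pairs turns this back into the bug: the `memset` can land on memory that `free` has already released. If you build that unlocked variant with `-fsanitize=thread -g` under GCC or Clang and call `run_teardown_once` in a loop, that is the kind of setup where I would expect ThreadSanitizer to report both stacks, though how quickly it fires depends on scheduling.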