The worst bugs are the polite ones. They wait until you've stopped looking, do their damage, and tidy up after themselves. This one was a service that crashed roughly once or twice a day under load, left a useless stack trace, and then ran perfectly the instant I tried to observe it. The classic heisenbug: the act of measuring changed the result.
Let me set the scene, because the shape of the system matters. A Go service, fronting a hardware-ish thing over a Unix domain socket. The service accepted requests, did a small amount of work, then read a response back off the socket with a deadline. Under normal traffic it was flawless. Under burst traffic, occasionally, it would panic on a nil dereference somewhere deep in the response-parsing path, and the request that triggered it was never the same one twice.
the observation that wasn't
My first move was the obvious one. Reproduce it locally under load, attach strace, watch the syscalls, find the bad read. So I did. I wrote a small load generator, pointed it at a local instance, and got the crash reliably within a few minutes. Good. Reproducible bugs are tractable bugs.
Then I ran the exact same thing under strace -f -e trace=network,read,write, and it never crashed. Not once. I left it running for an hour. Nothing. Detached strace, crashed within minutes. Reattached, perfect behaviour.
That is an enormously useful result, even though it feels like the universe mocking you. A bug that vanishes under strace is almost never a memory-corruption ghost. It's a timing bug, and strace told me so by accident. strace slows the traced process down considerably, because every syscall traps into the tracer and back. So whatever race I had, it needed the program to be running at full speed. Slow it down, and the window closed.
slowing it down on purpose
If accidental slowdown hid the bug, deliberate, surgical slowdown might let me see it without hiding it. The trick with a timing bug is that you can't trace the whole thing, you have to widen one specific window and watch what falls into it.
I stopped reaching for strace and started reaching for the code. The response path read a length prefix off the socket, allocated a buffer, then read the body into it. My suspicion was the read between those two steps. So I added a single, ugly, temporary line: a few milliseconds of time.Sleep right after reading the length prefix and before reading the body, behind an environment variable so I could turn it on and off.
With the sleep on, the bug went from "once or twice a day" to "every single request". That's the moment a heisenbug stops being scary, when you can make it happen on demand, because now it's just a bug.
What the widened window exposed: the connection was being reused from a pool, and under burst load two goroutines could end up holding the same net.Conn. There was a connection-pool getter that, in the contended path, handed the same connection to two callers. One would read the length prefix for its own request, the other would read the body, the framing would desynchronise, and the parser would walk off the end of a half-read message and dereference a nil. Without the sleep, the two reads were close enough together in time that the collision was rare. The sleep made the collision the default.
the actual fault
Here's the embarrassing root cause. The pool's "get a connection" function had a check-then-act race:
func (p *pool) get() *conn {
if len(p.idle) > 0 {
c := p.idle[len(p.idle)-1]
p.idle = p.idle[:len(p.idle)-1]
return c
}
return p.dial()
}
No lock. Or rather, a lock that had been there once and got refactored out during a "tidy-up" months earlier, on the assumption that the pool was only ever touched by one goroutine. It wasn't, under load. Two goroutines could both observe len(p.idle) > 0, both index the same last element, and both walk away with the same *conn. A textbook check-then-act, hidden because it only lost the race under exactly the timing that strace destroyed.
The fix was a mutex around the whole get-and-return, which is what should have been there all along:
func (p *pool) get() *conn {
p.mu.Lock()
defer p.mu.Unlock()
if n := len(p.idle); n > 0 {
c := p.idle[n-1]
p.idle = p.idle[:n-1]
return c
}
return p.dial()
}
And then the part I should have started with. I ran the load generator one more time with -race. The Go race detector pointed straight at the conflicting reads and writes on p.idle and the shared conn, with both stacks, in seconds. If I'd reached for -race before strace, I'd have saved myself most of an afternoon. The lesson I keep relearning is to match the tool to the symptom: strace is for "what syscalls is it making", and this was never that. It was "two goroutines touching the same thing", which is exactly what the race detector exists to find.
what I took away
A bug that disappears under observation is data, not bad luck. It's telling you the bug lives in timing, so your job is to control timing, not to chase ghosts. Slow the suspect step down deliberately, on a flag, and see whether the bug gets worse. If it does, you've found the window. If it doesn't, you're looking in the wrong place and you've at least ruled something out cheaply.
And keep -race in your reflexes. It carries real overhead, so it's not for production, but for "occasionally crashes under load and I can reproduce it locally", it's the first thing to reach for, not the last. The whole episode took an afternoon. It should have taken twenty minutes, and would have, if I'd believed the very first clue: it only broke when it was running fast.