The Bug That Refused To Exist Under strace


The service fell over roughly once an hour. Not a clean crash with a stack trace, just a worker that wedged and stopped answering, then got reaped by the health check. Reliable enough to be annoying, rare enough to be infuriating.

So I did the obvious thing and ran it under strace. And the bug went away. Completely. The worker that died every hour ran clean for the rest of the afternoon. Detach strace, wait, and within the hour it wedged again. Classic heisenbug: the act of observing it changed the timing enough to hide it.

That detail is the clue, not the curse. If slowing a thing down makes a bug disappear, you almost certainly have a race. strace stops the process at every system call, which is a small eternity in CPU terms, and that was enough to shift the timing so that the two goroutines, both convinced they owned the same connection, no longer overlapped. Without strace, under load, they did overlap: one closed the socket the other was mid-write on, and the write blocked forever on a half-dead fd with no timeout.
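To make the "blocked forever" part concrete, here is a minimal sketch of a write with nothing bounding it. It is not the service's code: net.Pipe with a peer that never reads is only a stand-in for the half-dead connection, and writeStalls is a name made up for this illustration.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// writeStalls reports whether a Write with no deadline is still blocked
// after wait has elapsed. The peer end of the pipe never reads, which is an
// assumed stand-in for the half-dead connection, not the original code.
func writeStalls(wait time.Duration) bool {
	client, server := net.Pipe()
	defer server.Close()

	done := make(chan error, 1)
	go func() {
		// No SetWriteDeadline here: nothing limits how long this call blocks.
		_, err := client.Write([]byte("ping"))
		done <- err
	}()

	select {
	case <-done:
		return false // the write returned on its own
	case <-time.After(wait):
		client.Close() // unblock the writer so the goroutine can exit
		return true
	}
}

func main() {
	fmt.Println("write still blocked after 100ms:", writeStalls(100*time.Millisecond))
}
```

The only reason the goroutine ever returns is that the demo closes its end of the pipe. In a worker with no such escape hatch, that goroutine simply never comes back, which is what a wedged worker looks like from the outside.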

The fix was boring, as the good ones are: a deadline on the write and proper ownership of the connection rather than a shared pointer two paths both reached for. I confirmed the diagnosis the cheap way first, with a time.Sleep of a few milliseconds in the suspect path, which made the hang reproduce on demand without strace involved. Once you can summon a heisenbug deliberately, it stops being a ghost and becomes a bug like any other.
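For the shape of that fix, here is a rough sketch under stated assumptions rather than the actual patch. One goroutine owns the connection, everything else hands it payloads over a channel, and every write carries a deadline. The names writeReq, ownConn, send and demo, the 200ms timeout, and the use of net.Pipe as a fake peer are all invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// writeReq is a hypothetical request type: a payload plus a reply channel.
type writeReq struct {
	payload []byte
	result  chan error
}

// ownConn is the only code that writes to or closes conn. Other goroutines
// never see the connection; they send requests on reqs instead.
func ownConn(conn net.Conn, timeout time.Duration, reqs <-chan writeReq) {
	defer conn.Close() // closed exactly once, and only here
	var failed error
	for req := range reqs {
		if failed == nil {
			// Bound the write so a stalled peer cannot wedge the owner.
			failed = conn.SetWriteDeadline(time.Now().Add(timeout))
			if failed == nil {
				_, failed = conn.Write(req.payload)
			}
		}
		// After the first failure, every later request gets the same error.
		req.result <- failed
	}
}

// send hands a payload to the owner and waits for the outcome.
func send(reqs chan<- writeReq, p []byte) error {
	req := writeReq{payload: p, result: make(chan error, 1)}
	reqs <- req
	return <-req.result
}

// isTimeout reports whether err is a network timeout such as a missed deadline.
func isTimeout(err error) bool {
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

// demo talks to a peer that reads one message and then stops reading.
func demo() (first, second error) {
	client, server := net.Pipe()
	defer server.Close()

	reqs := make(chan writeReq)
	go ownConn(client, 200*time.Millisecond, reqs)
	defer close(reqs) // lets the owner exit and close the connection

	go func() {
		buf := make([]byte, 16)
		server.Read(buf) // consume the first message only
	}()

	first = send(reqs, []byte("ping"))
	second = send(reqs, []byte("ping"))
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("first write error:", first)
	fmt.Println("second write timed out:", isTimeout(second))
}
```

In this sketch the first write should succeed and the second should come back as a timeout instead of hanging. Routing writes through a channel is one way to get single ownership; a mutex around the connection would also serialize access, but it leaves open the question of who is allowed to call Close.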