The symptom was a query coming back with the wrong result set. Not malformed, not an error, just rows that belonged to a different request. About one call in fifty thousand, which on a quiet service is roughly never and on this one was a few times an hour. Enough to be real, not often enough to catch.
My first three guesses were all wrong, and all comforting, which is how you know they're wrong. I blamed the database. I blamed a stale read replica. I blamed, briefly and shamefully, "the network". None of them survived contact with the logs, because the logs showed two different requests touching what was clearly the same underlying connection in the same handful of milliseconds.
the connection pool was lying to me
It was a homegrown connection pool. Of course it was. Someone, possibly past me, had decided the standard one wasn't quite right and written a thin wrapper around it. The wrapper kept a free list and handed connections out on request.
The bug was in the hand-back. When a request finished, it returned its connection to the pool. But there was a path, an early return on a context cancellation, where the connection was returned to the free list and then still used for one final write of an error response. So the pool would hand that same connection to the next caller while the previous caller was mid-sentence on it.
Two goroutines, one connection, interleaved reads and writes. The result was exactly the sort of "wrong rows" you'd expect: caller B's query went out, caller A's leftover bytes came back, and the protocol layer was just credulous enough to parse them.
why it took three days
I want to be honest about the three days, because the bug itself is a one-liner. The first day I spent not believing it was a race at all. The second day I spent adding logging that perturbed the timing enough to make it vanish, which is the cruellest thing a race can do to you. Every probe I added moved the window.
What finally pinned it was going the other way. Instead of more logging, I added a deliberate runtime.Gosched() and a tiny sleep right after the connection was returned, to widen the window rather than narrow it. The failure rate went from one in fifty thousand to one in twenty, and suddenly it was a bug I could sit and watch.
func (p *pool) put(c *conn) {
select {
case p.free <- c:
default:
c.Close()
}
// the cancellation path below was still writing to c here
}
The fix was to make the cancellation path take a copy of what it needed before returning the connection, and to never, ever touch a connection after handing it back. One ownership rule, written down in a comment so the next person doesn't rediscover it the hard way: once you've called put, that connection is no longer yours.
I added a race detector run to CI the same afternoon. It would have caught this in seconds if anyone had ever run the tests under -race, which is the part I'm least proud of. The tool was sitting right there the whole time, and the three days were mostly me deciding to trust it.