The error was a connection timeout, intermittent, about one request in fifty, and I was completely certain it was a race in our connection pool. I knew this the way you know things at 9am with not enough coffee in you: confidently and wrongly.
So I read the pool code. I added counters. I wrote a little reproduction that hammered it with a thousand concurrent requests, and it sat there serenely, never timing out once. The reproduction was faster and angrier than production and it could not be made to fail. That should have told me something straight away, but I'd already decided what the bug was, so I just assumed my reproduction wasn't realistic enough.
It wasn't a race. It was DNS. The service talked to a backend by hostname, and one of the resolvers in /etc/resolv.conf was a stale entry pointing at a box that had been decommissioned weeks before. Most lookups hit the first resolver and returned instantly. Occasionally one fell through to the dead one, sat there for the full five-second timeout, then retried. One request in fifty, five seconds, looks exactly like a timeout under load. It was never load. It was a resolv.conf nobody had touched since the migration.
The fix was a one-line deletion. The lesson was that I'd spent two days debugging the code because I'd decided, before looking at anything, that the code was where the bug lived. The code was fine. My assumptions were the bug. I now try to make myself state, out loud, "what am I assuming is true here that I haven't actually checked?" It is a stupid little ritual and it would have saved me a day and a half.