Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

three days, one race, and a log line that lied to me

A field-by-field account of hunting an intermittent race condition that only showed up under load, and the small assumption that hid it for three days.

A bug crawling across a terminal

The bug reported itself once a day, roughly, always in production, never anywhere I could attach a debugger. A request would come back with a stale value: the right shape, the right type, just the data from somebody else's request a few milliseconds earlier. Maybe one in fifty thousand. Enough to be real, rare enough to be infuriating.

This is the long version of three days, because the short version ("it was a shared buffer") tells you nothing about how I got there, and the getting-there is the only part worth keeping.

Day one: not believing it

The first day I spent not believing the bug. That's an honest description. Intermittent, low-rate, cross-request data bleed feels like a caching layer doing something clever, so I went and read the cache config, found nothing, and convinced myself it was a CDN edge serving a stale response. It wasn't. The responses had no-cache headers all the way down and I'd checked them an hour earlier and then forgotten.

The wasted day did produce one useful thing. I added a request ID, generated at ingress, carried through every layer, and stamped on every log line. If a response was going to come back with the wrong data, I wanted to be able to prove which two requests had got tangled.

Day two: the log line that lied

With request IDs in place I caught a tangled pair. Request A returned data clearly belonging to request B, and the timestamps showed they'd been in flight at the same moment. So far, so race.

I went looking in the obvious place: a struct we passed down into the handler. I added a log line at the point we populated it.

log.Printf("req=%s populated user=%s", reqID, ctx.User.ID)

The logs were clean. Every request populated its own struct with its own user. No overlap. I stared at this for most of an afternoon, because the logs said the bug could not exist and the responses said it did. One of them was wrong, and logs feel authoritative, so I doubted the responses again. That cost me hours.

The log line wasn't lying about what it printed. It was lying by omission. It logged the user we put in. It said nothing about what we read back out further down, and the gap between those two events was exactly where the damage was.

A diff highlighting a shared buffer

Day three: the shared thing

The third morning I stopped trusting any single point and instrumented the read side as well as the write side, logging the pointer address alongside the request ID.

log.Printf("req=%s read user=%s ptr=%p", reqID, u.ID, u)

And there it was, plain as anything. Two different request IDs, same pointer address. They were reading and writing the same object.

The cause was a pool. We'd added a sync.Pool for these structs months earlier to cut allocation pressure, which it did, very effectively. The rule with a pool is that you return the object when you're done and you do not touch it afterwards. Somewhere in a refactor, a deferred function had ended up holding a reference to the struct and reading from it after it had been put back into the pool and handed to the next request. Under light load the window was too small to ever collide. Under production load, one in fifty thousand requests landed inside it.

The fix was three lines: stop returning the object to the pool until the deferred work had genuinely finished, which meant moving the Put to the very end of the call rather than where it read tidily. Less elegant. Correct.

The corrected lifecycle, drawn out

What I'd tell myself on day one

Two things would have saved me two days.

First, when logs and behaviour disagree, the logs are incomplete before they're wrong. I logged the write and trusted it to describe the read. A pointer address in the log line from the start would have ended this on day one. I now reach for %p early whenever I suspect shared state, and I feel slightly foolish that I don't do it by default.

Second, a race that only appears under load wants to be reproduced under load, not stared at under none. I eventually wrote a small loop that hammered the handler from a few hundred goroutines and ran it under go test -race, and the race detector caught it in under a second. I could have written that loop on the morning of day one. I wrote it on the afternoon of day three, after I already knew the answer, mostly to prove to myself I'd actually fixed it.

The data bleed has not recurred. The request ID stayed, the load test stayed in the suite, and I kept the pool. Pools are fine. It's the assumption that an object is yours after you've given it back that gets you.