the allocation i couldn't see until i put it under load

Code on a screen, a programming workspace

A profile of an idle service tells you almost nothing useful. I keep relearning this. The interesting behaviour, the contention, the allocation churn, the lock convoy, only shows up when the thing is genuinely busy, and "genuinely busy" is hard to fake at your desk.

This particular service was fine at ten requests a second and fell over a cliff somewhere north of a few hundred. p99 latency went from single-digit milliseconds to over a second. The CPU profile I took locally, hammering it with a quick loop, showed nothing alarming. Of course it didn't. One client doesn't reproduce a concurrency problem.

So the first move was load, not profiling. I put vegeta in front of it at a rate that actually hurt, then grabbed the profile from the live thing while it was suffering:

go tool pprof -seconds 30 http://localhost:6060/debug/pprof/profile

The CPU profile under load was more honest, but the real culprit turned up in the allocation profile. pprof has a heap profile and, more usefully here, the mutex and block profiles, which you have to switch on explicitly:

runtime.SetMutexProfileFraction(5)
runtime.SetBlockProfileRate(10000)

The block profile lit up on a single sync.Mutex guarding a map that every request touched. Not a slow operation in itself, just one that everybody had to queue for. Under light load nobody waited. Under real load the whole service was effectively single-threaded through that one lock, and all the lovely goroutines I'd been so pleased with were standing in a line.

The fix was boring: shard the map, or in the end just reach for a sync.Map for that access pattern, which suited it. p99 came back down to where it belonged. The lesson, again, was not about the fix. It was that the profile I trusted first was taken under conditions that couldn't reveal the bug.

A few things I'd tell my earlier self. Always expose net/http/pprof on an internal port; the cost of having it there and unused is nothing, the cost of not having it during an incident is an afternoon. Profile the system under a load that resembles production, not under a load that's convenient. And when CPU looks innocent and latency is still bad, the answer is usually waiting, not work, so go straight for the block and mutex profiles. The flame graph of where your program is blocked is often more interesting than where it's running.