The service was fine in every benchmark I'd written and miserable under real traffic. P99 latency climbed steadily under load and the only thing the dashboards agreed on was that the garbage collector was busy. That's the trap with microbenchmarks: they exercise one function in a tight loop, which is exactly the condition under which the thing that's actually slow stays hidden.
So I stopped guessing and turned on pprof. If you've not wired it in, it's one import and one goroutine:
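import (
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect import: registers /debug/pprof/ handlers on http.DefaultServeMux
)
go func() { // run from main, alongside the real server
	log.Println(http.ListenAndServe("localhost:6060", nil))
}()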
That registers the profiling handlers on the default mux and serves them on a loopback-only port. Then put the thing under sustained load (I used hey because it's trivial) and grab a CPU profile while it's suffering:
hey -z 60s -c 100 http://localhost:8080/api/things
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
The CPU profile pointed at JSON encoding, which I half expected. But top in pprof only told me where time went, not why there was so much of it. The allocation profile was the one that told the story:
go tool pprof -sample_index=alloc_space http://localhost:6060/debug/pprof/heap
list on the offending function showed the culprit in about ten seconds of reading. Every request was building a fresh map[string]interface{} to shape the response, then handing it to json.Marshal. Under one request at a time that's cheap. Under a hundred concurrent requests it's a flood of short-lived allocations, the GC runs constantly to keep up, and every goroutine pays the tax: CPU lost to background marking, mark assists charged to whichever goroutines are allocating, and a brief stop-the-world pause at each end of every cycle. The benchmark never saw it because the benchmark never had a hundred goroutines all allocating at once.
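A benchmark can be pushed a little closer to that condition with b.RunParallel. The following is a sketch, not the service's real code: buildResponse and its fields are hypothetical stand-ins for the handler's work, and the parallelism multiplier of 100 is an arbitrary choice. It is still a benchmark, so it approximates concurrent allocation pressure rather than real traffic.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// buildResponse is a hypothetical stand-in for the per-request work:
// shape the response as a dynamic map, then marshal it.
func buildResponse(id int) ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"id":   id,
		"name": "widget",
		"tags": []string{"a", "b"},
	})
}

// benchParallel calls buildResponse from many goroutines at once.
func benchParallel(b *testing.B) {
	b.ReportAllocs()
	// RunParallel starts parallelism*GOMAXPROCS goroutines.
	// 100 is an arbitrary multiplier for this sketch.
	b.SetParallelism(100)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// The error is ignored here; only the allocation pattern matters.
			_, _ = buildResponse(42)
		}
	})
}

func main() {
	// testing.Benchmark lets the benchmark run from an ordinary program.
	res := testing.Benchmark(benchParallel)
	fmt.Println(res.String(), res.MemString())
}
```

The same benchParallel body can live in a _test.go file as a normal BenchmarkXxx function and be run with go test -bench; running it from main just keeps the sketch in one file.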
The fix was unglamorous. I replaced the dynamic map with a concrete struct and proper json tags, so the encoder could do its job without me building an intermediate object first. Allocations per request dropped by roughly two thirds, GC frequency fell off a cliff, and P99 came back down to where it should have been all along.
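The shape of that change looks roughly like the sketch below. The Thing type, its fields, and the two encode functions are made up for illustration; the real response had its own fields. Both versions produce the same JSON, and testing.AllocsPerRun gives a quick per-call allocation count for comparing them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// Thing is a hypothetical response shape; the field names are illustrative.
type Thing struct {
	ID   int      `json:"id"`
	Name string   `json:"name"`
	Tags []string `json:"tags"`
}

// encodeMap mirrors the original approach: build a dynamic map on every
// call, then marshal it.
func encodeMap(id int, name string, tags []string) ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"id":   id,
		"name": name,
		"tags": tags,
	})
}

// encodeStruct marshals a concrete struct with json tags instead, so no
// intermediate map is built.
func encodeStruct(id int, name string, tags []string) ([]byte, error) {
	return json.Marshal(Thing{ID: id, Name: name, Tags: tags})
}

func main() {
	tags := []string{"a", "b"}

	// AllocsPerRun reports the average number of heap allocations per call.
	mapAllocs := testing.AllocsPerRun(1000, func() { _, _ = encodeMap(1, "widget", tags) })
	structAllocs := testing.AllocsPerRun(1000, func() { _, _ = encodeStruct(1, "widget", tags) })

	fmt.Printf("map: %.0f allocs/call, struct: %.0f allocs/call\n", mapAllocs, structAllocs)
}
```

The struct version should report fewer allocations per call, though the exact counts depend on the Go version and on the shape of the response, so don't expect this toy to reproduce the two-thirds figure.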
Two things I took away. First, profile under load that resembles production, not in isolation, because the interesting failures are emergent. Second, when a Go service feels slow, check allocations before you optimise CPU. Half the time the CPU is busy because the allocator is busy, and the real fix is making less garbage, not running the same garbage faster.