We had a Go service that was, by every average metric, perfectly healthy. CPU comfortable, memory flat, mean latency a few milliseconds. And yet the p99 would lurch up to half a second under real traffic, in a way that no amount of staring at Grafana explained. Averages are liars, and this one had been lying to us for weeks. This is the story of finding out why, mostly so I remember the order to do it in next time.
Reproduce it before you profile it
The first mistake I nearly made was profiling production directly. Don't, if you can help it. I spent an hour with vegeta building a load test that actually reproduced the p99 spike on a staging box. That hour bought me a repeatable signal I could profile, change, and re-measure without waiting for production traffic to misbehave. If you can't reproduce the symptom, you're not debugging, you're guessing with extra steps.
echo "GET http://localhost:8080/api/lookup" | \
vegeta attack -duration=60s -rate=500 | \
vegeta report
Once the staging box showed the same fat tail, I had something to work with.
pprof, in the order that actually helps
Go's net/http/pprof is the gift that makes this bearable. Wire it in, hit the endpoints under load, and look. My order:
import _ "net/http/pprof"
go tool pprof http://localhost:8080/debug/pprof/profile?seconds=30
go tool pprof http://localhost:8080/debug/pprof/heap
The CPU profile under load was the first surprise. A startling fraction of time was in the garbage collector and in runtime.mallocgc. That's the fingerprint of allocation pressure: you're not slow because of your logic, you're slow because you're making so much short-lived garbage that the GC is constantly interrupting you. The mean hides it because most requests dodge the collection; the unlucky few that land mid-GC eat the whole pause, and there's your p99.
The allocation, found and fixed
The heap profile pointed straight at a hot path that was allocating a fresh byte buffer per request to assemble a response. Hundreds of times a second, all of them garbage a millisecond later. The fix is the boring, correct one: a sync.Pool of buffers.
var bufPool = sync.Pool{
New: func() interface{} { return new(bytes.Buffer) },
}
func handle(w http.ResponseWriter, r *http.Request) {
buf := bufPool.Get().(*bytes.Buffer)
buf.Reset()
defer bufPool.Put(buf)
// build the response into buf
}
That alone took a chunk off the GC time and pulled the tail in noticeably. But not all the way, which is where it got interesting.
A word of caution on pools, because I've seen them go wrong. A sync.Pool is not a cache and it doesn't guarantee anything: the runtime can clear it whenever it likes, and it does so on GC. That's a feature, not a bug, but it means a pool is only worth it when the thing you're pooling is genuinely expensive to allocate and you're allocating it on a hot path. For a per-request buffer at hundreds of requests a second, yes. For an object you make twice an hour, you're adding complexity for a saving you'll never measure. And you must Reset() what you take out, because the previous tenant's bytes are still in there. I learned that the polite way, with a test, rather than the impolite way, in production.
I also went back and turned on allocation tracking properly with -benchmem on the relevant benchmarks, so the saving was a number rather than a vibe:
go test -run=XXX -bench=BenchmarkLookup -benchmem ./...
Seeing allocs/op drop from a few dozen to nearly nothing is the kind of feedback that tells you the change was real and not just a placebo you talked yourself into.
The lock I didn't know I had
With the allocation noise gone, a second shape appeared in the profile that the garbage had been hiding: contention on a single sync.Mutex around a shared cache. The block profile made it obvious.
go tool pprof http://localhost:8080/debug/pprof/block
Every request was queueing on one lock to read a map that almost never changed. Under 500 requests a second, that queue was the tail. Reads massively outnumbered writes, so the fix was an sync.RWMutex, letting readers run concurrently and only serialising the rare write. The p99 dropped from roughly 480ms to about 35ms, and stayed there when I pushed the rate higher.
The reason this hid so well is worth dwelling on. Lock contention costs you nothing on average until you cross a throughput threshold, and then it costs you everything, very suddenly. Below the knee, the lock is almost always free when you ask for it. Above it, requests start arriving while it's held, and they queue, and the queue is your tail. The mean barely moves because most requests still get the lock instantly; it's only the ones that arrive at a bad moment that wait, and those are precisely the requests that define your p99. This is the same shape as a motorway that's fine at ninety percent capacity and gridlocked at ninety-five. Contention doesn't degrade gracefully. It falls off a cliff.
I'll admit I reached for the RWMutex slightly nervously, because a read-write lock isn't a free win either. It has more overhead per operation than a plain mutex, so if writes were common, or if the critical section were tiny, it could have been slower, not faster. Here the read-to-write ratio was enormous and the section was a map lookup, so it was the right tool. But "use an RWMutex" is not a reflex to trust blindly; it's a trade you should be able to justify with the actual ratio in front of you. If the writes had been frequent I'd have looked at sharding the map or a copy-on-write swap instead.
What I'm taking away
Three things worth remembering. Reproduce the symptom under load before you touch a profiler, or you'll optimise the wrong thing beautifully. Fix the loudest cost first, because it hides the next one: the lock contention was invisible until the allocations stopped drowning it out. And never trust a mean. The whole problem lived in the tail, and the only honest way to see a tail is to go and measure it, request by request, with a tool that doesn't average your pain away.
The service is boring again now, which is exactly what I wanted. p99 boring is the best kind.