Benchmarks lie, or rather they tell the truth about a workload that is not yours. The only profile I trust is one taken from the running service while real traffic is hitting it, and Go makes that genuinely pleasant, which is one of the quiet reasons I keep reaching for it.
The setup is two pieces. Import the pprof handlers for their side effect and make sure something is serving HTTP:
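import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)
go func() {
	log.Println(http.ListenAndServe("localhost:6060", nil)) // nil handler means DefaultServeMux
}()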
Bind it to localhost, not the world. The pprof endpoints will happily dump your goroutine stacks to anyone who asks, and that is not a thing you want exposed.
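One thing worth knowing here: the pprof package registers its handlers on http.DefaultServeMux from its init function, so if your public server also serves the default mux, the endpoints ride along on the public port no matter where the debug listener is bound. A minimal sketch of one way around that, assuming your public server uses its own mux, is to give the debug listener a dedicated mux as well. The names newDebugMux and patternFor are mine, purely for illustration.

```go
package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

// newDebugMux returns a mux that routes only the pprof endpoints.
// Note: importing net/http/pprof still registers the same handlers on
// http.DefaultServeMux, so the public server must not serve the default mux.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Index also serves the named profiles: allocs, heap, goroutine, block, ...
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// patternFor reports which registered pattern would serve path, or ""
// when nothing matches. Handy for checking what a mux does and does not route.
func patternFor(mux *http.ServeMux, path string) string {
	req, err := http.NewRequest(http.MethodGet, path, nil)
	if err != nil {
		return ""
	}
	_, pattern := mux.Handler(req)
	return pattern
}

func main() {
	go func() {
		// Debug listener: localhost only, pprof routes only.
		log.Println(http.ListenAndServe("localhost:6060", newDebugMux()))
	}()
	// The real service would start here on its own mux; select{} stands in for it.
	select {}
}
```

The pprof commands in the rest of this post work unchanged against this listener, since the paths are the same.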
With that running, you pull a thirty-second CPU profile straight off the live process while it is under load:
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
That opens the interactive viewer with a flamegraph and a call graph. Thirty seconds during a real traffic peak tells you more than any synthetic benchmark, because it captures the actual mix of request shapes, the cache pressure, and the contention you only get when everything is happening at once.
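If you would rather keep the profile than look at it once, the same endpoint can be fetched over plain HTTP and written to disk. What follows is a rough sketch under a few assumptions: the listener is at localhost:6060 as above, and the helper names profileURL and fetchProfile and the file name cpu.pprof are mine. The one detail that matters is that the client timeout has to be longer than the profiling window, because the server does not respond until the window has elapsed.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// profileURL builds the CPU profile URL for a pprof listener at base.
func profileURL(base string, seconds int) string {
	return fmt.Sprintf("%s/debug/pprof/profile?seconds=%d", base, seconds)
}

// fetchProfile downloads whatever url returns and writes it to path.
func fetchProfile(url, path string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("pprof endpoint returned %s", resp.Status)
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	const seconds = 30
	url := profileURL("http://localhost:6060", seconds)
	// Allow the full profiling window plus some slack for the transfer.
	timeout := time.Duration(seconds+10) * time.Second
	if err := fetchProfile(url, "cpu.pprof", timeout); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote cpu.pprof; open it with: go tool pprof -http=:8080 cpu.pprof")
}
```

A saved file from the peak is also something you can hand to a colleague, which a browser tab is not.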
The CPU profile is the obvious one, but the allocation profile is where I usually find the real wins in Go:
go tool pprof http://localhost:6060/debug/pprof/allocs
Type top for the worst offenders, then list SomeFunc to see the allocation pinned to a specific line. In Go the GC cost is downstream of allocation rate, so a function that allocates a lot in the hot path shows up as CPU somewhere else entirely, in the garbage collector. The allocation profile is what connects the two. Chasing GC CPU without it is hopeless.
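To make that concrete, here is a toy version of the kind of thing list tends to point at. It is an illustration, not a measurement from any real service: joinNaive and joinBuilder are hypothetical functions, and I am using testing.AllocsPerRun outside a test purely to count allocations in a small program.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink is package-level so the compiler cannot discard the results.
var sink string

// joinNaive is a hypothetical hot-path function: every += builds a new string.
func joinNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p + ","
	}
	return s
}

// joinBuilder produces the same output but sizes one buffer up front.
func joinBuilder(parts []string) string {
	n := 0
	for _, p := range parts {
		n += len(p) + 1
	}
	var b strings.Builder
	b.Grow(n)
	for _, p := range parts {
		b.WriteString(p)
		b.WriteByte(',')
	}
	return b.String()
}

// allocsPerCall reports average heap allocations per call for both versions.
func allocsPerCall(parts []string) (naive, builder float64) {
	naive = testing.AllocsPerRun(100, func() { sink = joinNaive(parts) })
	builder = testing.AllocsPerRun(100, func() { sink = joinBuilder(parts) })
	return naive, builder
}

func main() {
	parts := make([]string, 64)
	for i := range parts {
		parts[i] = "segment"
	}
	naive, builder := allocsPerCall(parts)
	fmt.Printf("naive: %.0f allocs/call, builder: %.0f allocs/call\n", naive, builder)
}
```

In a CPU profile the naive version looks cheap, because most of its cost is billed to the collector. In the allocation profile it is the line at the top.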
Two more endpoints earn their keep. The goroutine profile, at /debug/pprof/goroutine, is the first thing I pull when a service is wedged rather than slow: it dumps every goroutine and its stack, and a leak shows up as thousands of them all parked on the same channel send or mutex. And the block profile, at /debug/pprof/block, which records nothing until you enable it explicitly with runtime.SetBlockProfileRate, tells you where goroutines are waiting on synchronisation. CPU profiling shows you where time is spent running; the block profile shows you where time is spent not running, which for a contended service is frequently the larger number.
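Here is a small sketch of turning the block profile on and confirming it is recording. The contention is manufactured and the helpers contend and blockedStacks are hypothetical. A rate of 1 records every blocking event, which is fine for a demo but is my choice for illustration rather than a recommendation; a busy service would likely want a much coarser rate.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contend makes n goroutines fight over one mutex, each holding it briefly.
func contend(n int) {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			time.Sleep(2 * time.Millisecond) // hold the lock so the others must wait
			mu.Unlock()
		}()
	}
	wg.Wait()
}

// blockedStacks enables block profiling, runs work, and returns how many
// distinct blocking stacks the runtime recorded by the time work returned.
func blockedStacks(work func()) int {
	runtime.SetBlockProfileRate(1)       // 1 = record every blocking event (demo setting)
	defer runtime.SetBlockProfileRate(0) // 0 = turn block profiling back off
	work()
	return pprof.Lookup("block").Count()
}

func main() {
	n := blockedStacks(func() { contend(8) })
	fmt.Printf("block profile holds %d distinct stacks\n", n)
}
```

In a real service you would call runtime.SetBlockProfileRate once at startup and read the result from /debug/pprof/block instead of pprof.Lookup.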
The mistake I see most is people profiling a load test instead of production. A load test exercises the path you thought to test, evenly, with warm caches and no surprises. Real traffic is lumpy and weird, and the hot path under real traffic is frequently not the one you would have benchmarked. Take the profile from the thing that is actually serving users. It costs almost nothing to leave the endpoint running behind localhost, and the day you need it you will be very glad it is already there rather than something you have to deploy in a panic.