A Go service of ours was using more CPU under load than felt right, and the team's collective guess was the database layer. I have learned to distrust the collective guess. CPU is one of the few things you do not have to reason about, because Go ships a profiler that will simply tell you. So I stopped guessing and turned it on.
Wiring up profiling is almost insultingly easy. Import the pprof handlers and they hang themselves off your HTTP mux:
import _ "net/http/pprof"
That gives you /debug/pprof/ on whatever server you already run. The one rule: do not expose it on a public listener. I bind it to an internal admin port, or behind the firewall, and never the front door.
The other half is load. A profile of an idle service tells you nothing useful, so I ran a load generator (hey, in this case) against a representative endpoint at a rate close to what hurts in production, then grabbed a 30-second CPU profile while it ran:
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
The flame graph (web in the pprof prompt, if you have graphviz) was not subtle. The database was a sliver. The two fat columns were encoding/json marshalling our responses, and, embarrassingly, our logging. We were JSON-encoding a structured log line on every single request at info level, including a couple of fields that themselves serialised a nested struct. The logger was costing us nearly as much as the actual work.
The allocation profile told the same story from another angle. go tool pprof on the heap endpoint, sorted by alloc_space, showed the request handler churning short-lived buffers for both the response JSON and those log lines. High allocation rate means frequent GC, and frequent GC means CPU you are spending on bookkeeping rather than answering requests.
Two changes did most of the work. We dropped the per-request structured log to debug level and kept only a thin info line, which removed a serialisation from the hot path entirely. And we gave the JSON encoder a reused buffer from a sync.Pool instead of letting it allocate fresh each time. Neither touched the business logic the team had been so sure was the problem.
The result was a little over a third less CPU at the same request rate, and noticeably calmer GC pauses in the trace. The lesson is the boring one I keep relearning: measure before you optimise, because the bottleneck is almost never where the team agrees it is. Profiling under real load took an afternoon and saved us from rewriting a database layer that was, it turns out, perfectly fine.