Where the time actually went in a slow Go service

A service that was comfortable at light traffic started chewing CPU and missing its latency targets once real load arrived. Everyone had a theory. The theories were all plausible and, as usual, all wrong. So instead of guessing, I turned on the profiler that ships with Go and let it tell me.

If you import net/http/pprof, you get a set of debug endpoints for free. The CPU profile is the one I reach for first:

import _ "net/http/pprof"

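That blank import registers the handlers on the default mux, so they are only reachable if the process actually serves http.DefaultServeMux on some port (localhost:6060 in the command below). Then, with the service under representative load, you grab thirty seconds of CPU profile and open it interactively: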

go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

Figure: a flame graph with one wide bar dominating the profile.

The collective guess had been database contention. The flame graph said otherwise. The widest bar, by a distance, was JSON encoding. The service serialised the same large config object into every response, and encoding/json was burning the largest share of the CPU doing reflection over that struct on every single request.

The top view made it unambiguous:

(pprof) top
    flat   flat%    sum%
    8.4s   41.2%   41.2%  encoding/json.(*encodeState).marshal
    2.1s   10.3%   51.5%  reflect.Value.Field
    1.3s    6.4%   57.9%  runtime.mallocgc

Forty per cent of CPU re-encoding a payload that almost never changed. The database was fine. It had always been fine.

The fix was nearly embarrassing in its simplicity. The config object changed at most a few times a day, so I encoded it once when it changed and cached the resulting bytes, rather than re-marshalling identical data on every request. The hot path went from "reflect over a struct and allocate" to "write a byte slice we already have". CPU under the same load dropped by roughly a third, and the latency tail came back inside target.
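The shape of that fix can be sketched as follows. This is not the service's actual code: the Config fields, the configCache type, the /config route and the port are all hypothetical stand-ins, and the sketch assumes Go 1.19 or later for atomic.Pointer.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync/atomic"
)

// Config is a hypothetical stand-in for the large config object.
type Config struct {
	Region   string          `json:"region"`
	Features map[string]bool `json:"features"`
}

// configCache holds the most recent JSON encoding of the config.
type configCache struct {
	encoded atomic.Pointer[[]byte]
}

// Set marshals cfg once and publishes the bytes. It is meant to be called
// only when the config changes, not per request.
func (c *configCache) Set(cfg Config) error {
	b, err := json.Marshal(cfg)
	if err != nil {
		return err
	}
	c.encoded.Store(&b)
	return nil
}

// Bytes returns the cached encoding, or nil if Set has never succeeded.
// Callers must treat the slice as read-only.
func (c *configCache) Bytes() []byte {
	if p := c.encoded.Load(); p != nil {
		return *p
	}
	return nil
}

// ServeHTTP is the hot path: no reflection, just a write of existing bytes.
func (c *configCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	b := c.Bytes()
	if b == nil {
		http.Error(w, "config not loaded", http.StatusServiceUnavailable)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(b)
}

func main() {
	cache := &configCache{}
	cfg := Config{Region: "eu-west-1", Features: map[string]bool{"beta": true}}
	if err := cache.Set(cfg); err != nil {
		log.Fatal(err)
	}
	http.Handle("/config", cache)
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
```

The atomic pointer swap means readers never block while a new encoding is published. A mutex around the slice would work just as well at a few updates a day.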

Two things stuck with me. First, nobody guessed JSON, because it's invisible: it's not code you wrote, it's a library call you take for granted, so it doesn't feature in anyone's mental model of where time goes. The profiler doesn't share your blind spots. Second, the profile has to be taken under load. At idle the encoding cost is trivial and you'd never see it. The whole point is to measure the system in the state where it actually hurts. Guess less, profile more, and take the profile when it's busy.