the flamegraph that pointed at the wrong thing, then the right one

A flamegraph rendered wide and orange across a dark terminal

We had a service that did almost nothing and did it expensively. It took requests, looked something up, and returned a small JSON object. No database round trip on the hot path, everything cached in memory, and yet under load it would happily eat a whole core per few thousand requests a second. That ratio felt wrong. A lookup and a serialise should be cheap. Something was spending CPU it had not earned, and the metrics we already had were no help because they were all about the things we thought might be slow.

So I did the thing I should do first more often than I do. I stopped guessing and took a profile.

Stop reading the code, start reading the stacks

The temptation, when something is slow, is to reread the code and reason about it. I find I am bad at this. I reliably "know" where the time is going, and I am reliably wrong, because the code I wrote on purpose is the code I stare at, and the time is almost never there. It is in the code I did not think about at all.

A flamegraph fixes that, because it does not care what I think. It samples the stack and shows you, by width, where the CPU actually sat. Wide means hot. You read it across, not down.

On Linux the recipe has barely changed in years, and that is a compliment:

perf record -F 99 -g -p $(pgrep myservice) -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg

Thirty seconds of sampling at 99 Hz under representative load, collapse the stacks, render the SVG. Open it in a browser, hover, and start reading from the widest blocks downward.

A close-up of a collapsed stack and the perf invocation that produced it

The first answer was a trap

The widest tower, at first glance, was JSON serialisation. About a quarter of the time, sitting in the encoder. And I nearly believed it, because it fit a story: small object, but lots of them, encoding adds up, swap the encoder for a faster one and move on.

That would have been a day spent shaving a real but minor cost while the actual problem carried on. Because when I looked at what was calling the encoder, it was not the response path. It was the logging.

We logged every request at info level, and the log line included a structured copy of the response object. Serialised. Every time. So for each request we encoded the payload once to send it and a second time to write it into a log nobody read at volume, and the second encode was wrapped in a logging framework that did its own allocation and formatting on top. The serialiser was not the villain. It was an accomplice, called twice, once entirely pointlessly.

The hot path I never suspected

The genuine surprise was further down, half-hidden under the logging tower: a call to gather caller context for the log record. The logging library, to produce a nice contextual line, walked the stack on every emitted record to find out where the log call came from. Stack walking is not free. Doing it thousands of times a second, to decorate a line that was then immediately serialised and written to a file that rotated hourly and was grepped approximately never, was costing us real money in CPU.

None of this showed up as "slow" anywhere, because none of it was on the path I had instrumented. It was overhead, smeared evenly across every request, exactly the kind of thing a flamegraph is good at and dashboards are bad at.

The fix was almost embarrassingly small:

- log.Info("handled request", "response", resp)   // serialises resp, walks stack
+ log.Debug("handled request", "id", req.ID)       // off in prod, cheap fields only

Drop the response object out of the log entirely. Log an identifier, not the payload. Move the line to debug so it is compiled-in but inert in production. Configure the logger not to capture caller information unless explicitly asked.

The result was not subtle. CPU per request fell by roughly a third, the per-core throughput went up to match, and the flamegraph taken afterwards was boringly flat: the serialiser was back to a thin sliver doing its one honest job, and the logging tower had simply gone.

A word on reading them honestly

There is a discipline to reading a flamegraph that took me a while to learn, and it is mostly about resisting the story your brain wants to tell. The eye is drawn to the tallest towers, because they look dramatic, but height is just stack depth: a deeply nested call that does nothing costs nothing. Width is the only thing that means CPU. So you read across the bottom, find the widest plateaus, and only then climb up to see what is sitting on top of them.

The second discipline is to keep asking "who called this" rather than "what is this." A hot leaf function is a fact. The reason it is hot is a story, and the story lives in its callers. Our serialiser was genuinely hot; that was a fact and the graph was not lying. But the reason was two levels up, in a logging call I would never have suspected, and if I had stopped at the fact I would have optimised the wrong thing with great confidence. The graph rewards you for following the call chain up until it stops making sense, because that is usually where the accidental cost lives.

It also helps to take more than one profile. A single thirty-second sample under one load pattern can mislead you, because some costs only appear under certain shapes of traffic. I took one under steady load and another during a burst, and the logging tower was wider in the burst, which was the hint that the cost scaled with request rate rather than being fixed startup overhead. Two cheap captures told me more than one would have, and both together were still cheaper than an hour of reasoning.

What I took from it

A few things, none of them new, all of them ones I forget between incidents.

The first wide block on a flamegraph is a suspect, not a verdict. Always look at who is calling it. The cost was in serialisation; the cause was a log line.
Logging is code, and on a hot path it is hot code. Anything that serialises an object, walks a stack, or allocates per record will show up under load, and "we never read those logs" makes it worse, not better, because you are paying full price for nothing.
Profile before you optimise, every time. I had a confident, plausible, wrong theory ready to go. Thirty seconds of perf was cheaper than the day I would have spent being wrong with conviction.

The flamegraph did not tell me the answer. It told me where to stop guessing, which was the part I was getting wrong on my own.