Ramblings of an aging IT geek
← Ramblings of an aging IT geek
performance

the flamegraph that pointed at logging, of all things

A flamegraph of a service I thought I understood put a third of the CPU in structured logging, and the fix was less code rather than faster code.

A performance flamegraph on a server monitor

I had a service that was using more CPU than its job warranted, and I had a confident story about why. It did a lot of JSON serialisation on the response path, so obviously the serialiser was the problem. I had even started sketching a faster encoder in my head. Then I actually profiled it, and the flamegraph put a third of the on-CPU time somewhere I would never have guessed: logging.

This is the recurring lesson of profiling, and I keep having to relearn it. Your intuition about where the time goes is a hypothesis, not data, and it is wrong more often than it is right. The whole point of a flamegraph is to argue you out of your assumptions.

getting the picture

The service was a Linux daemon, so I used perf and Brendan Gregg's flamegraph scripts. The recipe has not really changed in years:

perf record -F 99 -g -p $(pgrep myservice) -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg

-F 99 samples at 99 Hz, deliberately off a round number so you do not phase-lock with any periodic activity in the program. Thirty seconds of a service under real load is plenty. Then you open the SVG in a browser and read it: width is time on CPU, the y-axis is stack depth, and you are hunting for plateaus that are wider than the story in your head says they should be.

Source code on a screen

the wide tower i did not expect

The JSON encoder was there, sure, maybe 12% of samples. Annoying but not the headline. The headline was a fat tower spending around 30% of CPU inside the logging library, and underneath it, formatting and allocating.

The cause, once I looked, was obvious and slightly humiliating. We logged a structured line per request at info level, and every one of those lines built a context object: a map of a dozen fields, several of them produced by calling functions that themselves allocated. The clever part was that the logger was lazy, it would not format the line if the level was disabled. But we ran at info in production, so it always formatted, and the "lazy" path was never taken. We were paying full price for a feature designed to let us pay nothing.

The allocations were the real damage. Every request minted a fresh map and a fistful of short-lived strings, which the garbage collector then had to chase down. The serialiser was honest work. The logging was pure overhead dressed up as observability.

the fix was deletion

The temptation was to make the logging faster: a pool for the field maps, a cheaper formatter, the usual. I did none of that. I asked the question that profiling should always make you ask, which is not "how do I make this faster" but "why am I doing this at all".

We did not need a structured info line for every single successful request. We needed it for errors, for a sampled fraction of successes, and that was it. So info-level per-request logging went to a 1-in-100 sample, errors stayed at full fidelity, and the per-request context object was only built when we had actually decided to emit the line.

before: log every request at info  -> always allocate, always format
after:  sample 1/100 + all errors  -> allocate only when we log

CPU on the hot path dropped by roughly a quarter, and the allocation rate fell enough that GC pauses became a non-event. The JSON encoder I had been so keen to rewrite I left completely alone, because at 12% it simply was not worth the risk of a bug for the gain on offer.

what i keep relearning

Profile before you optimise, every time, no exceptions, because the thing you are sure about is the thing the flamegraph exists to disprove. And once you have found the hot path, the best optimisation is frequently to do less work rather than to do the same work faster. Deleting the per-request log line was a smaller, safer change than any encoder rewrite, and it bought more. The fastest code is the code that does not run.