the flamegraph that pointed at the one line i'd never have guessed

A flamegraph rendered as coloured horizontal bars over a server background

We had a service that was using more CPU than its job warranted. Nothing was on fire, latency was fine, but a box that should have been coasting at fifteen percent was sitting nearer fifty under steady load, and that meant we were paying for more instances than the work justified. I had a theory about why, as one does, and my theory was wrong, as it usually is. The flamegraph told me so in about thirty seconds, which is the entire point of flamegraphs.

the assumption

The service does request fan-out: it takes an incoming request, splits it into several backend calls, merges the responses, and returns a combined result. The merge step does some set intersection and dedup work, and in my head that was obviously the expensive bit. It's the only part with anything resembling an algorithm in it. Everything else is plumbing. So my plan, before I measured anything, was to go and optimise the merge.

I'm glad I measured first, because I'd have spent a day making the merge faster and moved the needle not at all.

getting the picture

I grabbed a profile under load. The toolchain is the usual: sample the running process with perf, fold the stacks, and render with Brendan Gregg's flamegraph.pl. The recipe hasn't really changed in years and it still works beautifully.

perf record -F 99 -p $(pgrep myservice) -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > out.svg

-F 99 samples at 99 Hz, deliberately not a round 100 to avoid lockstep with anything periodic in the system, and thirty seconds under real traffic is plenty to get a stable shape. Then you open the SVG in a browser and read it. The width of a box is how much time was spent in that frame and everything above it. You're hunting for wide plateaus, the frames that are surprisingly fat.

A flamegraph SVG open in a browser with one wide plateau highlighted

what it actually said

The merge was a sliver. You could barely see it. The wide plateau, nearly a third of the whole graph, was logging. Specifically, string formatting inside log calls.

We logged a structured line for every backend call: which backend, the latency, a request id, a few tags. Reasonable on the face of it. The problem was that the formatting happened unconditionally, before the log level was checked. We were running in production at info level, and a big chunk of these were debug lines, so the formatted string was built, the arguments stringified, a fresh allocation made for each one, and then the logging framework looked at the level and threw the whole thing away. We were doing all the expensive work to produce log lines that were never written.

Once the flamegraph put it right in front of me it was obvious, and a little embarrassing, because it's a classic and I know it's a classic. The fix is to guard the expensive formatting behind a level check, or to use a logging API that takes the arguments lazily and only formats if the line will actually be emitted. We did both: moved the genuinely hot debug lines behind an explicit check, and audited the rest for accidental eager evaluation.

# before: format always runs, result often discarded
log.debug("backend %s took %dms for %s", name, ms, format_tags(tags))

# after: nothing happens unless debug is on
if log.debug_enabled():
    log.debug("backend %s took %dms for %s", name, ms, format_tags(tags))

format_tags was the real villain. It walked a map and built a string on every call, and it was being called for log lines that went straight in the bin.

the result, and the lesson

CPU on a representative box dropped from the high forties to the low twenties under the same load. We took two instances out of the pool and nothing noticed. That's the whole reason the exercise pays for itself: profiling is cheap and instances are not.

A monitoring graph showing CPU usage stepping down after a deploy

The lesson is the one I keep relearning and apparently need to keep relearning. Do not optimise what you assume is slow. Measure, then optimise what is actually slow, and accept that those two things are different more often than your pride would like. My instinct sent me at the merge, the one place with a clever algorithm, because clever algorithms feel like where the cost should be. The actual cost was in the dull plumbing, in work we were doing and then deliberately discarding, and no amount of staring at the merge code would ever have shown me that.

The other small lesson: a flamegraph is one of the highest-leverage half-hours in this whole job. Thirty seconds of sampling, a couple of perl scripts that have existed for years, an SVG you read by eye, and it routinely overturns a theory I'd have otherwise spent a day chasing. I run one now before I let myself touch a single line in the name of performance. It has saved me from my own confidence more times than I'd care to count.