the hot path was in the logging

A flamegraph rendered over a rack of servers

We had a request handler that was slow. Not catastrophically slow, just slow enough that the p99 sat at about 340ms when it had no business being above 80ms, and slow enough that someone in another team had started to notice. The annoying part is that I had a theory, and the theory was wrong, and I spent the better part of a day defending it before I bothered to actually measure.

The theory was the database. It is always the database. You profile in your head, you remember the join that looked dodgy in code review, and you go and stare at that. I added some timing around the query, the query came back in 6ms, and I stared at it some more as if the numbers might change if I willed it hard enough.

Stop guessing, start sampling

The thing I should have reached for first is the thing I reach for last, every time: a sampling profiler and a flamegraph. We were on a fairly standard setup, a long-running service, plenty of CPU to spare, and Brendan Gregg's perf-to-flamegraph toolchain has been sitting in my notes for years precisely for moments like this.

The recipe is unglamorous and I keep it in a Makefile so I never have to remember it:

perf record -F 99 -p $(pidof myservice) -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > out.svg

99 hertz for thirty seconds, collapse the stacks, render the SVG. Open it in a browser, and the width of each box is the proportion of samples that landed in that function or its children. You are not looking for the deepest stack, you are looking for the widest box at a level you didn't expect.

A close-up of source code on a terminal

The box that shouldn't have been there

So there it was. A fat, smug plateau taking up roughly a third of the on-CPU time, and it was not the database, and it was not the handler logic. It was string formatting. Specifically, it was the construction of a debug log line that ran on every single request, built a structured message with about a dozen interpolated fields, serialised a couple of them to JSON for good measure, and then handed the result to a logger whose level was set to INFO.

The line never went anywhere. The log call was guarded, in the sense that the logger would discard it, but the guard was inside the logging library, after we had already done all the expensive work of building the message. We were paying the full cost of a log line and then throwing it on the floor, a few thousand times a second.

This is an old mistake and I knew about it in the abstract. The fix in most logging libraries is to check whether the level is enabled before you build the message, so the expensive bit only runs when something will actually consume it:

if log.DebugEnabled() {
    log.Debug(expensiveStructuredMessage(req, ctx, state))
}

Less elegant, but it means the formatting only happens when you have asked for debug output. I wrapped the three worst offenders this way, redeployed, and the p99 dropped to 71ms. The query I had spent the morning suspecting was, of course, entirely innocent.

It's worth dwelling for a second on why the original code looked so reasonable. Whoever wrote it had done everything you're meant to do. They used the logging library properly, they emitted structured fields rather than a soup of interpolated strings, and they set the production level to INFO so the debug noise wouldn't fill the disks. Every individual decision was correct. The bug lived in the gap between them, in the unexamined assumption that a log call which produces no output also does no work. Nobody writes that assumption down, and nobody questions it, because it's true in plenty of languages and plenty of libraries. It just happened not to be true in ours, in that hot path, at that level.

What I actually took away

A few things stuck.

The first is that I am perfectly capable of profiling in my imagination and being confidently, expensively wrong. The flamegraph took fifteen minutes to produce and it pointed straight at the answer. The morning of squinting at SQL produced nothing but a slightly bruised ego.

The second is that "this log line is disabled so it's free" is a lie unless the construction of the line is also disabled. Discarding the output is cheap. Building the input is not, and the building is the part that runs whether or not anyone is listening.

The third, and this is the one I'll keep, is that the cost was invisible in every metric we already had. No single field was slow. The query was fine, the handler was fine, the serialisation of the actual response was fine. The waste was smeared evenly across thousands of requests in a function nobody thought to look at because it wasn't doing anything important. That is exactly the shape of problem a flamegraph is good at, the death by a thousand cheap cuts that no single timer will ever catch.

This is also why I've come to distrust manual instrumentation as a first move. Timing a query, or a handler, or a function you already suspect, only tells you about the thing you chose to measure. It is a flashlight, and a flashlight is wonderful once you know which corner to point it at. The trouble is that the expensive corner is, almost by definition, the one you didn't think to suspect, because if you'd suspected it you'd have fixed it already. A sampling profiler doesn't share your prejudices. It simply reports where the CPU spent its time, and it has no opinion about which functions are supposed to be important. Half the value of a flamegraph is that it ignores your theory entirely.

There's a slightly humbling corollary, which is that the wider a box is, the more embarrassing it tends to be. Genuine algorithmic disasters, the accidental quadratic loops and the unbounded retries, do show up, and they're satisfying to find because at least the problem was interesting. But far more often the fat box is something dull and self-inflicted: a log line nobody reads, a serialiser called twice, a defensive deepcopy that guards against a mutation that can't happen anymore. The profiler is rarely flattering. It mostly catches you being slightly wasteful in a place you'd stopped looking.

I've now got the perf record line and the stackcollapse pipeline saved as a make flame target on every box I care about. Next time something is slow and I feel a theory coming on, the plan is to record thirty seconds first and have the theory second. We will see if I manage it.