I was convinced I knew where the time was going. The service had a chunky bit of parsing in the request path, full of nested loops, and it looked guilty. I had already half-written the optimisation in my head before I bothered to actually measure it. Then I ran a flamegraph and the parsing barely registered.
The tall, wide tower in the middle of the graph was a logging call. Specifically, a debug log line that formatted a struct on every request, building a string we then immediately threw away because the log level was set above debug in production. We were paying the full cost of rendering the message and none of the benefit of seeing it.
perf record -F 99 -g -p $(pidof myservice) -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg
The fix was embarrassingly small: gate the expensive formatting behind the log level check so the string is never built unless something is going to read it. About a third of the CPU on the hot path evaporated. The parsing I had been so sure about? Left it exactly as it was.
This happens to me often enough that I have stopped trusting my instincts about performance entirely. The slow thing is almost never the thing that looks slow. Measure first, then be surprised, then fix the actual problem.