A service had got slow. Not fall-over slow, just slow enough that requests which used to take 40ms were now taking 200ms, and nobody could point at a change that explained it. The usual suspects were lined up: the database, the JSON marshalling, the external call we all quietly resent. I was certain it was the database.
It was not the database. It was the logs.
Stop guessing, profile
I had spent a frankly embarrassing amount of time reading code and reasoning about where the time "must" be going. The lesson I relearn roughly once a year is that I am bad at this and the profiler is good at it. So I gave up theorising and grabbed a profile under load.
A flamegraph is the right tool for exactly this. Width is time. You scan along the top for the widest plateau and that is where your CPU actually went, regardless of what you assumed. No narrative, no sympathy for your hypothesis, just the numbers.
The widest bar was log formatting
The widest plateau, by a distance, was string formatting inside our logging. We logged a structured line on every request, and somewhere along the way that line had grown to interpolate a handful of values that were themselves expensive to compute. Worse, we computed all of them eagerly even when the log level meant the line was thrown away.
So on the hot path, for every single request, we were building a debug string that nobody would ever read. At 40ms a request that cost was invisible. At the traffic we had grown into, it was most of the latency.
The fix was dull, which is how you know it was the right fix. Move the expensive interpolation behind the log level check so it only happens when the line is actually emitted, and drop two fields nobody had looked at in a year. Latency went back to where it belonged. No clever algorithm, no rewrite, just stopping the program from doing work it then discarded.
The thing I keep taking away from these is not the specific bug, it is the order of operations. Profile first. Your intuition about where the time goes is a hypothesis, and the flamegraph is the only honest test of it. I would have happily spent two days optimising database queries that were never the problem.