the flamegraph that pointed at the one function i'd never have profiled

A performance graph on a server monitor

A service that should have been comfortably idle was sitting at a steady, baffling forty percent CPU under load it had no business struggling with. The hot path turned out to be log timestamp formatting, of all things, and I'd never have found it by reading the code, because the code that was killing us looked completely innocent. This is a short hymn to flamegraphs, which found the answer in about ten minutes after I'd spent two days guessing wrong.

The setup: a request-handling service, nothing exotic, doing a bit of database work and a bit of business logic per request. Throughput was fine, latency was fine, but CPU usage was roughly double what I'd expect for the work involved, and that gap had been quietly there for months. Nobody had chased it because it wasn't on fire. It was just wasteful, and waste compounds: more boxes than necessary, less headroom for spikes, a higher bill for nothing. I finally got annoyed enough to look.

guessing, the slow way

My first instinct, like everyone's, was to guess. I "knew" where the time went. There was a query in the hot path that joined a couple of tables and I'd always been slightly suspicious of it. There was some JSON serialisation that felt heavy. There was a regex in the request validation that I'd written in a hurry and never loved. I spent the better part of two days optimising those, in order of suspicion, and made the service measurably worse at one point by caching something that didn't need caching. The CPU graph did not move. Not a flicker.

This is the classic failure and I walked straight into it. I optimised what I could imagine being slow, which is a completely different set from what is actually slow. The two overlap less often than your ego would like.

There's a particular trap in being experienced, too. After enough years you build up a stock of plausible-sounding intuitions about performance, and those intuitions are right often enough to feel trustworthy and wrong often enough to waste days. "Joins are slow." "Serialisation is expensive." "Regexes are a smell." Each of those is true sometimes, which is exactly what makes them dangerous, because you can always find a join, some serialisation and a regex in any non-trivial codebase and pin the blame there with a clear conscience. I had all three to choose from and I chose all three, in turn, and none of them was the answer. The intuition tells you where to look; it does not tell you the truth, and I'd let it stand in for measurement entirely.

the flamegraph

So I did what I should have done on hour one. I grabbed a sampling profiler, ran it against the process under representative load for thirty seconds, and folded the output into a flamegraph. If you've not used them, the shape is simple once it clicks: the x-axis is not time, it's how much of the sampled stack each function accounts for, so a wide box means a lot of CPU spent in or below that function. You read it by scanning along the top for the widest plateaus, because those are where the CPU actually lives.

Source code with the offending log call highlighted

There was a plateau I did not expect at all. A fat, embarrassing block, a big chunk of total samples, sitting under the logging call. Not the database. Not the JSON. The logger. And when I drilled into it, almost all of that was in timestamp formatting: turning the current time into a nicely formatted string for the log line prefix.

Here's the kicker. We logged a line per request at info level, and every one of those lines formatted the full timestamp from scratch: a strftime-style call building a date string, allocating, the lot. Individually that's nothing, microseconds. But at our request rate it was being called constantly, and it dominated. The thing I'd been ignoring entirely because "it's just logging" was the single biggest consumer of CPU in the whole service. The query I'd been suspicious of for years? A thin sliver on the graph. Rounding error.

What makes it so easy to miss is that logging is invisible in the way furniture is invisible. You stop seeing it. It's not "the work", it's the stuff around the work, so it never occurs to you to count its cost. And each individual log call genuinely is cheap, so if you reason about it in isolation, which is how I'd always reasoned about it, you conclude it can't possibly matter. The flamegraph doesn't reason about anything in isolation. It just shows you the aggregate, and in aggregate a cheap thing done a few hundred thousand times a second is not cheap at all. That's the whole value of a sampled profile: it weights by frequency automatically, so the death-by-a-thousand-cuts cases that defeat human intuition show up as a single fat, undeniable box.

the fix, and the lesson

The fix was almost insultingly small. The timestamp only changes once a second for our purposes, so there's no reason to reformat it on every single log line. Cache the formatted second-resolution string, regenerate it only when the second actually ticks over, and append the sub-second part cheaply if you need it. That, plus dropping a couple of log lines that were debug noise nobody read, took the service from forty percent down to about fifteen. Same work, same throughput, a third of the CPU.

# before: format the full timestamp on every log line
log.info("%s request handled in %dms", format_now(), elapsed)

# after: reuse a per-second cached prefix
log.info("%s request handled in %dms", cached_ts(), elapsed)

The actual lesson isn't about logging, though "your logger is probably more expensive than you think" is a genuinely useful thing to carry around. The lesson is the one I keep having to relearn: do not guess at performance. I'm a reasonably experienced engineer and my mental model of where the time went was confidently, completely wrong, and I burned two days acting on it before I bothered to measure. The flamegraph didn't need me to be clever. It just showed me the truth, which is that the CPU was being spent somewhere I'd never have thought to look, on a problem I'd have sworn wasn't a problem.

Profile first. Then optimise the wide box, not the scary-looking one. The scary-looking one is usually fine, and the boring one is usually eating your lunch.