the flamegraph that pointed at the one function i'd never have guessed

A performance graph on a server monitoring screen

A service at work had got slow. Not fall-over slow, just slow enough that the p99 latency had crept up over a few months until it was uncomfortable, and nobody could point at a change that caused it. The usual suspects had all been checked. The database wasn't the bottleneck, the queries had sane plans, the network was fine. So I did the thing I should have done first and actually profiled it.

I'd been meaning to properly learn flamegraphs for ages. This was the excuse.

getting the data

The service is a long-running daemon on Linux, so the tooling is perf and Brendan Gregg's flamegraph scripts. The recipe is well worn by now but worth writing down because I always forget the exact incantation.

# sample the running process at 99 Hz for 30 seconds
perf record -F 99 -p $(pgrep -f myservice) -g -- sleep 30

# turn the recording into folded stacks
perf script | ./stackcollapse-perf.pl > out.folded

# render the svg
./flamegraph.pl out.folded > flame.svg

The 99 Hz sampling rate is deliberate, slightly off a round number so it doesn't beat against anything in the application that happens to run on a tidy interval. Thirty seconds was enough to get a representative picture under normal load. The output is an SVG you open in a browser, where the width of each box is the proportion of samples that landed in that function, summed across everything it called.

A wide flamegraph rendered from profiling data

The first thing to understand about a flamegraph is that height means nothing and width means everything. A tall thin spike is a deep call stack that the CPU barely visits. A wide plateau is where the time actually goes. Your eye wants to follow the tall towers because they look important. Ignore them. You're hunting for the widest box at any given level that isn't just a framework dispatcher.

the bit i didn't expect

I went in with a theory. I was sure the time was in our serialisation layer, because it always is, and serialisation is the kind of thing that quietly gets more expensive as your objects grow. The flamegraph had other ideas.

There was a fat plateau, easily a fifth of the total samples, sitting under the logging library. Specifically under the function that formats the timestamp on every log line. We were doing structured logging, a line per request, and each line called a date formatter that, under the hood, was allocating a new buffer, parsing a format string, and consulting the timezone database. Every single log line. Under load that was tens of thousands of times a second, and it had been hiding in plain sight because no individual call was slow. It was death by a million tiny formats.

The serialisation layer I'd been so sure about? A thin sliver. Barely registered.

Profiling output and source code side by side

This is the whole argument for profiling instead of guessing, made concrete. I have a decade of intuition about where programs spend their time, and that intuition pointed confidently at the wrong place. The flamegraph doesn't have a theory. It just shows you the samples, and the samples were piling up in date formatting.

the fix, and what it bought

The fix was almost insulting in its simplicity. The timestamp for a given second is the same for every log line in that second, so we cache the formatted string and only recompute it when the second ticks over. A few lines:

var (
    lastSec   int64
    lastStamp string
    stampMu   sync.Mutex
)

func timestamp(t time.Time) string {
    sec := t.Unix()
    stampMu.Lock()
    defer stampMu.Unlock()
    if sec != lastSec {
        lastSec = sec
        lastStamp = t.Format("2006-01-02T15:04:05Z07:00")
    }
    return lastStamp
}

Under heavy logging that turns thousands of format calls per second into roughly one. The p99 dropped by about a third immediately, and the allocation rate fell off a cliff, which had the pleasant knock-on effect of easing GC pressure too. The thing that had been slowly strangling the service was a convenience we'd never thought twice about.

A few things I took away from the exercise. Profile before you optimise, obviously, but more than that, profile before you even form an opinion, because the opinion will anchor you. The cheapest-looking code, the stuff you call so often you stop seeing it, is exactly where a small per-call cost becomes an enormous aggregate one. And flamegraphs are genuinely worth the hour it takes to get fluent in reading them. Once you've seen one wide plateau point at something you'd never have suspected, you stop trusting your gut and start trusting the samples.

The logger is still doing structured logging on every request. It just isn't asking the timezone database for permission every time anymore.