the flamegraph that pointed at the one function i trusted

A flamegraph rendered above a rack of servers

A request handler that should have been comfortably under 50ms was sitting at 200ms under load, and I had a theory. I always have a theory. The handler did three database calls and then assembled a response, so obviously the time was in the database calls, and obviously the fix was a better query or an index I had forgotten. I spent the better part of an afternoon being obviously wrong before I gave up guessing and actually measured the thing.

The reason I am writing this down is that the answer was nowhere near the database. It was in a fifteen-line helper I had written months earlier, never thought about since, and would have sworn was free.

measuring instead of guessing

The service is a Go service, so the tooling is right there and I have no excuse for not reaching for it sooner. I added a profiling endpoint, hammered the handler with a load generator, and pulled a thirty-second CPU profile:

import _ "net/http/pprof"

# then, under load:
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30

The text view of a profile is honest but hard to read. top gives you a flat list of the heaviest functions, which tells you where the time went but not how you got there. The first time you sort by cumulative and see your own database driver near the top, you nod, because that confirms the theory you walked in with. It is a trap. The flat list flatters whatever you already believe.

The flamegraph is the view that stops you fooling yourself. Stacks on the y-axis, width proportional to time on the x-axis, and the eye is drawn straight to the widest bar whether or not it fits your story.

A flamegraph with one unexpectedly wide frame near the leaves

The database calls were there, sensible and narrow, roughly the width I expected. The wide bar, the one that made me sit up, was a frame called formatEventTime. My helper. The thing I would have bet money was a rounding error in the total.

the helper i trusted

Here is roughly what it did. It took an event, formatted a timestamp, and returned a string for the response:

func formatEventTime(t time.Time, tz string) string {
    loc, _ := time.LoadLocation(tz)
    return t.In(loc).Format("2006-01-02 15:04:05 MST")
}

It looks innocent. It is not. time.LoadLocation reads the timezone database off disk, parses it, and builds a *time.Location. On the platforms where the zoneinfo lives in a file, that is real I/O and real parsing, every single call. And this helper was not called once per request. The handler assembled a list, and it called formatEventTime once per row, several hundred times per response, hundreds of times a second under load. Each call quietly went and re-read the timezone database from scratch.

That is the part flamegraphs are good at and intuition is bad at. The cost per call was small enough that nothing about the function looked alarming in isolation. It was the multiplication, function times call count times request rate, that turned a quiet little helper into the single hottest thing in the process. You cannot feel multiplication by reading code. You can see it instantly as a wide bar.

The fix was embarrassing in its simplicity. Load each location once and cache it:

var locCache sync.Map // string -> *time.Location

func locationFor(tz string) *time.Location {
    if v, ok := locCache.Load(tz); ok {
        return v.(*time.Location)
    }
    loc, err := time.LoadLocation(tz)
    if err != nil {
        loc = time.UTC
    }
    locCache.Store(tz, loc)
    return loc
}

Same correctness, the disk read happens once per zone for the life of the process rather than once per row. The 200ms handler dropped back under 50ms, and the database calls I had been ready to rewrite turned out to be perfectly fine all along. I had been about to optimise the wrong code, with confidence.

what i took from it

A few things I keep relearning, written down so perhaps I learn them this time.

Profile before you optimise, every time, even when, especially when, you are sure you already know the answer. The certainty is the symptom.
The flat top list confirms your bias. The flamegraph argues with it. Reach for the picture.
A function being short and obviously-correct says nothing about how much CPU it burns. Cost is correctness times call count, and call count is invisible in the source.
Anything that touches disk, the network, or a parser has no business being on a per-row hot path. If it is, cache the expensive part and pay for it once.

I now have a slightly healthier reflex. When something is slow, I no longer start with a theory about which query is to blame. I start with a profile, and I let the widest bar tell me where to look, even when it points straight at the one function I trusted.