Ramblings of an aging IT geek
← Ramblings of an aging IT geek
performance

the rewrite that was meant to be faster, and was not

A "more efficient" data structure made a hot loop slower, and the reason was cache misses rather than instruction count, which took me an afternoon to accept.

A latency graph on a server monitoring dashboard

I rewrote a hot loop to do less work and it got slower. Not by a little. The "optimised" version took roughly forty per cent longer on the same input, and for an afternoon I refused to believe the numbers because they offended me.

The loop walks a large set of records and aggregates them. The original stored everything in a flat slice of structs and iterated straight through. The clever version, the one I was so pleased with, replaced the slice with a map keyed by ID and a separate slice of pointers, so I could skip records I had already processed without a linear scan. Fewer comparisons, less work per element, obviously faster. Except it was not.

measuring instead of guessing

The first rule, which I break constantly, is to measure before theorising. So I stopped staring at the diff and ran perf stat against both versions.

$ perf stat -e cache-references,cache-misses,instructions,cycles ./agg-old
   1,204,118,442      instructions
     412,556,901      cache-references
      9,114,228       cache-misses     #  2.21% of all cache refs

$ perf stat -e cache-references,cache-misses,instructions,cycles ./agg-new
     902,447,310      instructions
     448,991,067      cache-references
     128,773,540      cache-misses     # 28.68% of all cache refs

There it was, in plain sight. The new version really did execute fewer instructions, about a quarter fewer. And it spent all of that saving, and then a great deal more, waiting on memory. Cache misses went from two per cent to nearly thirty. The CPU was not short of work to do. It was short of data to do it on, and stalling while the data crawled in from main memory.

A code listing on a dark terminal

why pointers and maps hurt here

The flat slice of structs is contiguous. When the CPU loads one record, the prefetcher pulls in the next few for free, because they sit next to each other in memory. The loop streams through, and the hardware does exactly what it is built to do: predict a linear access pattern and stay ahead of it.

The "clever" version threw that away. A map of IDs to records scatters the records across the heap. A slice of pointers means each step of the loop is a pointer dereference to somewhere unpredictable. The prefetcher cannot help, because there is no pattern for it to spot. Every record access becomes a potential trip to main memory, and main memory is, in CPU terms, agonisingly far away. A hundred-odd cycles for a miss, against a handful for an L1 hit. You can save a lot of instructions and still lose, comfortably, if you pay that toll often enough.

The "skip already processed records" optimisation I was so proud of was solving a problem the original did not really have. The linear scan I removed was cheap, because it was a linear scan over contiguous memory, which is the thing CPUs are fastest at. I had optimised the abstract operation count and pessimised the thing that actually governs wall-clock time on modern hardware: memory access patterns.

what i changed

The fix was boring, which is usually how you know it is right. I went back to the flat, contiguous slice. Where I genuinely needed dedup, I added a separate compact bitset indexed by a dense integer rather than a map, so the dedup check stays cache-friendly too. The records themselves stay packed and in order.

// records is a contiguous []record, ordered for the scan.
// seen is a bitset over dense ids, not a map[uint64]struct{}.
for i := range records {
    r := &records[i]
    if seen.test(r.id) {
        continue
    }
    seen.set(r.id)
    agg.add(r)
}

A second code listing showing a tight loop

That version is faster than both of my earlier attempts, and the perf numbers stopped being embarrassing. Cache misses back down under three per cent, instruction count in between the two, wall time the lowest of the lot.

The lesson is one I keep relearning. Big-O counts operations and assumes they all cost the same, and on real silicon they do not. A linear scan over contiguous memory will routinely beat a "smarter" algorithm that chases pointers around the heap, because the constant factor hiding behind that pointer chase is enormous. Layout is not a detail you tidy up afterwards. On a hot loop, layout often is the algorithm. I knew that already, in the way you know things you do not act on, and the profiler reminded me with a forty per cent regression and a quietly raised eyebrow.