Cache Misses, and Why the Fast Version Was Slow

A latency graph on a screen

I rewrote a hot loop last week to be faster, benchmarked it, and it was slower. Not by a rounding error. By about forty per cent. The new version did strictly less work, allocated less, and had a tighter inner loop. It should have walked it. Instead it lost, repeatedly and reproducibly, and I spent the better part of a day finding out why. The short version is that I had optimised the instructions and pessimised the memory, and on modern hardware the memory is what matters.

The change that should have been a win

The job is unglamorous: walk a large array of records, pull a couple of fields out of each, and accumulate some totals. The original used an array of structs. The "fast" rewrite split it into separate arrays per field, the classic struct-of-arrays transform, because the textbook says that is more cache-friendly when you only touch some of the fields. And it is, when you do it right.

I did not do it right. In the original, the records were a flat slice, contiguous in memory, walked front to back. In my rewrite I had built the field arrays by appending into maps keyed by record ID and then iterating the map. That iteration order is, for all practical purposes, random. So I had taken a beautifully sequential access pattern and turned it into a scatter of pointer chases all over the heap.

A diagram of sequential versus scattered memory access

The CPU did not care that I was executing fewer instructions. It cared that almost every access was a cache miss, and a main-memory fetch costs you a few hundred cycles while the prefetcher sits there unable to guess where you are going next. You can do an enormous amount of arithmetic in the time it takes to wait on one cache line from DRAM. I had traded cheap instructions for expensive waits and felt clever doing it.

Seeing it rather than guessing

I could have argued about this in the abstract forever. What ended it was perf stat, which on Linux will just tell you. Running both versions under it made the problem impossible to ignore:

perf stat -e cache-references,cache-misses,instructions,cycles ./bench

The rewrite executed fewer instructions, exactly as I had hoped, and had a cache miss rate several times higher. The instructions-per-cycle number, the IPC, told the real story. The original sat comfortably above two. The rewrite was scraping along below one, stalled, waiting on memory it could not predict. Fewer instructions, but each one taking far longer because the data was not there when it was needed.

A perf stat output comparison

There is a humbling lesson buried in that. Instruction count is the metric we instinctively reach for because it is easy to reason about. We count the loops, we count the branches, we feel like we understand the cost. But on a machine where a cache miss is two orders of magnitude more expensive than an add, your mental model of "less code is faster" quietly stops being true. The thing you cannot see in the source, the access pattern, dominates the thing you can.

What actually fixed it

The fix was to keep the struct-of-arrays idea but build the arrays in a single contiguous pass, in the original order, and never put them in a map at all. Allocate the field slices up front at the known length, fill them by index as I walk the source once, then iterate them straight through. Sequential in, sequential out. The prefetcher loves a predictable stride and rewards you handsomely for giving it one.

ids := make([]uint64, n)
amounts := make([]int64, n)
for i, r := range records {
    ids[i] = r.ID
    amounts[i] = r.Amount
}

That is it. No map, no random order, no scatter. With that change the rewrite finally did what I had promised it would, comfortably beating the original because now it had both the better layout and the sequential access. The IPC came back up above two, the cache misses dropped to where they should have been, and the forty per cent regression turned into a respectable win.

A few things I am taking away from this, beyond the obvious bruise to my ego. Benchmark the thing you actually changed, in isolation, before you celebrate, because I nearly shipped the slow version convinced it was fast. Reach for perf stat early, because IPC and cache misses turn a vague "why is this slow" into a number you can act on. And treat memory layout as a first-class part of the design, not an afterthought, especially in any loop that runs over a lot of data. The compiler will optimise your instructions. It will not fix the order in which you decided to touch memory. That part is on you, and on modern hardware it is usually the part that decides whether you win.