cache misses, and why the fast version was slower

A performance graph climbing across a dark dashboard

I rewrote a hot loop to be clever and it got slower. Not by a little. The "optimised" version took about forty percent longer than the naive one it replaced, which is the sort of result that makes you doubt your own benchmark harness before you doubt your code.

The original iterated over an array of structs in order and did its work in place. The new version, in the name of tidiness, gathered pointers into the structs and chased them. It was prettier. It was also pointer-chasing across memory in an order the CPU could not predict, and that is the whole story.

perf stat told me in one line. The clean version had a cache miss rate several times higher than the ugly one:

perf stat ./bench
   12,883,440  cache-misses   #   31.20% of all cache refs

Thirty-one percent. The naive version sat under five. The CPU was spending most of its time waiting on memory it could not prefetch, because I had taken a beautiful sequential access pattern and shredded it into random hops across the heap.

I reverted. The array-of-structs, walked in order, was already doing the right thing: predictable, sequential, prefetcher-friendly. My cleverness was the bug. The lesson I keep relearning is that on modern hardware the access pattern usually matters more than the instruction count, and the fastest loop is very often the boring one that lets the prefetcher do its job. Measure the cache misses before you congratulate yourself.