the fast version was slower, and the cache told me why

A performance graph on a server

I had a hot loop that summed values out of a big struct-of-arrays, and I "improved" it. The improved version was 40% slower. Same machine, same data, fewer instructions on paper. It made no sense until I stopped looking at the algorithm and started looking at the memory.

The naive version walked one contiguous array start to finish. The clever version chased a pointer per element into a second structure to fetch a flag, then conditionally added. Cleaner code, obviously. Also a guaranteed cache miss every iteration, because that second structure was scattered all over the heap.

Here is roughly the shape of the two:

// fast in practice
for (size_t i = 0; i < n; i++)
    total += value[i];

// "smarter", much slower
for (size_t i = 0; i < n; i++)
    if (meta[i]->active)        // pointer chase, miss every time
        total += value[i];

The arithmetic is trivial in both. The difference is entirely in the second one waiting on RAM. A miss to main memory is on the order of a hundred-odd nanoseconds; an L1 hit is a couple of cycles. Do that per element across a few million elements and the maths does itself.

Code on screen

proving it rather than guessing

I did not want to trust a hunch, so I let perf settle the argument.

perf stat -e cache-references,cache-misses,instructions,cycles ./bench

The slow build had roughly the same instruction count but a cache miss rate up near 90%, and the cycles-per-instruction was the giveaway: the CPU was stalled, not working. Once you see CPI climb while instructions stay flat, you are not compute bound, you are waiting.

the fix was layout, not cleverness

I flattened active into its own contiguous array of bytes sitting right next to value, so the predictable prefetcher could do its job and both reads stayed warm.

for (size_t i = 0; i < n; i++)
    if (active[i])
        total += value[i];

Back to faster than the original, because now the branch is the only cost and the data is already in cache by the time the loop reaches it. No pointer chasing, no scattered allocations.

The lesson I keep relearning: on modern hardware the algorithm in your head is rarely the bottleneck. The memory access pattern is. Code that looks tidy can be cruel to the cache, and code that looks dumb (one flat array, walked in order) is often exactly what the prefetcher wants. Measure the misses before you trust the instruction count.