the fast version was slower, and the cache told me why

I rewrote a hot loop last week to be cleaner. Fewer branches, an array of structs instead of three parallel slices, the sort of change you make because it reads better. It read better. It was also eleven percent slower, and I sat there for a good twenty minutes refusing to believe the benchmark.

The instruction count was lower. The new version genuinely did less work per iteration. So I stopped guessing and ran perf stat against both.

old: 0.4% cache-miss rate
new: 6.1% cache-miss rate

There it was. By packing the fields into one struct I'd made each record bigger, and the loop only ever touched one field of three. The old "ugly" layout kept that one field densely packed, so every cache line I pulled in was useful. My tidy struct dragged two fields I didn't need across the memory bus on every single iteration, and the prefetcher couldn't save me.
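To make the two layouts concrete, here is a minimal Go sketch. The field names (`Price`, `Quantity`, `Discount`), the `float64` types, and the summing loop are all assumptions for illustration; the real code isn't shown in this post. Both functions do the same arithmetic, and the only difference is how far apart the values they read sit in memory.

```go
package main

import "fmt"

// Record is a hypothetical array-of-structs element: three fields per record,
// 24 bytes in total on a typical 64-bit target.
type Record struct {
	Price    float64
	Quantity float64
	Discount float64
}

// Columns is the hypothetical parallel-slice layout: one dense slice per field.
type Columns struct {
	Price    []float64
	Quantity []float64
	Discount []float64
}

// SumPriceAoS touches one field per record, so consecutive reads are a whole
// Record apart and the unused fields ride along in every cache line.
func SumPriceAoS(recs []Record) float64 {
	var total float64
	for i := range recs {
		total += recs[i].Price
	}
	return total
}

// SumPriceSoA walks a single dense slice, so consecutive reads are adjacent
// and every byte fetched is a byte the loop uses.
func SumPriceSoA(c Columns) float64 {
	var total float64
	for _, p := range c.Price {
		total += p
	}
	return total
}

func main() {
	recs := []Record{{1, 10, 0.1}, {2, 20, 0.2}, {3, 30, 0.3}}
	cols := Columns{
		Price:    []float64{1, 2, 3},
		Quantity: []float64{10, 20, 30},
		Discount: []float64{0.1, 0.2, 0.3},
	}
	// Same answer either way; only the memory traffic differs.
	fmt.Println(SumPriceAoS(recs), SumPriceSoA(cols))
}
```

With three records the difference is invisible. It only shows up once the data is far larger than the cache, which is where a benchmark and a cache-miss counter earn their keep.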

The lesson isn't "structs of arrays good, arrays of structs bad", because it depends entirely on your access pattern. If the loop had read all three fields of every record, the struct would have been the right call. The lesson is that on modern hardware the cost of a loop like this is rarely the instructions. It's whether the bytes you need are already close to the core, sitting in cache. The CPU can retire billions of operations a second and will still happily sit idle for hundreds of cycles waiting on RAM.
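A rough back-of-the-envelope version of "are the bytes close", sketched in Go. It assumes a 64-byte cache line (common on x86-64, but check your target) and the same hypothetical three-`float64` record. It's simple arithmetic on sizes, not a model of what the hardware did in this benchmark.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cacheLineBytes is an assumption: 64 is typical, not universal.
const cacheLineBytes = 64

// Record is a hypothetical three-field record, as in the tidy struct layout.
type Record struct {
	A, B, C float64
}

// ValuesPerLine reports how many loop-relevant values arrive per cache line
// when consecutive values sit strideBytes apart.
func ValuesPerLine(strideBytes uintptr) float64 {
	return float64(cacheLineBytes) / float64(strideBytes)
}

// UsefulFraction reports what share of the fetched bytes the loop reads
// when it uses fieldBytes out of every strideBytes.
func UsefulFraction(fieldBytes, strideBytes uintptr) float64 {
	return float64(fieldBytes) / float64(strideBytes)
}

func main() {
	field := unsafe.Sizeof(float64(0)) // size of the one field the loop reads
	rec := unsafe.Sizeof(Record{})     // size of the whole packed record

	fmt.Printf("parallel slices: %.2f values per line, %.0f%% of bytes used\n",
		ValuesPerLine(field), 100*UsefulFraction(field, field))
	fmt.Printf("array of structs: %.2f values per line, %.0f%% of bytes used\n",
		ValuesPerLine(rec), 100*UsefulFraction(field, rec))
}
```

Under those assumptions the dense slice delivers three times as many useful values per line fetched, which is the kind of gap that instruction counts never show.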

I kept the parallel slices. They're uglier and they're faster, and I left a comment explaining why so the next person (probably me) doesn't tidy it again.