I rewrote a hot loop to do less work and it got slower. Not by a rounding error: the "optimised" version was meaningfully behind the naive one it replaced, on the same data, on the same box. That is the kind of result that makes you doubt your benchmark, then doubt your compiler, then eventually doubt yourself, in roughly that order.
The job was summing a value across a large grid of records. The original did the obvious thing: walk the array in order, add as you go. My clever version split the work to avoid recomputing an index each iteration and to skip a branch I thought was costing me. On paper it executed fewer instructions. In practice it was about 30% slower.
Fewer instructions, slower runtime. That combination almost always means you've stopped paying with CPU cycles and started paying with memory latency. The CPU isn't the bottleneck; the wait for data to arrive from RAM is. So I stopped guessing and asked perf what was actually happening.
perf stat -e cycles,instructions,cache-references,cache-misses ./bench
The numbers were stark. The naive version had a cache miss rate down in the low single digits. My "improved" version was missing the last-level cache several times more often. It was executing fewer instructions and then stalling on every other one of them, waiting for a line to come back from main memory, which on this machine costs the better part of a hundred cycles. You can save a handful of instructions and lose a hundred cycles in the same breath. That trade never comes out ahead.
The cause was access order. My restructuring had, without my noticing, changed the loop so it strode through memory rather than reading it sequentially. The original walked one cache line, used everything in it, moved to the next. Mine touched one element here, jumped a long way, touched one element there, and so dragged in a fresh 64-byte line to use 8 bytes of it before throwing it away. The prefetcher loves a straight sequential walk and it had no idea what mine was doing.
The fix was simply to walk the data in the order it's laid out in memory and let the arithmetic fall where it may. The "wasteful" index recomputation I'd been so pleased to eliminate costs nothing, because the CPU does it whilst it would otherwise be sitting idle waiting on the next line anyway. There is spare compute hiding in every memory stall, and you may as well spend it.
Two things I took away and keep having to relearn:
- On modern hardware, the layout and order of your memory access usually matters more than the instruction count. The cheap thing is arithmetic. The expensive thing is a trip to RAM.
- "Optimised" is a claim you measure, not a property you assert.
perf statfor thirty seconds would have stopped me shipping a regression I was proud of.
I reverted to something very close to the naive loop, kept one genuine improvement that didn't disturb the access pattern, and moved on a little humbler. The fast version was slow because it was clever in the one dimension the machine doesn't reward.