I rewrote a hot loop to do less work, and it got slower. Not by a rounding error either: the "fast" version ran about 30% behind the one it replaced, reliably, on the same hardware, with the same input. That is the kind of result that makes you doubt your own benchmark before you doubt the code.
The original processed a list of records by walking an array of structs. Each struct held maybe forty bytes of fields, of which the loop touched two. The obvious improvement was to split those two fields out into their own tight arrays and iterate over those instead. Less data, fewer bytes per element, should be faster. Textbook structure-of-arrays. I made the change, ran the benchmark, and watched it lose.
The thing I had quietly broken was locality. In the original, the two fields I cared about sat next to each other inside the struct, so a single cache line fetch brought in both, plus a few neighbouring records for free. When I split them into separate arrays, I now had two independent streams to walk, and the access pattern that fed them was not sequential. It was driven by an index computed elsewhere, so each lookup jumped around. Two arrays, two sets of random-ish accesses, two streams of cache misses where there had been one mostly-predictable one.
perf stat made it embarrassingly clear. The "slow" original:
1,204,118,442 cache-references
18,332,901 cache-misses # 1.52 % of all cache refs
The "fast" rewrite:
1,331,067,615 cache-references
214,889,007 cache-misses # 16.14 % of all cache refs
Ten times the miss rate. I had cut the bytes touched and tripled the trips to main memory, and main memory is the expensive part. A cache miss costs you a couple of hundred cycles; the arithmetic I was trying to save costs you one or two. I optimised the cheap thing and pessimised the expensive thing.
The fix was not to undo the change but to fix the access pattern. The jumping-around was caused by processing records in their original order while the index pointed all over the split arrays. Sorting the work by that index first, so the two streams were walked roughly front to back, brought the miss rate back down under 3% and the rewrite finally beat the original, by a margin that actually justified the effort.
What I took away from it is that "less work" and "less memory traffic" are not the same axis, and on modern hardware the second one usually wins the argument. If you are going to reshape data for speed, profile the cache, not just the wall clock, because the wall clock will tell you that you got slower without ever telling you why. The why was sitting in cache-misses the whole time, waiting for me to look.