the fast version was slower, and the cache was why

A performance graph on a server dashboard

The point first: I rewrote a hot loop to be obviously more efficient, with fewer comparisons and a smarter lookup, and it got slower. Not by a rounding error, by about 30%. The reason was cache misses, and the only way I found it was to stop reading the code and start measuring the silicon.

The setup is a thing I do a lot: I have a few hundred thousand records in memory, and I'm doing a pass over all of them looking up a value per record. The original version stored everything in a flat Vec<Record> and did a linear scan with an early-exit comparison. Crude. The kind of thing you look at and think, I can do better than that.

So I did better than that. I built a HashMap<Key, Record> so the lookup was O(1) instead of O(n), replaced the scan, ran the benchmark, and watched it get worse. My first assumption was that I'd botched the benchmark. I hadn't. My second assumption was hash overhead, that hashing the key cost more than the comparisons I'd saved. Closer, but not the real story.

measuring instead of guessing

This is where you put the editor away and reach for perf. On the loop in question:

$ perf stat -e cache-references,cache-misses,instructions,cycles ./bench

   1,204,883,012   cache-references
     412,556,901   cache-misses        #   34.24% of all cache refs
   8,991,204,455   instructions
  14,203,889,712   cycles

A 34% cache-miss rate is the tell. The fast version was executing fewer instructions, just as I'd intended, but it was spending most of its cycles stalled, waiting on main memory. The CPU was idle, dressed up as busy.

Source code on a screen

Here is the bit I'd forgotten. A Vec<Record> is contiguous. When you walk it linearly, the hardware prefetcher sees the access pattern, fetches the next cache lines before you ask for them, and by the time the loop wants record N+1 it is already sitting in L1. The linear scan was "slow" on paper and beautifully cache-friendly in practice. A modern cache line is 64 bytes, and if your records are small you get several of them per line for free.

The HashMap threw all of that away. The hash scatters records across the bucket array with no relationship between key order and memory order, so each lookup is a jump to a more-or-less random address. The prefetcher cannot predict random. Every lookup became a likely L2 or L3 miss, and an L3 miss is in the order of 60 to 100ns, which is geological compared to the handful of nanoseconds a cache-resident comparison costs. I had traded a predictable pattern the hardware loves for an algorithmically superior one the hardware cannot help with.

the fix was less cleverness, not more

The honest fix was to go back towards the flat layout. The lookups were not actually random in my workload, the keys arrive roughly sorted, so I sorted the Vec once and used a branch-predictable scan with the early exit intact. For the cases that genuinely needed random access I kept a small index of (key, position) pairs, itself a flat sorted slice I binary-search, so the index stays cache-resident and the records stay contiguous.

$ perf stat -e cache-references,cache-misses ./bench

     308,114,200   cache-references
      19,442,001   cache-misses        #    6.31% of all cache refs

Six percent instead of thirty-four. The loop is now faster than both my clever version and the original naive one, because it does few comparisons and keeps the prefetcher fed.

More source code on a screen

The lesson I keep having to relearn: Big-O counts operations, and operations are not the thing you pay for. You pay for cycles, and on current hardware a cache miss can cost more than a hundred ordinary operations. A data structure with a worse asymptotic complexity but a contiguous memory layout will frequently beat a theoretically superior one that scatters its data across the heap. Mechanical sympathy is the unfashionable name for it, and it is real.

None of this is a knock on hash maps. They are the right answer when the working set is large and access genuinely is random and the alternative is a real linear scan over millions of items. My mistake was reaching for one reflexively, optimising the part of the program a textbook cares about while ignoring the part the CPU cares about. The CPU, it turns out, has opinions, and perf stat is how you ask what they are. Look at cache-misses before you trust your instincts about what's fast.