the fast version was slower, and the cache told me why

A line graph next to a rack of servers

I rewrote a hot loop, halved the instruction count, and made the whole thing slower. That sentence sat wrong with me for most of a Wednesday, so this is the writeup of how I climbed out of it.

The job is unglamorous: take a few million records, score each one against a small model, bin the results. It runs on a schedule and nobody thinks about it until it's late. It had been creeping past its window, so I had a poke. The inner loop walked an array of structs, pulled three fields out of each, did some arithmetic, and wrote a result. Classic.

My "improvement" was obvious and wrong. The struct had grown over the years: timestamps, flags, a couple of strings, the lot. The loop only touched three numeric fields. So I split the array of structs into separate arrays per field, the textbook structure-of-arrays move, on the theory that I'd stop dragging cold bytes through cache. Fewer fields per cache line, better locality, faster loop. Everyone knows this.

It ran 18% slower.

measure the thing, not your model of the thing

I had a theory and a stopwatch, which is how you fool yourself. The fix was to stop theorising and ask the CPU what it was actually doing. On Linux that's perf, and it costs nothing to run.

perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses ./scorer --bench

The old version and the new version, side by side:

old:  1.00e10 instructions   2.1% LLC miss
new:  0.55e10 instructions   6.8% LLC miss

Half the instructions, more than three times the last-level cache miss rate. The new code did less work and waited longer to do it. Instructions are cheap. A miss to main memory is two hundred-odd cycles of nothing, and the core just sits there.

Profiler output and a flame graph on screen

The reason, once I stopped being clever, was almost embarrassing. The three fields I wanted were adjacent in the original struct. One struct was 64 bytes, near as makes no difference one cache line, and the three hot fields lived in the same line. The array of structs was, accidentally, already laid out for the access pattern I had. By splitting it into three arrays I turned one cache line per record into three separate streams, each touching a different page, each with its own prefetcher to keep happy. The prefetcher is good but it isn't psychic, and three independent strides through three large arrays is exactly the workload that defeats it.

the actual fix

Once I understood that, the fix was tiny and slightly humbling: leave the layout alone, and just shrink the struct so the hot fields and nothing else share a line. I moved the strings and the rarely-touched flags into a side table keyed by index, so the main array became a tight 32-byte record. Cold data lives elsewhere, looked up only when something actually needs it, which is rarely.

// hot: 32 bytes, fits two per cache line
struct record {
    uint64_t id;
    float    score;
    float    weight;
    uint32_t bin;
    uint32_t flags_idx;   // index into the cold side table
    uint32_t _pad;
};

perf stat again, and now the miss rate sat under 1%. Throughput went up by a bit over a third versus where I started. Same algorithm. Same arithmetic. I just stopped making the memory subsystem do extra laps.

Memory layout diagram with cache lines highlighted

a detour into what the prefetcher actually does

It's worth being concrete about why three streams hurt, because the intuition "more arrays, more locality" is so seductive and so wrong here. Modern hardware prefetchers watch the addresses you touch and try to predict the next ones. Walk an array forwards at a fixed stride and the prefetcher catches on quickly: it sees you reading line N, then N+1, then N+2, and starts fetching ahead of you so the data's already in cache by the time the loop asks for it. One predictable stream is a prefetcher's happy place.

The catch is that there's a limited number of these streams the hardware can track at once, and each one consumes resources. When I split one array of structs into three parallel arrays, I went from one predictable stride to three, each marching through a different region of memory, each needing its own tracking slot, each competing for the same finite prefetch and cache budget. On top of that, three large arrays touch three times as many distinct pages, which leans harder on the TLB. None of this shows up in the instruction count. All of it shows up in the cycles spent stalled.

There's a tidy way to see this beyond the miss rate. perf stat will also give you instructions-per-cycle if you ask, and the gap between the two versions was stark: the old loop retired well over two instructions per cycle, the "improved" one barely managed one. The core was idle half the time, waiting on memory it had been promised would arrive and didn't. Same work, half the throughput, because the data wasn't where it needed to be when it was needed.

what I'd do differently next time

The honest takeaway is procedural, not technical. I changed the layout because a rule of thumb told me to, and I didn't measure the assumption underneath the rule. The assumption was "the loop drags cold bytes through cache". It was easy to check and I didn't check it, because checking is dull and being clever is fun.

The dull check is a single perf record followed by perf report, which would have shown me that the loop wasn't memory-bound to begin with, that the hot fields already shared a line, and that there were no cold bytes to drag because the struct was already smaller than I thought. The "improvement" was solving a problem the code didn't have, and creating one it didn't have before.

A couple of things I've taken from this and keep having to relearn.

Instruction count is a vanity metric on modern hardware. The machine will happily retire several instructions a cycle if the data is to hand, and stall for hundreds of cycles if it isn't. Optimising for fewer instructions while ignoring where the data lives is optimising the cheap thing.

Structure-of-arrays is a real and useful technique, but it earns its keep when you stream one or two fields over a huge dataset, or when you want to vectorise. It is not a free win you sprinkle on everything. If your hot fields already share a line, splitting them is strictly worse.

And the boring one: profile before and after, on the real input size, with hardware counters and not just wall-clock. A stopwatch tells you that you got slower. perf tells you why, and "why" is the only part you can act on. I spent an hour being clever and ten minutes reading counters, and only the ten minutes did any good.

The job finishes inside its window now and nobody thinks about it again, which is the highest praise a batch job can earn.