the optimisation that made everything slower

A line graph of server metrics climbing across a dark dashboard

Here's the short version: I rewrote a hot loop to do less work, and it got slower. Roughly 30% slower, consistently, on the same machine with the same input. The new code executed fewer instructions and allocated less. By every measure I'd reach for first, it should have been faster. It wasn't, and the reason was cache misses, which is the part of performance work nobody warns you about until it ruins your afternoon.

what the code did

The job is dull: a simulation step over a few million entities, run many times. Each entity has a position, a velocity, and a pile of other state, and on every step we update the position of those that are active. The original was a plain array of structs, looped front to back.

The "optimisation" was the kind of thing that looks obviously correct in a code review. The active set was usually a small fraction of the total, so why iterate over everything? I added an index of active entity IDs and looped over that instead, jumping into the main array by ID. Fewer entities touched, fewer branches, less work. Ship it.

Except the ship was slower.

A code editor showing a struct-of-arrays refactor

measuring instead of guessing

The first rule when an optimisation surprises you is to stop reasoning about it and go and measure. I'd convinced myself the indexed version had to be faster, and that conviction was worth nothing. So I ran both under perf:

perf stat -e cycles,instructions,cache-references,cache-misses ./sim --steps 5000

The numbers were stark. The new version did execute fewer instructions, about 18% fewer, exactly as I'd hoped. It also had roughly four times the cache misses. The instructions-per-cycle figure had cratered. The CPU was spending most of its time waiting for memory rather than doing anything, and no amount of "fewer instructions" helps when each one stalls for a couple of hundred cycles fetching a cache line from main memory.

That's the whole story in one sentence: the old version walked memory in order, the new one jumped around it.

why ordered access wins

Modern CPUs are extremely good at streaming through contiguous memory. The hardware prefetcher spots a linear access pattern and pulls the next cache lines in before you ask for them, so the data is sitting in L1 by the time the loop reaches it. The original front-to-back walk was a gift to the prefetcher. Even though it touched inactive entities it didn't care about, it touched them in order, the lines were already warm, and the wasted work was nearly free.

My indexed version destroyed that. The active IDs weren't sorted, so jumping into the main array by ID was effectively random access across a structure far larger than the cache. Every lookup was a fresh miss. The prefetcher had no pattern to follow. I'd traded a small amount of "useless but free" linear work for a large amount of "useful but ruinously expensive" random work. The CPU does not thank you for being clever about which bytes you touch if you touch them in an order it can't predict.

the actual fix

Two changes got it properly fast, and neither was the index.

First, I sorted the active IDs before the loop. That alone recovered most of the loss, because sorted IDs turn random access back into a mostly-forward walk, and the prefetcher started doing its job again. Sorting costs something, but it's a cheap, cache-friendly operation amortised over the whole step, and it paid for itself many times over.

Second, and this is where the real win lived, I changed the layout from an array of structs to a struct of arrays. Position got its own contiguous array, velocity another, and the rarely-touched state moved out of the hot path entirely. The update loop now streams through two tight arrays of exactly the data it needs, with nothing else sharing the cache lines.

// before: array of structs, lots of cold fields on every line
struct Entity { pos: Vec3, vel: Vec3, /* ...lots more... */ }
entities: Vec<Entity>

// after: struct of arrays, hot data packed together
positions:  Vec<Vec3>,
velocities: Vec<Vec3>,
// cold state lives elsewhere

With the cold fields gone from the hot lines, each cache line now carries far more of the data the loop actually uses, so fewer fetches cover the same work. The combination, sorted access plus struct of arrays, ended up about twice as fast as the original. The original I'd been trying to beat in the first place.

what I took away

A few things stuck.

Instruction count is a poor proxy for speed once you're memory-bound. The question is rarely "how much work" and almost always "in what order, and how densely packed".
The hardware prefetcher is the quiet hero of most fast loops. Linear, predictable access is worth optimising for, even when it means doing nominally redundant work.
Measure before and after, every time. I'd have shipped a 30% regression on the strength of an argument that sounded airtight.

The annoying part is that the indexed version really would be faster in a language with a smaller working set, or if the active fraction were tiny and the IDs already clustered. Performance is contextual. The only thing that travels is the discipline of going and looking at cache-misses before believing your own theory.

A flame graph and a terminal full of perf counters

The "optimisation" stayed in the codebase, sort of. It's just that what made it fast had nothing to do with the idea I started with, and everything to do with how the bytes were arranged. That's usually how it goes.