I kept reaching for Box<dyn Trait> and then feeling vaguely guilty about it, because the folklore says trait objects are slow: a vtable lookup on every call, no inlining, a pointer indirection where a generic would have given you a direct call the compiler can flatten. Generics get monomorphised, the argument goes, so each call site becomes concrete code the optimiser can chew on, whereas a dynamically dispatched call stays opaque to it. All true, at least whenever the compiler can't work out the concrete type and devirtualise the call. The question I'd never actually answered is: how much does it cost, in a workload that looks like mine rather than a microbenchmark designed to make the gap look big?
So I measured. The trait is trivial, a single method returning a number, and the work inside is just enough that the call isn't the entire cost. I ran it both ways with Criterion: once through &dyn Compute, once through a generic C: Compute so it monomorphises.
trait Compute {
    fn run(&self, x: u64) -> u64;
}

fn via_dyn(c: &dyn Compute, xs: &[u64]) -> u64 {
    xs.iter().map(|&x| c.run(x)).sum()
}

fn via_generic<C: Compute>(c: &C, xs: &[u64]) -> u64 {
    xs.iter().map(|&x| c.run(x)).sum()
}
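For anyone who wants to poke at this without setting up Criterion, here is a rough std-only sketch. It is not the harness behind the numbers in this post: AddOne is a hypothetical implementor I'm using for illustration, and a single Instant timing is far noisier than Criterion's statistics. Build with --release, or the comparison means nothing.

```rust
use std::hint::black_box;
use std::time::Instant;

trait Compute {
    fn run(&self, x: u64) -> u64;
}

// Hypothetical implementor, for illustration only.
struct AddOne;

impl Compute for AddOne {
    fn run(&self, x: u64) -> u64 {
        x.wrapping_add(1)
    }
}

fn via_dyn(c: &dyn Compute, xs: &[u64]) -> u64 {
    xs.iter().map(|&x| c.run(x)).sum()
}

fn via_generic<C: Compute>(c: &C, xs: &[u64]) -> u64 {
    xs.iter().map(|&x| c.run(x)).sum()
}

fn main() {
    let xs: Vec<u64> = (0..1_000_000).collect();
    let c = AddOne;
    let d: &dyn Compute = &c;

    // black_box hides the concrete type from the optimiser, so the
    // dyn path is less likely to be quietly devirtualised.
    let t = Instant::now();
    let a = via_dyn(black_box(d), black_box(xs.as_slice()));
    let dyn_time = t.elapsed();

    let t = Instant::now();
    let b = via_generic(black_box(&c), black_box(xs.as_slice()));
    let generic_time = t.elapsed();

    // Both paths must compute the same sum.
    assert_eq!(a, b);
    println!("dyn: {dyn_time:?}, generic: {generic_time:?}");
}
```

The black_box on the trait object matters more than it looks: hand the compiler a visible concrete type and it may turn the dynamic call back into a direct one, at which point you are benchmarking the same code twice.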
The generic version was faster, as expected, because the optimiser can inline run straight into the loop and then go to town: it sees the whole thing, unrolls a bit, drops the call entirely. The dyn version gives it nothing to inline, since the target of the call isn't known until runtime, so the loop pays for a real indirect call through the vtable on every iteration. On my machine the generic ran roughly two and a half times faster on this tight loop.
Two and a half times. That sounds damning until you look at the absolute numbers. We're talking about a loop doing almost nothing per element, where the function call is nearly the whole cost, so removing it naturally dominates. The moment run does any real work, a hash, a comparison, anything touching memory, the dispatch cost shrinks to a rounding error against the work itself. I added a deliberately modest amount of actual computation inside run and the gap collapsed to single-digit percent, then to noise.
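To give a feel for what "a modest amount of actual computation" can look like, here is a sketch with two implementors side by side. It is not the exact body from my benchmark: Trivial and Mixer are hypothetical names, the mixing rounds are a generic multiply-and-xorshift of the kind found in hash finalisers, and I've swapped sum() for a wrapping fold so the mixed values can't trip the overflow check in debug builds.

```rust
trait Compute {
    fn run(&self, x: u64) -> u64;
}

// Hypothetical: the near-empty body where dispatch dominates.
struct Trivial;

impl Compute for Trivial {
    fn run(&self, x: u64) -> u64 {
        x.wrapping_add(1)
    }
}

// Hypothetical: a few rounds of integer mixing, standing in for
// "real work" such as hashing.
struct Mixer {
    rounds: u32,
}

impl Compute for Mixer {
    fn run(&self, x: u64) -> u64 {
        let mut z = x;
        for _ in 0..self.rounds {
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^= z >> 31;
        }
        z
    }
}

// Same shape as the functions in the post, but with a wrapping sum.
fn via_dyn(c: &dyn Compute, xs: &[u64]) -> u64 {
    xs.iter().fold(0u64, |acc, &x| acc.wrapping_add(c.run(x)))
}

fn via_generic<C: Compute>(c: &C, xs: &[u64]) -> u64 {
    xs.iter().fold(0u64, |acc, &x| acc.wrapping_add(c.run(x)))
}

fn main() {
    let xs: Vec<u64> = (0..10_000).collect();
    let heavy = Mixer { rounds: 4 };

    // Whatever the per-call cost, both paths agree on the answer.
    assert_eq!(via_dyn(&Trivial, &xs), via_generic(&Trivial, &xs));
    assert_eq!(via_dyn(&heavy, &xs), via_generic(&heavy, &xs));
    println!("ok");
}
```

Turning rounds up and down is a cheap way to watch the ratio move: the indirect call costs the same each time, while the work it is amortised over grows.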
Which matches the thing I should have known going in. Dynamic dispatch costs you the call overhead and the lost inlining, and that only matters when the call overhead is a meaningful fraction of the total. In a hot inner loop over millions of trivial elements, use generics; the difference is real and free to take. Everywhere else, and that's most code, the trait object's flexibility, the smaller binary and the single non-monomorphised copy are worth more than a few nanoseconds you'll never measure in production.
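The flexibility is easiest to see when the concrete types aren't known until runtime. A minimal sketch, with everything in it invented for illustration (the Add and Mul implementors, the build_pipeline helper and its little "op:n" spec format): a Vec<Box<dyn Compute>> holds different types side by side, which a single generic parameter can't do.

```rust
trait Compute {
    fn run(&self, x: u64) -> u64;
}

// Hypothetical implementors.
struct Add(u64);
struct Mul(u64);

impl Compute for Add {
    fn run(&self, x: u64) -> u64 {
        x.wrapping_add(self.0)
    }
}

impl Compute for Mul {
    fn run(&self, x: u64) -> u64 {
        x.wrapping_mul(self.0)
    }
}

// Hypothetical helper: builds steps from a spec such as "add:2,mul:10".
// Malformed or unknown steps are silently skipped to keep the sketch short.
fn build_pipeline(spec: &str) -> Vec<Box<dyn Compute>> {
    spec.split(',')
        .filter_map(|step| {
            let (op, n) = step.trim().split_once(':')?;
            let n: u64 = n.parse().ok()?;
            match op {
                "add" => Some(Box::new(Add(n)) as Box<dyn Compute>),
                "mul" => Some(Box::new(Mul(n)) as Box<dyn Compute>),
                _ => None,
            }
        })
        .collect()
}

// Feeds x through each step in order, one dynamic call per step.
fn apply(pipeline: &[Box<dyn Compute>], x: u64) -> u64 {
    pipeline.iter().fold(x, |acc, step| step.run(acc))
}

fn main() {
    let pipeline = build_pipeline("add:2,mul:10");
    println!("{}", apply(&pipeline, 1));
}
```

You could get the same behaviour from an enum with a match, and sometimes that's the better call, but the trait object version lets new implementors arrive without touching the dispatch code.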
So I've stopped feeling guilty. I reach for generics when the dispatch is genuinely hot and I've got a benchmark saying so, and Box<dyn Trait> the rest of the time without apology. The guilt was the wrong instinct; the right one is to measure the specific loop before optimising it, which is the answer to nearly every performance question and the one I keep having to relearn.