dyn or generic? I finally measured instead of guessing

A code editor showing a Rust trait and two call sites

I keep reaching for impl Trait and generic bounds out of a vague sense that they are "faster", and reaching for Box<dyn Trait> only when the type system forces my hand. That instinct has never once been backed by a number I measured myself, which is a poor look for someone who will happily lecture juniors about not optimising blind. So I sat down over the weekend and actually benchmarked it.

The short version, for those who do not want the whole journey: for the kind of work I write, the difference is real but small, and almost always swamped by whatever the function is actually doing. If you are dispatching once per HTTP request you will never measure it. If you are dispatching once per pixel, you might.

The two shapes

Static dispatch is the one Rust prefers. You write a generic function, the compiler monomorphises it, and every call site ends up with a concrete type and a direct, inlinable call.

trait Shape {
    fn area(&self) -> f64;
}

struct Circle { r: f64 }
struct Square { s: f64 }

impl Shape for Circle {
    fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r }
}
impl Shape for Square {
    fn area(&self) -> f64 { self.s * self.s }
}

// static dispatch: monomorphised, inlinable
fn total_static<S: Shape>(shapes: &[S]) -> f64 {
    shapes.iter().map(Shape::area).sum()
}

Dynamic dispatch goes through a trait object. The concrete type is erased behind a fat pointer (data pointer plus vtable pointer), and each area() call is an indirect call through the vtable. Unless the optimiser can prove the concrete type and devirtualise the call, it cannot inline through it, which is the bit that actually costs you.

// dynamic dispatch: one function, vtable lookup per call
fn total_dynamic(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}
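The fat pointer is easy to see for yourself. A minimal sketch, assuming a current compiler on a mainstream target, where a trait-object pointer is two machine words wide; pointer_sizes is a hypothetical helper name, not something from the benchmark.

```rust
use std::mem::size_of;

trait Shape {
    fn area(&self) -> f64;
}

/// Returns the size in bytes of a thin reference, a trait-object reference,
/// and a boxed trait object, in that order.
/// `pointer_sizes` is a hypothetical helper used only for illustration.
fn pointer_sizes() -> (usize, usize, usize) {
    (
        size_of::<&f64>(),           // thin: data pointer only
        size_of::<&dyn Shape>(),     // fat: data pointer + vtable pointer
        size_of::<Box<dyn Shape>>(), // also fat, the Box just owns the data
    )
}
```

On a 64-bit machine I would expect this to report 8, 16 and 16 bytes, but treat that as an observation about today's compilers rather than a promise.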

There is a real ergonomic difference too, not just performance. The static version cannot hold a Vec of mixed shapes; every element has to be the same S. The dynamic version can hold circles and squares side by side. Quite often that capability, not speed, is the thing that decides which you use, and I think people forget that the choice is frequently made for you by the data.
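To make the mixed-collection point concrete, here is a small sketch that reuses the Shape, Circle and Square definitions from above. The mixed_total helper is a name invented for this example.

```rust
trait Shape {
    fn area(&self) -> f64;
}

struct Circle { r: f64 }
struct Square { s: f64 }

impl Shape for Circle {
    fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r }
}
impl Shape for Square {
    fn area(&self) -> f64 { self.s * self.s }
}

fn total_dynamic(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

/// Hypothetical helper: sums one circle and one square held in the same Vec.
fn mixed_total() -> f64 {
    // Different concrete types, one element type: Box<dyn Shape>.
    let shapes: Vec<Box<dyn Shape>> = vec![
        Box::new(Circle { r: 1.0 }),
        Box::new(Square { s: 2.0 }),
    ];
    total_dynamic(&shapes)

    // The generic version has no equivalent. This would not compile,
    // because a Vec needs a single element type:
    // let shapes = vec![Circle { r: 1.0 }, Square { s: 2.0 }];
}
```

An enum with one variant per shape is the other common way to get a mixed Vec, but it only works when you know the full set of shapes up front.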

A criterion benchmark report open in a browser

Measuring it properly

I used criterion, because rolling your own timing loop in 2021 is a way to measure your own mistakes rather than the code. The trap with a micro-benchmark like this is that the optimiser will happily delete the whole thing if it can prove the result is unused, so the numbers come out as zero nanoseconds and you feel clever for about four seconds. black_box is what keeps it honest: I wrap the inputs explicitly, and criterion's b.iter passes the closure's return value through black_box for me.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

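// Shape, Circle, total_static and total_dynamic are the definitions shown earlier.
fn bench(c: &mut Criterion) {
    // The same 1000 circles in both layouts: stored inline, and boxed behind dyn Shape.
    let circles: Vec<Circle> = (0..1000).map(|i| Circle { r: i as f64 }).collect();
    let boxed: Vec<Box<dyn Shape>> =
        (0..1000).map(|i| Box::new(Circle { r: i as f64 }) as Box<dyn Shape>).collect();

    // black_box hides the input from the optimiser; b.iter black_boxes the returned sum.
    c.bench_function("static", |b| {
        b.iter(|| total_static(black_box(&circles)))
    });
    c.bench_function("dynamic", |b| {
        b.iter(|| total_dynamic(black_box(&boxed)))
    });
}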

criterion_group!(benches, bench);
criterion_main!(benches);
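If you want to see what black_box does outside of criterion, the standard library has its own version. This sketch assumes a toolchain new enough to have std::hint::black_box (stable since Rust 1.66); sum_of_squares and time_once are names made up for the example, and a single timed run like this is only an illustration, not a substitute for criterion's statistics.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Stand-in workload for the example.
fn sum_of_squares(xs: &[f64]) -> f64 {
    xs.iter().map(|x| x * x).sum()
}

/// Hypothetical helper: runs the workload once and returns (result, elapsed).
fn time_once(xs: &[f64]) -> (f64, Duration) {
    let start = Instant::now();
    // Inner black_box: the optimiser may not assume anything about the input.
    // Outer black_box: the result counts as "used", so the work is not deleted.
    let result = black_box(sum_of_squares(black_box(xs)));
    (result, start.elapsed())
}
```

Without the two black_box calls, a constant input and an ignored result give the optimiser everything it needs to make the measured region disappear.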

On my machine, release build, over a thousand elements, static dispatch came out around 1.1 microseconds for the whole iteration and dynamic around 1.7. So dynamic was a bit over fifty percent slower in this hot, tight loop where the work per element (a couple of multiplies and an add) is trivially small. That sounds alarming until you remember what it is a percentage of: the gap is about 600 nanoseconds spread across a thousand calls. We are talking roughly 0.6 nanoseconds per call.
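The arithmetic behind that is short enough to write down. A sketch with the two timings from my run plugged in; the function names are invented for the example.

```rust
/// Extra cost per call in nanoseconds, given whole-loop timings in microseconds.
/// Hypothetical helper for the arithmetic only.
fn per_call_overhead_ns(static_us: f64, dynamic_us: f64, calls: u32) -> f64 {
    (dynamic_us - static_us) * 1000.0 / f64::from(calls)
}

/// How much slower the dynamic loop is, as a percentage of the static one.
fn slowdown_percent(static_us: f64, dynamic_us: f64) -> f64 {
    (dynamic_us / static_us - 1.0) * 100.0
}
```

With 1.1 and 1.7 microseconds over 1000 calls, that works out to about 0.6 ns per call and a slowdown of about 55 percent.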

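The reason the gap exists at all is not the vtable lookup itself, which is cheap. It is that the indirect call blocks inlining, and once area() cannot be inlined, the compiler can no longer optimise across the call boundary: it cannot autovectorise the multiplies or keep PI in a register for the whole loop the way it can for the monomorphised version. The Box probably contributes as well, since each element now sits behind its own heap pointer instead of inline in the Vec, and this benchmark does not separate that effect out. The indirect call is a rounding error. The lost optimisations around it are the actual cost.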

What changed my mind, slightly

The honest result is that the answer depends entirely on the size of the body. I reran with a heavier area(), one that did a chunk of real floating-point work, and the gap collapsed to noise: both versions were dominated by the arithmetic, and the dispatch overhead became unmeasurable. That is the case that actually matches almost everything I write. The functions behind my trait objects do real work: parse a request, hit a cache, format a response. The dispatch is a vanishingly small slice of the total.
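For a sense of what "heavier" means, here is a hypothetical stand-in, not the body I actually benchmarked: a Heavy shape whose area() does 64 rounds of sin and sqrt per call. The type, its field and the loop count are all assumptions made for illustration.

```rust
trait Shape {
    fn area(&self) -> f64;
}

// Hypothetical shape: stands in for "a method that does real work".
struct Heavy {
    r: f64,
}

impl Shape for Heavy {
    fn area(&self) -> f64 {
        // 64 rounds of sin/abs/sqrt per call, so the arithmetic dwarfs
        // the cost of getting into the function in the first place.
        let mut acc = 0.0;
        for k in 1..=64u32 {
            acc += (self.r * f64::from(k)).sin().abs().sqrt();
        }
        acc
    }
}

// Same two entry points as before.
fn total_static<S: Shape>(shapes: &[S]) -> f64 {
    shapes.iter().map(Shape::area).sum()
}

fn total_dynamic(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}
```

Dropping a type like this into the same criterion harness is the experiment to try: the less area() resembles a single multiply, the less the shape of the call matters.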

So I have updated my rules of thumb, which were really just superstitions wearing a high-vis jacket.

If the trait method does meaningful work, use whichever shape makes the code clearer. The dispatch cost will not show up in a profile. Reach for dyn freely when you want a heterogeneous collection or a smaller binary.
If you genuinely have a tiny method called in a tight inner loop, millions of times, and a profiler has pointed a finger at it, then static dispatch can be worth the monomorphisation cost. That is also exactly the situation where you have a profiler result to justify the decision, rather than a vibe.
Generics are not free either. Monomorphisation bloats compile times and binary size, because the compiler stamps out a copy of the function for every concrete type. On a large codebase that is a real, ongoing tax you pay on every build, and dyn is sometimes the faster choice once you count the developer's time waiting for cargo build.
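The "copy per concrete type" point shows up in the types themselves. In this sketch, which reuses the definitions from earlier, the generic function has to be instantiated once per shape before it can be used as a plain function pointer, while the dyn version is a single function that covers every shape. static_instances and dynamic_instance are hypothetical helper names.

```rust
trait Shape {
    fn area(&self) -> f64;
}

struct Circle { r: f64 }
struct Square { s: f64 }

impl Shape for Circle {
    fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r }
}
impl Shape for Square {
    fn area(&self) -> f64 { self.s * self.s }
}

fn total_static<S: Shape>(shapes: &[S]) -> f64 {
    shapes.iter().map(Shape::area).sum()
}

fn total_dynamic(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

/// Hypothetical helper: one instantiation of total_static per concrete type.
fn static_instances() -> (fn(&[Circle]) -> f64, fn(&[Square]) -> f64) {
    // Each turbofish names a separate monomorphised function.
    let for_circles: fn(&[Circle]) -> f64 = total_static::<Circle>;
    let for_squares: fn(&[Square]) -> f64 = total_static::<Square>;
    (for_circles, for_squares)
}

/// Hypothetical helper: the dyn version is one function for every shape.
fn dynamic_instance() -> fn(&[Box<dyn Shape>]) -> f64 {
    total_dynamic
}
```

Two shapes and one small function is nothing, of course. The cost only becomes visible when the generic function is large and the list of concrete types is long.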

The thing I will take away is smaller than the benchmark and more useful: I had a strongly held performance belief that I had never tested, and when I tested it the answer was "it depends, and usually not enough to care". I suspect a fair number of my other strongly held performance beliefs are in the same boat. The benchmark took an afternoon. The guessing had cost me cleaner code for years.