Ramblings of an aging IT geek
performance

the average latency was fine, which is why everyone was angry

Why a healthy-looking mean response time hid a tail of slow requests, and how percentiles told the story the average buried.

Our dashboard said the average response time was 80 milliseconds. Users said the site was slow. Both were telling the truth, and the gap between them is the whole point of this post.

The mean is a terrible way to talk about latency, because latency distributions are not symmetric: they have a long right tail. Most requests are fast, a few are catastrophically slow, and the mean smears the slow ones across the fast ones until the number looks healthy. If 99 requests take 50ms and one takes 3 seconds, your average is 79.5ms, call it 80. The number is technically correct and entirely useless. One in a hundred of your requests just waited three seconds, and the people behind them are the ones filing tickets.
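If you want to check that arithmetic yourself, here is a quick Python sketch. The sample is invented to match the example above; it is not taken from our logs.

```python
from statistics import mean, median

# Hypothetical sample: 99 fast requests at 50 ms and one slow one at 3000 ms.
latencies_ms = [50] * 99 + [3000]

avg = mean(latencies_ms)      # (99 * 50 + 3000) / 100
mid = median(latencies_ms)    # the typical request
slowest = max(latencies_ms)   # the request that files the ticket

print(f"mean={avg} ms, median={mid} ms, max={slowest} ms")
```

The mean lands at 79.5 ms, a value that not a single request in the sample actually experienced.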

What you want is percentiles. The p50 (median) is your typical experience. The p99 is the latency that 99% of requests come in under, so it marks where your unluckiest one-in-a-hundred requests begin, and on a busy site that's a lot of real people. The p99.9 is where the genuinely nasty stuff hides. The shape of these tells you far more than any single average ever will.
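For the curious, this is roughly what computing those numbers looks like. It's a minimal sketch using the nearest-rank definition of a percentile; the `percentile` helper and the sample data are my own illustration, and real monitoring tools often interpolate or approximate, so their figures can differ slightly.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample of 1000 requests: mostly fast, with a small slow tail.
latencies_ms = [45] * 980 + [400] * 15 + [2800] * 5

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)

print(f"p50={p50} ms, p99={p99} ms, p99.9={p999} ms")
```

On this made-up sample the p50 is 45 ms, the p99 is 400 ms and the p99.9 is 2800 ms, which is the kind of spread a single average flattens out.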

When I actually plotted ours, the median was 45ms, perfectly fine, but the p99 was 2.8 seconds. That's the gap. The average sat low because fast requests dominated the count, but a meaningful slice of traffic was falling off a cliff. The tail is where the user pain lives, and the tail is exactly what the mean hides.

A couple of things worth knowing once you start caring about tails.

First, you can't average percentiles. If one server reports a p99 of 200ms and another reports 400ms, the combined p99 is not 300ms. The real figure depends on how much traffic each server handled and on the shape of each distribution, and the two p99 values alone don't carry that information. You either compute percentiles from the raw data across both, or you use a structure built for it (histograms, t-digests) that can be merged correctly. A lot of dashboards quietly average per-host percentiles and produce a number that means nothing.
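Here is a small sketch of that trap, with two invented hosts whose p99s are 200ms and 400ms. Everything in it is an assumption for illustration: the host data, the helper names, and especially the "histogram", which is just a `Counter` keyed by exact latency value. Real histograms use bucket ranges and t-digests use centroids, but the merging idea is the same: you combine counts, not percentiles.

```python
import math
from collections import Counter

def percentile(samples, p):
    """Nearest-rank percentile over raw samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def histogram_percentile(hist, p):
    """Nearest-rank percentile from a {latency_ms: count} histogram."""
    total = sum(hist.values())
    rank = max(1, math.ceil(p * total / 100))
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value

# Hypothetical hosts: host_a is busy, host_b handles far less traffic.
host_a = [40] * 980 + [200] * 20
host_b = [60] * 90 + [400] * 10

p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)
averaged = (p99_a + p99_b) / 2               # what a naive dashboard shows
pooled = percentile(host_a + host_b, 99)     # computed from all raw samples

# Merging histograms means adding counts, which keeps the answer correct.
merged = Counter(host_a) + Counter(host_b)
from_hist = histogram_percentile(merged, 99)

print(f"averaged={averaged} ms, pooled={pooled} ms, merged histogram={from_hist} ms")
```

With these numbers the averaged figure is 300 ms while the true combined p99 is 200 ms, because the busy host supplies most of the requests. Shift the traffic split and the true value moves, but the naive average never does.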

Second, tail latency compounds. If a single page makes ten backend calls and waits for all of them, then your page is as slow as the slowest of the ten. Even if each call has a tidy p99, the page sees the tail far more often, because you're rolling the dice ten times. This is why a service that looks fine in isolation can make a product feel sluggish: fan-out multiplies your exposure to the tail.
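To put a number on the dice rolling, here is a back-of-the-envelope sketch. It assumes the backend calls are independent and that each one has the same chance of landing in its tail, which real systems only approximate; the function name is mine.

```python
def chance_of_hitting_tail(calls, pct=99.0):
    """Probability that at least one of `calls` independent requests
    is slower than its own `pct` percentile."""
    per_call_ok = pct / 100
    return 1 - per_call_ok ** calls

for fan_out in (1, 10, 50, 100):
    share = chance_of_hitting_tail(fan_out)
    print(f"{fan_out:>3} calls: {share:.1%} of pages see at least one p99-or-worse call")
```

Under those assumptions a page with ten calls hits the p99 tail on a bit under one load in ten, and at a hundred calls it is more likely than not.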

The cause of ours, when we dug in, was unsurprising: a cache that occasionally missed and fell through to a slow query, plus a connection pool that was a touch too small so requests sometimes queued waiting for a connection. Both were invisible at the mean and obvious at the p99.

So watch the percentiles, not the average. Put p50, p99 and p99.9 on the dashboard and let the mean retire quietly. The average makes a comforting headline, but it's the tail that's writing your support tickets.