the average is fine, which is exactly the problem

[Image: a latency graph on a monitoring dashboard]

The dashboard said 45ms. The customer said the site was slow. One of us was lying, and it turned out to be the dashboard, or rather the person who'd built it, which was me.

The graph showed mean response time, and it sat reassuringly flat at around 45 milliseconds all day. By that number the service was healthy. But "mean response time" is the statistic you reach for when you want to feel good rather than learn something. An average flattens a distribution into a single comfortable lump and quietly buries everything interesting in the tail.

So I pulled the same data and plotted percentiles instead. The p50 was indeed about 40ms, which matched the average closely enough to explain why the average looked fine. The p99, though, was 1.8 seconds. One request in a hundred was taking the better part of two seconds, and the mean barely twitched because ninety-nine fast requests drown out one slow one every time.
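To make the comparison concrete, here is a minimal Python sketch on synthetic numbers. The sample values are invented for illustration and are not the data from my dashboard, and `percentile` is a hypothetical nearest-rank helper rather than whatever your monitoring stack actually uses.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Synthetic sample (an assumption): 985 fast requests and 15 slow ones, in milliseconds.
latencies_ms = [40] * 985 + [1800] * 15

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p99={p99_ms}ms")
```

On this made-up sample the mean stays within a few tens of milliseconds of the median, while the p99 is forty-five times larger. Real monitoring systems usually estimate percentiles from histograms rather than sorting raw samples, so treat this as the idea, not the implementation.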

[Image: a terminal showing latency percentile output]

One in a hundred sounds rare until you count requests. If a page makes twenty backend calls to render and each has an independent 1% chance of landing in the slow tail, about one page load in five (1 - 0.99^20 ≈ 18%) hits it; at seventy calls it's more than half. The user doesn't experience your p50. They experience the slowest thing their page had to wait for, and on a busy page that is the p99 far more often than "one in a hundred" suggests. The average response time describes a request that nobody actually makes.
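How often a page load hits the tail depends on how many calls it makes. A rough sketch of the arithmetic, assuming each call independently has the same chance of being slow (real calls are often correlated, so this is an approximation); `chance_page_hits_tail` is a name I made up for the example:

```python
def chance_page_hits_tail(calls_per_page, tail_fraction=0.01):
    """Probability that at least one of the page's calls lands in the slow tail.

    Assumes calls are independent and each is slow with probability tail_fraction.
    """
    return 1 - (1 - tail_fraction) ** calls_per_page


for calls in (1, 20, 69, 100):
    chance = chance_page_hits_tail(calls)
    print(f"{calls:>3} calls per page -> {chance:.1%} of page loads wait on the tail")
```

Under these assumptions a twenty-call page hits the tail on a bit under a fifth of loads, and it takes about sixty-nine calls before more than half of page loads are waiting on a p99-or-worse request.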

Finding the cause was the easy part once I knew to look. The slow tail correlated neatly with a connection pool that was occasionally exhausted, so the odd request sat waiting for a free connection before it could even start. Mean latency couldn't see it because the waiting requests were a tiny fraction of the total. The p99 saw it immediately, because the p99 is precisely the place where pool exhaustion lives.
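For anyone who wants to see the mechanism in isolation, here is a toy simulation of a fixed-size pool. Everything in it is an assumption made for illustration: the pool size, the 40ms service time, the steady arrival rate, and the single burst that exhausts the pool. It is not a model of my actual service.

```python
import heapq
import statistics


def simulate_pool(arrivals_ms, service_ms, pool_size):
    """Per-request latency (wait for a free connection + service time), first come, first served."""
    free_at = [0] * pool_size  # min-heap of the times at which each connection becomes free
    heapq.heapify(free_at)
    latencies = []
    for arrival in sorted(arrivals_ms):
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)  # wait only if every connection is busy
        finish = start + service_ms
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies


# Hypothetical traffic: one request every 50ms, plus a burst of 8 arriving together.
steady = [i * 50 for i in range(992)]
burst = [10_010] * 8
latencies_ms = simulate_pool(steady + burst, service_ms=40, pool_size=2)

waited = sum(1 for latency in latencies_ms if latency > 40)
print(f"requests={len(latencies_ms)} waited={waited}")
print(f"mean={statistics.mean(latencies_ms):.2f}ms "
      f"median={statistics.median(latencies_ms)}ms max={max(latencies_ms)}ms")
```

In this toy run only the requests caught behind the burst wait for a connection, roughly one percent of the total. The mean moves by about a millisecond while the slowest request takes several times the normal service time, which is the same shape I saw on the real graphs.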

I've since made a rule for myself: no latency graph ships with just an average on it. p50, p90, p99, and ideally the max, all on the same axis. The gap between p50 and p99 tells you more about the health of a system than either number alone. If they track closely you have a predictable service. If p99 floats miles above p50 you have a tail, and the tail is where your angriest users are sitting.
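If you want the gap as a single number to alert on, one option is the ratio of p99 to p50. The sketch below is only that: the threshold of 10 is an arbitrary value I picked for the example, not a recommendation, and both function names are hypothetical.

```python
TAIL_RATIO_ALERT = 10  # hypothetical threshold; tune it for your own service


def tail_ratio(p50_ms, p99_ms):
    """How many times slower the p99 is than the median."""
    if p50_ms <= 0:
        raise ValueError("p50 must be positive")
    return p99_ms / p50_ms


def has_heavy_tail(p50_ms, p99_ms, threshold=TAIL_RATIO_ALERT):
    """True when the p99 sits more than `threshold` times above the p50."""
    return tail_ratio(p50_ms, p99_ms) > threshold


# The numbers from this post: p50 of about 40ms, p99 of 1.8 seconds.
print(tail_ratio(40, 1800))      # 45.0
print(has_heavy_tail(40, 1800))  # True
print(has_heavy_tail(40, 60))    # False
```

A ratio hides the absolute values, so it belongs next to the p50, p90, p99, and max lines on the graph, not instead of them.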

Averages aren't useless. They're just the wrong tool for "is this fast enough for everyone", because everyone includes the unlucky person on the slow end, and the mean has politely averaged them out of existence.