A service I look after had an average response time of 40ms. The dashboard was green. The graph was flat. And users were complaining the thing felt slow. For a while I assumed they were imagining it, which is the first refuge of an engineer who trusts his dashboard more than his users. They were right and the dashboard was lying. Not maliciously. The dashboard was showing me an average, and averages are where latency goes to hide.
why the mean is useless here
The mean works beautifully for things that cluster around a middle. Latency does not cluster around a middle. It has a floor, because nothing can be faster than the work actually takes, and it has a long, ugly tail stretching off to the right because anything can go wrong: a slow disk, a lock, a garbage collection pause, a retry, a noisy neighbour on the box.
So your distribution looks nothing like a bell curve. It looks like a wall on the left and a long thin smear to the right. Take the mean of that and you get a number that describes nobody. Most requests are faster than the mean. A few are vastly slower. The mean sits in the empty gap between, a statistic with no constituents.
Here is the example that finally made it click for me. Imagine a hundred requests. Ninety-nine of them take 10ms. One of them takes 3 seconds, because a connection pool was exhausted and it waited.
99 requests × 10ms = 990ms
1 request × 3000ms = 3000ms
----------
total = 3990ms
mean = 3990ms / 100 = 39.9ms, call it 40ms
There's my 40ms. Looks healthy. But one in a hundred of my users just waited three seconds, and on a busy page that fires a dozen requests, roughly one page load in nine hits that slow path at least once (1 - 0.99^12 is about 11%, assuming the slow requests land independently). The average smeared one genuinely terrible experience across ninety-nine fine ones and called the result "fine".
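If you'd rather check the arithmetic than take my word for it, here is a minimal sketch in Python. The hundred requests are the ones from the example; the twelve requests per page and the assumption that slow requests land independently of each other are mine, for illustration only.

```python
import statistics

# The hundred requests from the example: 99 fast ones, 1 stuck behind
# an exhausted connection pool.
latencies_ms = [10] * 99 + [3000]

mean_ms = statistics.mean(latencies_ms)      # 39.9
median_ms = statistics.median(latencies_ms)  # 10.0

# Assumption: a page fires 12 requests, and each one independently has
# a 1-in-100 chance of landing on the slow path.
requests_per_page = 12
p_slow = 1 / 100
p_page_hits_slow = 1 - (1 - p_slow) ** requests_per_page

print(f"mean   = {mean_ms:.1f}ms")
print(f"median = {median_ms:.1f}ms")
print(f"page loads hitting the slow path: {p_page_hits_slow:.1%}")
```

The mean lands on 39.9ms, a latency that not one of the hundred requests actually had.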
percentiles tell you the truth
The fix is to stop asking "what's the typical latency" and start asking "what latency do my users actually experience, including the unlucky ones". That's percentiles.
- p50, the median: half of requests are faster than this. The honest "typical".
- p95: 95% are faster. The other 5% are worse. This is where the tail starts to bite.
- p99: 99% are faster. The 1% above this are your three-second connection-pool victims.
- p99.9: the truly unlucky. At any real scale, this is a lot of people: at a million requests a day, a thousand of them are slower than this.
For my service, p50 was 9ms, p95 was 30ms, and p99 was 2.8 seconds. There it was. The complaint, sitting in plain view, completely invisible to the average. The shape of the distribution was the whole story, and the mean had thrown that shape away.
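To make the definitions concrete, here is a rough sketch of computing percentiles from raw samples using the nearest-rank method. The sample data is made up to have a similar shape to my service (a wall at 9ms and a thin smear out to 2.8 seconds); it is not the real measurements, and `percentile` is my own helper, not a library function.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up sample of 1000 requests: mostly fast, with a thin slow tail.
latencies_ms = [9] * 940 + [30] * 45 + [2800] * 15

mean_ms = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean = {mean_ms:.2f}ms")
print(f"p50 = {p50}ms, p95 = {p95}ms, p99 = {p99}ms")
```

On this sample the mean comes out around 52ms while p99 is 2800ms. Same data, very different story.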
the bit nobody tells you about percentiles
You cannot average percentiles. This catches people out constantly, including me, more than once. If box A reports a p99 of 100ms and box B reports a p99 of 200ms, the p99 of the combined traffic is not 150ms. There is no arithmetic you can do on those two numbers to recover the true combined p99. The percentile is a property of the full set of measurements, and you threw the measurements away when you collapsed them into a percentile.
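A small sketch shows why. The two traffic mixes below are invented for illustration. In both, box A reports a p99 of 100ms and box B reports a p99 of 200ms, yet the combined p99 differs, so no formula over those two numbers alone can be right. `percentile` is my own nearest-rank helper.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over raw samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Mix 1: box A takes most of the traffic.
box_a_1 = [20] * 980 + [100] * 20   # 1000 requests
box_b_1 = [20] * 98 + [200] * 2     # 100 requests

# Mix 2: box B takes most of the traffic.
box_a_2 = [20] * 98 + [100] * 2     # 100 requests
box_b_2 = [20] * 980 + [200] * 20   # 1000 requests

results = []
for box_a, box_b in [(box_a_1, box_b_1), (box_a_2, box_b_2)]:
    p99_a = percentile(box_a, 99)
    p99_b = percentile(box_b, 99)
    averaged = (p99_a + p99_b) / 2
    combined = percentile(box_a + box_b, 99)  # needs the raw samples
    results.append((p99_a, p99_b, averaged, combined))
    print(f"A={p99_a}ms B={p99_b}ms averaged={averaged}ms true={combined}ms")
```

Both mixes report the same per-box numbers and the same 150ms "average", but the true combined p99 is 100ms in the first and 200ms in the second.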
This matters the moment you have more than one instance, which is to say always. If your metrics pipeline computes a p99 per host and then averages those across the fleet, the number on your dashboard is fiction. It can land above reality or below it, depending on how traffic and pain are spread across hosts, and when it lands below, it's a comforting fiction, which is the worst kind.
The honest way is to ship the underlying distribution, not the summarised percentile. Histograms with fixed buckets do this: each host reports counts per latency bucket, you sum the buckets across hosts, and you estimate the percentile from the merged histogram. Prometheus histograms work this way, which is why histogram_quantile operates over summed bucket counts and not over pre-baked numbers. It costs you a little cardinality and some precision, since the estimate is only as fine as your bucket boundaries, and buys you a percentile you can actually trust across a fleet.
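Here is a minimal sketch of the merge-then-estimate idea in Python. It is loosely modelled on how bucketed histograms are queried, with linear interpolation inside the bucket, but it is not Prometheus's code: the bucket boundaries, the function names and the two hosts' traffic are all made up for illustration, and it uses plain per-bucket counts for simplicity.

```python
import bisect
from itertools import accumulate

# Hypothetical bucket upper bounds in ms. Every host must use the same ones.
BOUNDS_MS = [10, 25, 50, 100, 250, 500, 1000, 2500, 5000, float("inf")]

def new_histogram():
    return [0] * len(BOUNDS_MS)

def observe(counts, latency_ms):
    """Count one request in the first bucket whose upper bound covers it."""
    counts[bisect.bisect_left(BOUNDS_MS, latency_ms)] += 1

def merge(*histograms):
    """Sum per-bucket counts across hosts. This is the step percentiles lack."""
    return [sum(column) for column in zip(*histograms)]

def estimate_quantile(counts, q):
    """Estimate quantile q (0 < q <= 1), interpolating within the bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    target = q * total
    cumulative = list(accumulate(counts))
    i = next(idx for idx, seen in enumerate(cumulative) if seen >= target)
    lower = BOUNDS_MS[i - 1] if i > 0 else 0.0
    upper = BOUNDS_MS[i]
    if upper == float("inf"):
        return lower  # cannot interpolate into the open-ended bucket
    below = cumulative[i - 1] if i > 0 else 0
    return lower + (upper - lower) * (target - below) / counts[i]

host_a, host_b = new_histogram(), new_histogram()
for latency in [20] * 980 + [100] * 20:
    observe(host_a, latency)
for latency in [20] * 98 + [200] * 2:
    observe(host_b, latency)

merged = merge(host_a, host_b)
p99_est = estimate_quantile(merged, 0.99)
print(f"fleet p99 is roughly {p99_est:.1f}ms")
```

For this made-up traffic the true combined p99 is 100ms and the estimate comes out around 77.5ms: the right bucket (50ms to 100ms), with the interpolation guessing where inside it. That is the precision you trade away, and tighter buckets around the latencies you care about buy it back.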
what to actually do
Stop alerting on mean latency. It will be green while users suffer.
Put p50, p95 and p99 on the same graph. The gap between them is the shape of your tail, and the tail is where your reputation lives. When p50 is flat and p99 climbs, you have a tail problem: a slow dependency, lock contention, GC, something that hits a minority hard rather than everyone a little. When p50 itself rises, the whole service is slowing and that's a different, usually more honest, conversation.
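The same reading of the graph can be sketched as a crude check, if only to pin down what I mean by "climbs". Everything here is a hypothetical of mine: the function name, the 1.5x ratio that counts as a climb, and the sample numbers. Treat it as a way to think about the two cases, not as an alerting rule to copy.

```python
def diagnose(previous, current, climb_ratio=1.5):
    """Compare two snapshots of percentiles (dicts with 'p50' and 'p99', in ms).

    climb_ratio is an arbitrary illustrative threshold: a percentile has
    "climbed" if it grew by more than this factor between snapshots.
    """
    p50_climbed = current["p50"] > previous["p50"] * climb_ratio
    p99_climbed = current["p99"] > previous["p99"] * climb_ratio
    if p50_climbed:
        return "whole service slowing"
    if p99_climbed:
        return "tail problem"
    return "steady"

baseline = {"p50": 9, "p99": 30}

# p50 flat, p99 climbs: something is hitting a minority hard.
tail_case = diagnose(baseline, {"p50": 9, "p99": 2800})

# p50 itself rises: everyone is slower.
broad_case = diagnose(baseline, {"p50": 40, "p99": 120})

print(tail_case)   # tail problem
print(broad_case)  # whole service slowing
```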
And go and find the tail. The mean told me everything was fine for weeks. The p99 told me, in about ten seconds, that one connection pool was too small. I bumped the pool, the three-second tail vanished, and the average barely moved, because of course it didn't. It never knew the tail was there.
The average is a politician. It gives you a comfortable number and hides the people it's failing. The percentiles are the auditor. Listen to the auditor.