your average latency is lying to you

Here's the short version, because it's the part that matters: stop watching average latency. It is the single most reassuring and least useful number on your dashboard. It will sit there at a comfortable 40ms while a meaningful slice of your users are waiting two seconds, and it will do it with a straight face.

I learned this properly on a service that everyone agreed was "fast". The mean response time was about 45ms and the graph was flat and green. Meanwhile support kept getting the occasional ticket about pages that hung. The two facts sat next to each other for weeks. Nobody connected them, because the average said there was nothing to connect.

the maths is against the average

Latency distributions are not symmetric. They can't be. There's a floor (you can't serve a request in negative time) but no ceiling (a request can always be slower). So the distribution has a long tail to the right, and the mean gets dragged toward that tail in a way that hides where most requests actually live.

Worse, the average smears the rare slow request across all the fast ones until it disappears. If 99 requests take 20ms and one takes two seconds, the mean is about 40ms. Looks fine. But one in a hundred of your users just had a genuinely bad time, and on a page that fires twenty backend calls, roughly 18% of page loads (1 - 0.99^20, if the calls are independent) will hit that slow path at least once. The slow tail is contagious across a request that fans out.
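
If you want to check that arithmetic yourself, here is a minimal Python sketch. The function name is mine, and it assumes every backend call is slow independently with the same probability, which real systems only approximate.

```python
# 99 fast requests and one slow one, as in the example above.
latencies_ms = [20] * 99 + [2000]
mean_ms = sum(latencies_ms) / len(latencies_ms)


def chance_of_slow_call(p_slow: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` calls is slow.

    Assumes each call is slow independently with probability `p_slow`.
    """
    return 1 - (1 - p_slow) ** fan_out


page_hit_rate = chance_of_slow_call(0.01, 20)

print(f"mean latency: {mean_ms:.1f}ms")
print(f"page loads that hit a slow call: {page_hit_rate:.1%}")
```

The mean comes out at 39.8ms, and a bit over 18% of twenty-call page loads touch the slow path. A one-in-a-hundred problem per call is closer to a one-in-five problem per page.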

percentiles tell you who's hurting

The fix is to stop summarising with one number and start looking at percentiles. p50 is the median, the experience of a typical request. p95 and p99 are the slow tail, the experience of your unluckiest users. The gap between p50 and p99 is the bit the average eats.
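
To make the definitions concrete, here is a rough sketch of a nearest-rank percentile in Python. The sample data is made up, and nearest-rank is only one of several percentile definitions, so a real metrics library may give slightly different numbers.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up sample: 980 requests at 20ms and 20 requests at 2000ms.
sample_ms = [20] * 980 + [2000] * 20

mean_ms = sum(sample_ms) / len(sample_ms)
p50, p95, p99 = (percentile(sample_ms, q) for q in (50, 95, 99))

print(f"mean={mean_ms:.1f}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

On this sample the mean is 59.6ms, p50 and p95 are both 20ms, and p99 is 2000ms. Only the p99 line admits that 2% of requests took two seconds.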

On that "fast" service, the numbers told a very different story once I pulled them:

p50   18ms
p90   31ms
p95   52ms
p99   1240ms
p999  3100ms

A median of 18ms and a p99 of over a second. The average of 45ms sat between them looking innocent. Roughly one request in a hundred was over a second, and one in a thousand was over three. On a busy service that's not a rounding error, it's thousands of bad experiences a day, neatly hidden by a green graph.

why you can't average percentiles

A trap I fell into early: you cannot average percentiles across hosts or time buckets. If box A has a p99 of 100ms and box B has a p99 of 900ms, the fleet p99 is not 500ms. Percentiles don't work like that: the real answer depends on how much traffic each box served and the shape of each tail, and neither survives in a single number. To get a real aggregate you need the underlying distribution, which in practice means histograms.
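
Here is a small Python sketch of the box A and box B case. The request counts and latencies are invented so that the two boxes land on p99s of 100ms and 900ms, and the percentile helper uses the nearest-rank definition.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical traffic: both boxes are mostly fast, box B has a fatter tail.
box_a_ms = [20] * 980 + [100] * 20
box_b_ms = [20] * 950 + [900] * 50

p99_a = percentile(box_a_ms, 99)
p99_b = percentile(box_b_ms, 99)
averaged_p99 = (p99_a + p99_b) / 2

# The honest fleet number comes from pooling the raw samples first.
fleet_p99 = percentile(box_a_ms + box_b_ms, 99)

print(f"box A p99={p99_a}ms, box B p99={p99_b}ms")
print(f"averaged p99={averaged_p99:.0f}ms, real fleet p99={fleet_p99}ms")
```

With this data the averaged figure is 500ms and the real fleet p99 is 900ms. Change the traffic split or the tail sizes and the real figure moves, but nothing forces it to land on the average, and you only get it by going back to the pooled distribution.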

This is why a lot of current metrics tooling stores latency as histogram buckets rather than pre-computed percentiles. You record how many requests fell into each bucket, then compute the percentile across the merged buckets at query time, which works as long as every host uses the same bucket boundaries. Prometheus does exactly this with its histogram metric type and the histogram_quantile function, and it's the right shape for the problem. You lose a little precision to the bucket boundaries and you gain the ability to aggregate honestly.
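
The idea can be sketched in a few lines of Python. This is not Prometheus's code: the bucket bounds, function names and sample data are all mine, and the quantile estimate is a simplified linear interpolation in the same spirit as histogram_quantile.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in milliseconds; the last bucket is open-ended.
BOUNDS_MS = [10, 25, 50, 100, 250, 500, 1000, 2500, float("inf")]


def observe(counts, latency_ms):
    """Increment the first bucket whose upper bound is >= latency_ms."""
    counts[bisect_left(BOUNDS_MS, latency_ms)] += 1


def merge(a, b):
    """Histograms with identical bounds merge by element-wise addition."""
    return [x + y for x, y in zip(a, b)]


def estimate_quantile(counts, q):
    """Estimate quantile q (0..1) by interpolating inside the bucket it falls in."""
    target = q * sum(counts)
    seen = 0
    for i, count in enumerate(counts):
        if count and seen + count >= target:
            upper = BOUNDS_MS[i]
            if upper == float("inf"):
                return BOUNDS_MS[i - 1]  # best available answer for the overflow bucket
            lower = BOUNDS_MS[i - 1] if i else 0
            return lower + (upper - lower) * (target - seen) / count
        seen += count
    raise ValueError("empty histogram")


# Two hosts record into their own histograms (made-up traffic).
host_a = [0] * len(BOUNDS_MS)
host_b = [0] * len(BOUNDS_MS)
for ms in [20] * 980 + [100] * 20:
    observe(host_a, ms)
for ms in [20] * 950 + [900] * 50:
    observe(host_b, ms)

fleet_p99 = estimate_quantile(merge(host_a, host_b), 0.99)
print(f"estimated fleet p99: {fleet_p99:.0f}ms")
```

This prints an estimate of 800ms, while the slow requests in the made-up data were really 900ms. That gap is the precision you give up to the 500ms to 1000ms bucket, and it is still far closer to the truth than averaging the per-host p99s.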

The practical cost is that you have to choose your buckets to bracket the latencies you care about. Buckets that top out at 10ms are useless on a service whose interesting behaviour is at the one-second mark. Set the boundaries around your tail, not your median.
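
A quick way to sanity-check a bucket layout is to count how many observations fall past its last finite boundary. The sketch below does that in Python for two hypothetical layouts, reusing the percentile values quoted earlier in this post as stand-in observations.

```python
def overflow_count(bounds_ms, latencies_ms):
    """Count observations above the highest finite bound (the catch-all bucket)."""
    return sum(1 for value in latencies_ms if value > bounds_ms[-1])


# The percentile values from earlier in the post, used as sample observations.
observed_ms = [18, 31, 52, 1240, 3100]

# Two hypothetical layouts: one built around a fast median, one around the tail.
median_focused = [1, 2.5, 5, 10]
tail_focused = [25, 50, 100, 250, 500, 1000, 2500, 5000]

lost_to_overflow = overflow_count(median_focused, observed_ms)
kept_in_range = len(observed_ms) - overflow_count(tail_focused, observed_ms)

print(f"median-focused layout: {lost_to_overflow} of {len(observed_ms)} in overflow")
print(f"tail-focused layout: {kept_in_range} of {len(observed_ms)} inside a real bucket")
```

With the first layout every one of those values lands in the overflow bucket, so 18ms and 3100ms look identical. With the second, each falls into a bucket with a real upper bound.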

coordinated omission, the one that gets everyone

There's a subtler trap that even careful people fall into, and Gil Tene has spent years banging the drum about it: coordinated omission. If your load generator sends a request, waits for the response, then sends the next one, it accidentally stops sending requests exactly when the system is slow. So your slowest moments are under-sampled, because the measuring tool politely backed off during them. Your p99 comes out flattering because the worst periods barely got measured.
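
A toy simulation shows the size of the effect. Everything here is invented for illustration: a system that answers in 1ms except during a single one-second stall, a tester that wants to send every 10ms, and the simplifying assumption that no backlog builds up once the stall ends.

```python
import math

INTERVAL_MS = 10       # intended gap between requests
DURATION_MS = 10_000   # length of the test
SERVICE_MS = 1         # assumed latency when the system is healthy
STALL_START_MS, STALL_END_MS = 5_000, 6_000  # hypothetical one-second stall


def latency_if_sent_at(t_ms):
    """A request sent during the stall waits for it to end, then is served."""
    if STALL_START_MS <= t_ms < STALL_END_MS:
        return (STALL_END_MS - t_ms) + SERVICE_MS
    return SERVICE_MS


def closed_loop():
    """Send, wait for the reply, then send the next one."""
    samples, now = [], 0
    while now < DURATION_MS:
        latency = latency_if_sent_at(now)
        samples.append(latency)
        now += max(INTERVAL_MS, latency)  # a slow reply delays every later send
    return samples


def open_loop():
    """Send on the intended schedule whether or not earlier replies have arrived."""
    return [latency_if_sent_at(t) for t in range(0, DURATION_MS, INTERVAL_MS)]


def percentile(samples, q):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


closed_samples = closed_loop()
open_samples = open_loop()
closed_p99 = percentile(closed_samples, 99)
open_p99 = percentile(open_samples, 99)

print(f"closed loop: {len(closed_samples)} samples, p99={closed_p99}ms")
print(f"open loop:   {len(open_samples)} samples, p99={open_p99}ms")
```

The closed-loop tester records 901 samples, exactly one of them slow, and reports a p99 of 1ms. The open-loop schedule records all 1000 intended requests, 100 of which were caught by the stall, and reports a p99 of 901ms. Same system, same stall; the only difference is whether the tool kept asking while things were bad.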

The same thing happens in production monitoring that only times requests that completed. The requests that timed out, or that queued so long the user gave up and refreshed, often don't make it into the histogram at all. The very events you most want to see are the ones most likely to be missing. So treat a suspiciously clean tail with suspicion, and check whether your measurement is quietly excluding the bad cases.
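
One cheap guard on the production side is to record the duration in a finally block, so a request that raises or times out still leaves a sample behind. The sketch below is Python with made-up names, and a plain list stands in for a real histogram.

```python
import time

recorded_ms = []  # stand-in for a real histogram


def timed(call):
    """Run `call` and record how long it took, even if it raises."""
    start = time.perf_counter()
    try:
        return call()
    finally:
        recorded_ms.append((time.perf_counter() - start) * 1000)


def flaky_backend():
    """Hypothetical backend call that gives up after a short wait."""
    time.sleep(0.05)
    raise TimeoutError("backend did not answer in time")


try:
    timed(flaky_backend)
except TimeoutError:
    pass  # the caller still sees the failure; the latency was recorded anyway

print(f"recorded {len(recorded_ms)} sample(s), slowest {max(recorded_ms):.0f}ms")
```

This only covers requests that reach your timing code. A user who gives up and refreshes while the request is still queued upstream is invisible here too, so it narrows the blind spot without closing it.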

what I actually changed

Three things, in order of how much they helped.

First, every latency dashboard now shows p50, p95 and p99 as separate lines, and the average is gone. Not demoted, gone. If I want a single headline number it's p99, because that's the promise I'm actually making to users.

Second, alerts fire on the tail, not the mean. A p99 creeping past a threshold is an early warning; a rising average is a thing you notice once it's already bad.

Third, I went and found the cause of that p99 on the original service. It turned out to be a connection pool that was too small, so under load a fraction of requests sat waiting for a free connection while the rest sailed through. Invisible in the average, obvious in the tail. I bumped the pool size, and p99 dropped from 1.2 seconds to 60ms. The mean barely moved, which tells you everything about how much the mean was ever telling me.

The point isn't that percentiles are clever. It's that the average was always answering a question nobody asked. Your users don't experience the mean. They experience their own request, and some of them are out in the tail wondering why your fast service is so slow.