Here's the short version, because it's the part that matters: stop watching average latency. It is the single most reassuring and least useful number on your dashboard. It will sit there at a comfortable 40ms while a meaningful slice of your users are waiting two seconds, and it will do it with a straight face.
I learned this properly on a service that everyone agreed was "fast". The mean response time was about 45ms and the graph was flat and green. Meanwhile support kept getting the occasional ticket about pages that hung. The two facts sat next to each other for weeks, unconnected, because the average said there was nothing to connect.
the maths is against the average
Latency distributions are not symmetric. They can't be. There's a floor (you can't serve a request in negative time) but no ceiling (a request can always be slower). So the distribution has a long tail to the right, and the mean gets dragged toward that tail in a way that hides where most requests actually live.
Worse, the average smears the rare slow request across all the fast ones until it disappears. If 99 requests take 20ms and one takes two seconds, the mean is about 40ms. Looks fine. But one in a hundred of your users just had a genuinely bad time, and on a page that fires twenty backend calls, roughly 18% of page loads will hit that slow path at least once (1 - 0.99^20, if the calls are independent). The slow tail is contagious across a request that fans out.
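If you want to check that arithmetic yourself, here is a minimal sketch. The helper name p_any_slow is mine, and it assumes each backend call is slow independently with the same probability, which real systems only approximate.

```python
# 99 fast requests and one slow one, as in the example above.
latencies_ms = [20] * 99 + [2000]
mean_ms = sum(latencies_ms) / len(latencies_ms)


def p_any_slow(p_slow: float, fanout: int) -> float:
    """Chance that at least one of `fanout` calls is slow.

    Assumption: calls are independent and each is slow with probability p_slow.
    """
    return 1 - (1 - p_slow) ** fanout


print(f"mean: {mean_ms:.1f} ms")                                   # mean: 39.8 ms
print(f"page loads hitting the tail: {p_any_slow(0.01, 20):.1%}")  # 18.2%
```

Push the fan-out to 100 calls and the same formula says most page loads touch the tail at least once.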
percentiles tell you who's hurting
The fix is to stop summarising with one number and start looking at percentiles. p50 is the median, the experience of a typical request. p95 and p99 are the slow tail, the experience of your unluckiest users. The gap between p50 and p99 is the bit the average eats.
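To make the definitions concrete, here is a small sketch using the nearest-rank method on made-up data. The sample mix (940 fast, 45 medium, 15 slow) is invented for illustration and is not from the service I'm describing; other percentile definitions interpolate and give slightly different answers.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that has at least
    p percent of all samples at or below it. `p` is an integer 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: mostly fast, a few medium, a thin slow tail.
latencies_ms = [20] * 940 + [50] * 45 + [1200] * 15

mean_ms = sum(latencies_ms) / len(latencies_ms)  # about 39 ms
print("mean", mean_ms)
print("p50 ", percentile(latencies_ms, 50))  # 20
print("p95 ", percentile(latencies_ms, 95))  # 50
print("p99 ", percentile(latencies_ms, 99))  # 1200
```

The mean lands near 39ms, a value that almost no request in this data set actually experienced.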
On that "fast" service, the numbers told a very different story once I pulled them:
p50     18ms
p90     31ms
p95     52ms
p99     1240ms
p999    3100ms
A median of 18ms and a p99 of over a second. The average of 45ms sat between them looking innocent. Roughly one request in a hundred was over a second, and one in a thousand was over three. On a busy service that's not a rounding error, it's thousands of bad experiences a day, neatly hidden by a green graph.
why you can't average percentiles
A trap I fell into early: you cannot average percentiles across hosts or time buckets. If box A has a p99 of 100ms and box B has a p99 of 900ms, the fleet p99 is not 500ms. Percentiles don't work like that. To get a real aggregate you need the underlying distribution, which in practice means histograms.
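A quick sketch of why, with invented traffic: box A serves 9000 requests with a p99 of 100ms, box B serves 1000 with a p99 of 900ms. The request counts and latency values are assumptions chosen to match the example, not measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-host samples (milliseconds).
box_a = [20] * 8900 + [100] * 100  # busy host, mild tail
box_b = [20] * 980 + [900] * 20    # quiet host, ugly tail

p99_a = percentile(box_a, 99)               # 100
p99_b = percentile(box_b, 99)               # 900
averaged = (p99_a + p99_b) / 2              # 500.0, and meaningless
fleet_p99 = percentile(box_a + box_b, 99)   # 100, from the merged samples

print(p99_a, p99_b, averaged, fleet_p99)
```

Shift more of the traffic onto box B and the true fleet p99 moves toward 900ms. The averaged figure stays at 500ms either way, because it never sees how many requests each box served.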
This is why the tooling that's getting popular now stores latency as histogram buckets rather than pre-computed percentiles. You record how many requests fell into each bucket, then compute the percentile across the merged buckets at query time. Prometheus does exactly this with its histogram type and histogram_quantile, and it's the right shape for the problem. You lose a little precision to the bucket boundaries and you gain the ability to actually aggregate honestly.
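Here is a simplified pure-Python imitation of that idea: count per bucket, merge by adding counts, then estimate the quantile by interpolating linearly inside the bucket that contains it. It is a sketch of the technique, not Prometheus's actual code, and the bucket bounds, function names and sample data are all my own assumptions.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in milliseconds; the last slot is overflow.
BOUNDS_MS = [10, 25, 50, 100, 250, 500, 1000, 2500, 5000]


def new_histogram():
    return [0] * (len(BOUNDS_MS) + 1)


def observe(counts, latency_ms):
    # Index of the first bucket whose upper bound is >= the latency.
    counts[bisect_left(BOUNDS_MS, latency_ms)] += 1


def merge(a, b):
    # Histograms with identical bounds merge by adding counts.
    return [x + y for x, y in zip(a, b)]


def quantile(counts, q):
    """Estimate the q-quantile (0 < q <= 1) by linear interpolation
    within the bucket that contains it."""
    target = q * sum(counts)
    cumulative = 0
    for i, count in enumerate(counts):
        if count and cumulative + count >= target:
            if i == len(BOUNDS_MS):
                # Overflow bucket: all we know is "above the top bound".
                return float(BOUNDS_MS[-1])
            lower = BOUNDS_MS[i - 1] if i > 0 else 0
            upper = BOUNDS_MS[i]
            return lower + (upper - lower) * (target - cumulative) / count
        cumulative += count
    raise ValueError("empty histogram")


# Two hosts with invented traffic, each keeping its own histogram.
host_a, host_b = new_histogram(), new_histogram()
for ms in [20] * 8900 + [100] * 100:
    observe(host_a, ms)
for ms in [20] * 980 + [900] * 20:
    observe(host_b, ms)

fleet = merge(host_a, host_b)
print(quantile(fleet, 0.99))  # roughly 60: somewhere in the (50, 100] bucket
```

The exact p99 of that merged data is 100ms and the estimate comes out around 60ms. That gap is the precision you give up to the bucket boundaries: the histogram can place the answer in the right bucket, not at the right point inside it.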
The practical cost is that you have to choose buckets that bracket the latencies you care about. Default buckets that top out at 10ms are useless on a service whose interesting behaviour is at the one-second mark. Set the boundaries around your tail, not your median.
coordinated omission, the one that gets everyone
There's a subtler trap that even careful people fall into, and Gil Tene has spent years banging the drum about it: coordinated omission. If your load generator sends a request, waits for the response, then sends the next one, it accidentally stops sending requests exactly when the system is slow. So your slowest moments are under-sampled, because the measuring tool politely backed off during them. Your p99 comes out flattering because the worst periods barely got measured.
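A toy simulation makes the effect visible. Everything here is invented: a service that normally answers in 1ms, a single one-second stall, and a generator that wants to send one request every 10ms. It also assumes every request caught in the stall completes the moment the stall ends, with no queueing afterwards, which is kinder than reality.

```python
import math

STALL_START_MS, STALL_END_MS = 1000, 2000  # hypothetical one-second stall
SERVICE_MS = 1                              # normal service time
INTERVAL_MS = 10                            # intended gap between requests
DURATION_MS = 10_000


def finish_time(sent_at):
    # Requests sent during the stall only complete once it is over.
    if STALL_START_MS <= sent_at < STALL_END_MS:
        return STALL_END_MS + SERVICE_MS
    return sent_at + SERVICE_MS


def percentile(samples, p):
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def closed_loop():
    """Send, wait for the response, then send again: the generator
    goes quiet for the whole stall."""
    latencies, t = [], 0
    while t < DURATION_MS:
        done = finish_time(t)
        latencies.append(done - t)
        t = max(t + INTERVAL_MS, done)
    return latencies


def open_loop():
    """Send on the intended schedule regardless of outstanding responses."""
    return [finish_time(t) - t for t in range(0, DURATION_MS, INTERVAL_MS)]


print("closed loop p99:", percentile(closed_loop(), 99))  # 1
print("open loop p99:  ", percentile(open_loop(), 99))    # 901
```

Same service, same stall. The closed-loop run records exactly one slow sample out of 901 and reports a p99 of 1ms. The open-loop run sees the hundred requests that should have been sent during the stall and reports 901ms.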
The same thing happens in production monitoring that only times requests that completed. The requests that timed out, or that queued so long the user gave up and refreshed, often don't make it into the histogram at all. The very events you most want to see are the ones most likely to be missing. So treat a suspiciously clean tail with suspicion, and check whether your measurement is quietly excluding the bad cases.
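One way to avoid that blind spot is to record the timing in a finally block, so failures and timeouts land in the data alongside successes. This is a minimal sketch: timed_call, the observed list and the simulated timeout are all hypothetical stand-ins for whatever your metrics library provides.

```python
import time

observed = []  # stand-in for a real histogram: (seconds, outcome) pairs


def timed_call(fn):
    """Run fn() and record its duration whether it returns or raises."""
    start = time.monotonic()
    outcome = "error"
    try:
        result = fn()
        outcome = "ok"
        return result
    finally:
        # Runs on success and on exception, so slow failures are not dropped.
        observed.append((time.monotonic() - start, outcome))


def slow_then_fail():
    time.sleep(0.05)
    raise TimeoutError("simulated timeout")


timed_call(lambda: "fine")
try:
    timed_call(slow_then_fail)
except TimeoutError:
    pass

print([outcome for _, outcome in observed])  # ['ok', 'error']
```

This still can't see requests the user abandoned before they reached your code, so it narrows the gap rather than closing it.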
what I actually changed
Three things, in order of how much they helped.
First, every latency dashboard now shows p50, p95 and p99 as separate lines, and the average is gone. Not demoted, gone. If I want a single headline number it's p99, because that's the promise I'm actually making to users.
Second, alerts fire on the tail, not the mean. A p99 creeping past a threshold is an early warning; a rising average is a thing you notice once it's already bad.
Third, I went and found the cause of that p99 on the original service. It turned out to be a connection pool that was too small, so under load a fraction of requests sat waiting for a free connection while the rest sailed through. Invisible in the average, obvious in the tail. I bumped the pool, and p99 dropped from 1.2 seconds to 60ms. The mean barely moved, which tells you everything about how much the mean was ever telling me.
The point isn't that percentiles are clever. It's that the average was always answering a question nobody asked. Your users don't experience the mean. They experience their own request, and some of them are out in the tail wondering why your fast service is so slow.