A dashboard once told me a service had an average response time of 40ms, and the on-call rota told me a different story entirely. Both were correct. That gap is the whole point of this post.
The mean is a single number doing the work of millions of requests, and it does that work by quietly drowning the bad ones. If 99 requests come back in 20ms and one takes 2 seconds, your average is about 40ms, which sounds lovely and describes precisely nobody's experience. The 99 fast users don't notice 20ms. The one slow user notices 2 seconds very much, writes in about it, and is statistically invisible on a chart of means.
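The arithmetic is worth checking rather than taking on trust. Here is a minimal Python sketch using the same numbers as above; the variable names are mine, and nothing in it comes from a real service.

```python
from statistics import mean, median

# 99 fast requests at 20 ms and one slow request at 2 s (2000 ms).
latencies_ms = [20] * 99 + [2000]

avg = mean(latencies_ms)        # what the dashboard shows
typical = median(latencies_ms)  # what 99 of the 100 users experienced
worst = max(latencies_ms)       # what the remaining user experienced

print(f"mean={avg:.1f} ms, median={typical:.0f} ms, max={worst} ms")
```

The mean lands between the two real experiences and matches neither of them.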
Percentiles fix this by refusing to average anything away. p50 is the median, the experience of a typical request. p99 is the latency that 99% of your requests come in at or under, which is to say the slowest 1% are worse than this. That slow 1% is not a rounding error you get to ignore. On a service handling 100 requests a second, it's 3,600 real requests an hour, and it's disproportionately the ones that hit a cold cache, a GC pause, a slow downstream, or a lock. It's where the actual pain lives.
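To make the definitions concrete, here is a small sketch that reads percentiles off raw samples using the nearest-rank method. That method is one common convention among several (many libraries interpolate between samples instead), and both the `percentile` helper and the sample data are hypothetical, made up for illustration.

```python
import math
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical sample of 1,000 requests: mostly fast, a few slow,
# and a handful that are terrible.
latencies_ms = [20] * 970 + [150] * 20 + [2000] * 10

for pct in (50, 99, 99.9):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
print(f"mean: {mean(latencies_ms):.1f} ms")
```

On this sample the mean is 42.4 ms, while p50 is 20 ms, p99 is 150 ms and p99.9 is 2000 ms. The single average hides three quite different experiences.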
There's a second reason the tail matters more than it looks, and it's compounding. A single user action rarely makes one request. It makes ten, or fifty, fanning out to services that each have their own tail. If every call independently has a 1% chance of being slow, the chance that at least one of fifty calls is slow is not 1%. It's 1 - 0.99^50, or nearly 40%. So a p99 that sounds rare at the level of one request becomes routine at the level of one user, and past about seventy calls per action it's more likely than not. The slow tail you waved away turns out to be a large part of what people actually feel.
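The fan-out arithmetic fits in a few lines. This sketch assumes the calls are slow independently of each other, which real systems only approximate (a shared slow dependency makes calls slow together), so treat the output as a rough guide. The function name `p_any_slow` is my own.

```python
def p_any_slow(per_call_slow: float, calls: int) -> float:
    """Probability that at least one of `calls` independent calls is slow,
    given each one is slow with probability `per_call_slow`."""
    return 1 - (1 - per_call_slow) ** calls

# 1% slow per call, at increasing fan-out.
for calls in (1, 10, 50, 100):
    print(f"{calls:>3} calls: {p_any_slow(0.01, calls):.1%} chance of a slow one")
```

At ten calls the chance is a little under 10%, at fifty it is just under 40%, and at a hundred it is above 60%.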
The trap that follows is averaging the percentiles themselves. You cannot take the p99 from ten machines and average them to get a fleet p99. Percentiles don't add up like that, and the result is a number that looks plausible and means nothing. You need to aggregate the underlying distribution, which is why histogram-based tooling exists and why summary-based p99s reported per-instance quietly mislead you the moment you roll them up. The same caution applies across time: you can't average a minute of p99s into an hourly p99 either. Keep the buckets, merge the buckets, then read the percentile off the merged result.
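Here is a sketch of why the averaged number misleads. It assumes two instances reporting counts into the same fixed latency buckets. The bucket bounds, the traffic split and the helper names are all invented for illustration, and the percentile is read as the upper bound of the bucket it falls in, which is cruder than tools that interpolate within the bucket.

```python
import math

# Hypothetical bucket upper bounds in ms, shared by every instance.
BOUNDS_MS = [25, 50, 100, 250, 500, 1000, 2500]

def percentile_from_buckets(counts, pct):
    """Upper bound of the bucket containing the pct-th percentile (nearest rank)."""
    total = sum(counts)
    rank = max(1, math.ceil(pct * total / 100))
    seen = 0
    for bound, count in zip(BOUNDS_MS, counts):
        seen += count
        if seen >= rank:
            return bound
    raise ValueError("empty histogram")

def merge(*histograms):
    """Merge histograms by summing the counts bucket by bucket."""
    return [sum(column) for column in zip(*histograms)]

# Instance A: barely any traffic, all of it fast.
machine_a = [100, 0, 0, 0, 0, 0, 0]
# Instance B: most of the traffic, with 10% of requests landing in the slowest bucket.
machine_b = [9000, 0, 0, 0, 0, 0, 1000]

p99_a = percentile_from_buckets(machine_a, 99)
p99_b = percentile_from_buckets(machine_b, 99)
naive = (p99_a + p99_b) / 2  # the tempting, wrong roll-up
fleet_p99 = percentile_from_buckets(merge(machine_a, machine_b), 99)

print(f"average of p99s: {naive} ms, p99 of merged buckets: {fleet_p99} ms")
```

The averaged figure gives the idle instance the same vote as the busy one and reports 1262.5 ms, a latency no bucket even contains. The merged histogram reports 2500 ms, which is what the fleet's slowest 1% actually saw. The same `merge` step works across time windows as well as across machines.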
So the rule I keep coming back to: look at p50 to understand the typical case, p99 to understand the case that generates complaints, and p99.9 if you're running anything where the long tail compounds across many calls in a single user action. And never, ever let a single average stand in for a distribution. The average is the number that makes a bad service look acceptable, right up until someone pages you about the 2 seconds it forgot to mention.