We had a dashboard that said average response time was 40ms, and we had users telling us the thing was occasionally unbearable. Both were true. The average was lying, not on purpose, just by being an average.
The problem with a mean is that it's a single number standing in for a distribution, and latency distributions are not symmetric. Most requests are fast. A few are catastrophic. The mean buries those few in a sea of fast ones, and the few are exactly the requests your users remember. If one request in a hundred takes two seconds and the rest take 20ms, your average is still about 40ms and looks fine, while one user in a hundred is staring at a spinner and deciding you're rubbish.
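To make that arithmetic concrete, here is a small sketch using the made-up numbers from the paragraph above (99 requests at 20ms, one at 2000ms). The sample is illustrative, not real traffic.

```python
from statistics import mean, median

# Hypothetical sample: 99 fast requests and 1 catastrophic one.
latencies_ms = [20] * 99 + [2000]

avg = mean(latencies_ms)      # pulled up slightly by the single outlier
mid = median(latencies_ms)    # what the typical request actually saw
slowest = max(latencies_ms)   # what the unlucky user saw

print(f"mean={avg:.1f}ms median={mid:.1f}ms max={slowest}ms")
```

The mean lands just under 40ms, a number that describes neither the 20ms requests nor the 2000ms one.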
The fix is to stop looking at the average and look at percentiles. p50 is the median, the typical request. p99 is the line below which 99% of requests fall, which is to say it marks where the slowest 1% begins, the tail you can't see otherwise. When I plotted p99 next to the mean, the mean sat flat at 40ms and p99 spiked to nearly two seconds several times an hour. That spike was the whole story. The average had been quietly averaging it away.
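A minimal sketch of computing percentiles from raw samples, assuming the nearest-rank definition (monitoring tools often interpolate instead, so their numbers can differ slightly). The `percentile` helper and the sample window are hypothetical. One wrinkle worth knowing: with exactly one slow request in a hundred, a nearest-rank p99 still lands on the last fast request, so the window below has a slightly fatter tail.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Keep the product in integers before dividing to avoid float rounding at the boundary.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical window: 1000 requests, 15 of which hit a slow path.
window_ms = [20] * 985 + [1900] * 15

avg = sum(window_ms) / len(window_ms)
p50 = percentile(window_ms, 50)
p99 = percentile(window_ms, 99)

print(f"mean={avg:.1f}ms p50={p50}ms p99={p99}ms")
```

Here the mean stays under 50ms while p99 sits at 1900ms, the same shape as the flat line and the spike on the dashboard.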
There's a subtler trap I walked straight into next. Once you have percentiles, you start wanting to aggregate them: roll up p99 across several instances, or across a longer window. You cannot average percentiles. The mean of three instances' p99 values is not the p99 of the combined traffic, and it can be wildly wrong. A p99 of 100ms on a box doing ten requests a second and a p99 of 100ms on a box doing ten thousand are not the same evidence, and averaging them throws away exactly the weighting that matters.
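Here is a sketch of how far off the averaged number can be, using two invented instances: a quiet box with a bad tail and a busy box that is almost entirely fast. The `percentile` helper uses nearest-rank and is an assumption of this example, as are all the traffic figures.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile over raw samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical quiet instance: 100 requests, 5 of them slow.
quiet_ms = [20] * 95 + [2000] * 5
# Hypothetical busy instance: 10,000 requests, 10 of them slow.
busy_ms = [20] * 9990 + [2000] * 10

quiet_p99 = percentile(quiet_ms, 99)
busy_p99 = percentile(busy_ms, 99)

# The tempting but wrong roll-up: average the per-instance percentiles.
averaged_p99 = (quiet_p99 + busy_p99) / 2

# The correct roll-up: pool the raw samples, then take the percentile once.
combined_p99 = percentile(quiet_ms + busy_ms, 99)

print(f"quiet={quiet_p99}ms busy={busy_p99}ms "
      f"averaged={averaged_p99:.0f}ms combined={combined_p99}ms")
```

The averaged figure comes out at 1010ms while the p99 of the pooled traffic is 20ms, because the quiet box contributes half the average but under 1% of the requests. The error can just as easily run the other way.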
The right way is to aggregate the underlying distribution and compute the percentile once, at the end, over everything. In practice that means histograms. Prometheus histogram metrics store cumulative request counts in latency buckets, and histogram_quantile estimates the percentile across whatever set of series you've summed. You sum the buckets first, then take the quantile. The buckets are an approximation, and the boundaries matter, but the approach composes correctly, which the percentiles themselves do not.
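In PromQL this usually takes the shape histogram_quantile(0.99, sum by (le) (rate(<your_metric>_bucket[5m]))), where the metric name and window are placeholders. The Python below is a rough sketch of the same idea, not Prometheus' actual code: it sums cumulative bucket counts across instances, then estimates the quantile by linear interpolation inside the bucket that contains it. The bucket boundaries, the counts, and the function names are all assumptions made for illustration.

```python
# Hypothetical bucket upper bounds in seconds; the last bucket is the catch-all.
BOUNDS = [0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, float("inf")]

def merge(histograms):
    """Sum cumulative bucket counts element-wise across instances."""
    return [sum(counts) for counts in zip(*histograms)]

def estimate_quantile(q, bounds, cumulative):
    """Estimate the q-quantile (0 < q <= 1) from cumulative bucket counts."""
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    target = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    i = next(idx for idx, count in enumerate(cumulative) if count >= target)
    if bounds[i] == float("inf"):
        # No finite upper bound to interpolate towards; fall back to the last finite one.
        return bounds[i - 1]
    lower = bounds[i - 1] if i > 0 else 0.0
    below = cumulative[i - 1] if i > 0 else 0
    in_bucket = cumulative[i] - below
    if in_bucket == 0:
        return lower
    # Assume samples are spread evenly within the bucket.
    return lower + (bounds[i] - lower) * (target - below) / in_bucket

# Hypothetical instances: a quiet box with 5 slow requests in 100,
# and a busy box with 10 slow requests in 10,000.
quiet = [95, 95, 95, 95, 95, 95, 100, 100]
busy = [9990, 9990, 9990, 9990, 9990, 9990, 10000, 10000]

quiet_p99 = estimate_quantile(0.99, BOUNDS, quiet)
fleet_p99 = estimate_quantile(0.99, BOUNDS, merge([quiet, busy]))

print(f"quiet p99 ~ {quiet_p99:.2f}s, fleet p99 ~ {fleet_p99:.3f}s")
```

The estimate can only be as precise as the bucket it lands in. In this sketch the quiet box's slow requests sit somewhere in the 1.0s to 2.5s bucket, so the interpolated p99 is a guess within that range, which is why the boundaries matter.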
A couple of things fell out of this that I now do reflexively. I never put a bare average latency on a dashboard any more; it goes next to p50, p90 and p99 or it doesn't go up at all. And when something is slow, I look at p99 first, because the average will tell me everything is fine right up until I've lost the user.
The mean isn't useless. It's just answering a different question than the one your users are asking. They don't experience your average. They experience their request, and some of them are living in the tail.