We had a service with a mean response time of about 40ms, which on paper is lovely, and a steady trickle of complaints that it felt slow, which on paper makes no sense. The dashboard showed a flat, healthy average. The users were not imagining it. The average was the problem, or rather, my trust in it was.
An average flattens everything into one number, and that number is dominated by the many fast requests while it quietly absorbs the few catastrophic ones. If 99 requests take 20ms and one takes 2 seconds, the mean is about 40ms. That mean describes none of the actual experiences. It describes a request that never happened.
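To check that arithmetic, here is a minimal Python sketch using the made-up sample from the paragraph above (99 requests at 20ms, one at 2 seconds). The variable names are illustrative and the data is the hypothetical example, not our production traffic.

```python
import statistics

# Hypothetical sample from the text: 99 fast requests and one slow one (ms).
latencies_ms = [20] * 99 + [2000]

mean_ms = statistics.mean(latencies_ms)      # (99 * 20 + 2000) / 100
median_ms = statistics.median(latencies_ms)  # what the typical request saw
slowest_ms = max(latencies_ms)

# How many requests actually took anywhere near the mean?
near_mean = [x for x in latencies_ms if abs(x - mean_ms) < 5]

print(f"mean={mean_ms}ms median={median_ms}ms max={slowest_ms}ms")
print(f"requests within 5ms of the mean: {len(near_mean)}")
```

The mean lands at 39.8ms, the median at 20ms, and not a single request falls within 5ms of the mean.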
Look at the distribution
The fix is not clever; it's just looking at percentiles instead of the mean. The p50 tells you the typical request. The p99 tells you what the unluckiest one in a hundred requests looks like, and at any reasonable traffic level that's a lot of real people every minute.
Our numbers told the story immediately once I plotted them:
p50: 18ms
p90: 31ms
p99: 840ms
p999: 2.1s
The body of the distribution was genuinely fast. The tail was on fire. And tails matter more than they look, because a single page load fans out into many backend calls, and the slowest call in the set sets the pace. Fan out to ten services and your page is hostage to each one's p99: if the calls are independent, the chance that at least one of them lands in its slowest 1% is 1 - 0.99^10, or a little under 10%. In other words, the page's p90 looks like a single service's p99, which is far worse than any single service's average suggests.
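The fan-out effect is easy to put numbers on. The sketch below assumes the page waits for every backend call and that the calls are slow independently of one another. Both are simplifications, and in practice correlated slowness (a shared database, a GC pause) can change the picture. The function name is mine.

```python
def p_page_hits_tail(fan_out: int, tail_fraction: float = 0.01) -> float:
    """Chance that at least one of `fan_out` independent calls
    lands in the slowest `tail_fraction` of its own distribution."""
    return 1.0 - (1.0 - tail_fraction) ** fan_out

# Assumed fan-out sizes, purely for illustration.
for n in (1, 10, 50, 100):
    share = p_page_hits_tail(n)
    print(f"fan-out {n:>3}: {share:.1%} of page loads wait on a p99-or-worse call")
```

With ten calls, roughly 9.6% of page loads wait on at least one p99-or-worse call. With a hundred calls it is well over half.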
The tail itself turned out to be the boring usual suspects: a connection pool that was a touch too small, so under bursts some requests queued for a free connection, plus occasional garbage collection pauses lining up badly with those bursts. Neither moved the average a millimetre. Both lived entirely in the p99.
The practical change was to alert on p99 rather than the mean, and to make sure the metrics pipeline computed real percentiles rather than averaging values that had already been aggregated per host or per time bucket, which is its own special way of lying. You cannot average percentiles across hosts and get a meaningful number back; you have to merge the underlying samples or histograms and compute the percentile from the merged data. That one catches people constantly.
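Here is a small sketch of why averaging per-host percentiles misleads, and what merging looks like instead. The two hosts and their traffic are invented for illustration, the percentile uses the simple nearest-rank definition, and exact latency values stand in for histogram buckets. A real pipeline would use bucket boundaries and its own interpolation.

```python
import math
from collections import Counter

def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def percentile_from_counts(counts, q):
    """Same nearest-rank percentile, computed from a value -> count histogram."""
    total = sum(counts.values())
    rank = max(1, math.ceil(q * total / 100))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value

# Hypothetical fleet: host_a is busy and healthy, host_b is quieter and struggling.
host_a = [20] * 990 + [100] * 10   # 1000 requests (ms)
host_b = [20] * 80 + [2000] * 20   # 100 requests (ms)

# The tempting shortcut: average the two per-host p99s.
avg_of_p99s = (percentile(host_a, 99) + percentile(host_b, 99)) / 2

# What the fleet actually experienced: p99 over all requests together.
pooled_p99 = percentile(host_a + host_b, 99)

# Histograms merge cleanly because counts are additive across hosts.
merged = Counter(host_a) + Counter(host_b)
merged_p99 = percentile_from_counts(merged, 99)

print(f"average of per-host p99s: {avg_of_p99s}ms")
print(f"pooled p99: {pooled_p99}ms, from merged histogram: {merged_p99}ms")
```

In this example the averaged figure comes out at 1010ms while the true fleet-wide p99 is 2000ms. The merged histogram recovers the pooled answer because counts add up across hosts and percentiles do not.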
I still keep the average on the dashboard. It's a fine summary for capacity trends. But the moment someone says "it feels slow" and the mean says "all is well", the mean is not your witness. Go and ask the p99 what it saw.