the average was fine and the customers were furious

A service I looked after had an average response time of 40 milliseconds, and a steady trickle of complaints that it was slow. Both were true. That is the whole problem with averages, in one sentence, and it took me an embarrassingly long time to stop arguing with the complaints and start believing them.

The average is a liar by construction. It takes a request that returned in 8ms and a request that returned in 3,000ms and tells you the typical experience was 1.5 seconds, which describes neither user. Worse, when most of your traffic is fast, a genuinely awful tail gets diluted into invisibility. Ten thousand requests at 20ms and a hundred at 4 seconds average out to about 59ms, which still looks healthy on a dashboard. But that hundred is a hundred real people, and they are the ones writing in.
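The dilution is easy to check for yourself. This is a minimal sketch using the illustrative numbers from the paragraph above, not real traffic; the 1-second threshold for "slow" is my own assumption.

```python
from statistics import mean

# Illustrative traffic: 10,000 fast requests at 20 ms, 100 slow ones at 4,000 ms.
latencies_ms = [20] * 10_000 + [4_000] * 100

avg_ms = mean(latencies_ms)

# Assumed threshold: call anything at or over one second "slow".
slow_share = sum(1 for x in latencies_ms if x >= 1_000) / len(latencies_ms)

print(f"mean latency: {avg_ms:.1f} ms")
print(f"share of requests at or over 1 s: {slow_share:.2%}")
```

The mean comes out a little under 60ms, while roughly one request in a hundred took four seconds.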

The fix is not clever. It is just looking at the right number: percentiles. p50 is the median, the genuinely typical request. p99 is the value that only one request in a hundred is slower than, which is where the pain lives. When I plotted them as separate lines instead of a single averaged smear, the story changed completely. p50 sat happily at 18ms. p99 was 2.4 seconds and spiky. The service was fast for almost everyone and occasionally dreadful, and the average had been blending those two realities into a comfortable fiction.
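For anyone who wants to see the same effect without a metrics stack, here is a small sketch. The `percentile` helper is my own nearest-rank implementation rather than anything from a library, and the sample data is invented to resemble the shape described above, not taken from the real service.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it (p is 0-100)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Invented data: 989 requests at 18 ms and 11 stragglers at 2,400 ms.
latencies_ms = [18] * 989 + [2_400] * 11

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean lands in the mid-40s, a value that no request in the sample actually took, while p50 and p99 each describe a real group of users.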

A few things I had to internalise once I started thinking in percentiles.

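You cannot average percentiles. If host A has a p99 of 100ms and host B has a p99 of 900ms, the fleet p99 is not 500ms; it depends on how much traffic each host served and how its latencies were distributed. The honest way to get it is to merge the underlying distributions first and take the percentile second, which in Prometheus means running histogram_quantile over the buckets summed across hosts. Even that is an estimate limited by where the bucket boundaries sit, but it is an estimate of the right thing. Averaging the per-host p99s is a tempting lie that gives you a number nobody actually experienced.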
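To make that concrete, here is a rough sketch of merge-then-quantile with two hypothetical hosts. The bucket bounds and counts are invented, and `bucket_quantile` is a simplified stand-in of my own: it returns the upper bound of the bucket the rank falls in, whereas Prometheus' histogram_quantile also interpolates within the bucket.

```python
import math
from collections import Counter

def bucket_quantile(buckets, p):
    """Upper bound (ms) of the bucket containing the p-th percentile.

    `buckets` maps each bucket's upper bound in ms to the number of
    requests that landed in it (per-bucket counts, not cumulative).
    """
    total = sum(buckets.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = math.ceil(p * total / 100)
    seen = 0
    for upper_bound in sorted(buckets):
        seen += buckets[upper_bound]
        if seen >= rank:
            return upper_bound
    raise AssertionError("unreachable: rank never exceeds total")

# Hypothetical hosts sharing one bucket layout; host A serves 9x the traffic.
host_a = {50: 8_900, 100: 100, 200: 0, 900: 0}
host_b = {50: 0, 100: 0, 200: 980, 900: 20}

p99_a = bucket_quantile(host_a, 99)
p99_b = bucket_quantile(host_b, 99)
averaged_p99 = (p99_a + p99_b) / 2

# Sum the bucket counts across hosts first, then take the quantile.
fleet = Counter(host_a) + Counter(host_b)
fleet_p99 = bucket_quantile(fleet, 99)

print(f"host A p99={p99_a} ms, host B p99={p99_b} ms")
print(f"average of p99s={averaged_p99:.0f} ms, merged fleet p99={fleet_p99} ms")
```

With these made-up counts the per-host p99s are 100ms and 900ms, their average is 500ms, and the merged fleet p99 falls in the 200ms bucket, because host A carries most of the traffic.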

The tail is usually a different problem from the body. My p50 was about the code path. My p99 was about garbage collection pauses and a connection pool that ran dry under bursty load. Optimising the average would have meant shaving milliseconds off the fast path that was already fine. Optimising the tail meant fixing the pool, which is what the customers had been telling me all along.

And every layer adds its own tail. If a page makes ten backend calls and each independently has a 1% chance of being slow, the chance that at least one of them is slow, and so the page is slow, is about 9.6%, or close to one page load in ten. Tail latency compounds, which is why p99 at the service level can feel like p90 to a user clicking through a real workflow.
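The compounding is plain probability, and a few lines of Python show how quickly it grows. This sketch assumes the calls are slow independently of one another, which real backends sharing a database or a garbage collector often are not; `chance_any_slow` is a name I made up for the illustration.

```python
def chance_any_slow(calls: int, p_slow: float) -> float:
    """Probability that at least one of `calls` independent requests is slow,
    given that each one is slow with probability `p_slow`."""
    return 1 - (1 - p_slow) ** calls

# Each backend call sits above its own p99 with probability 1%.
for calls in (1, 10, 50, 100):
    print(f"{calls:>3} calls -> {chance_any_slow(calls, 0.01):.1%} of pages hit the tail")
```

At ten calls the figure is about 9.6%, and at a hundred calls most page loads include at least one request from the tail.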

The mean is a perfectly good number for capacity planning and a perfectly useless one for understanding user pain. If you only have room for one latency line on a dashboard, make it p99, not the average. The average will always tell you everything is fine. The customers, and p99, know better.