the average is fine, the p99 is on fire

[Image: a latency graph on a monitoring dashboard]

A service I look after had a mean response time of 40ms, and a steady stream of complaints that it was slow. Both were true. That is the whole problem with averages: they are the answer to a question nobody asked.

The mean blends the few slow requests into the many fast ones until the slow ones vanish. If 99 requests come back in 20ms and one takes 2 seconds, your average is a respectable 40ms (39.8ms, to be exact) and one in a hundred of your users has just sat watching a spinner. They do not feel the mean. They feel their own request, and a meaningful slice of them are sitting in the tail you have averaged away.
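If you want to check that arithmetic yourself, here is a minimal sketch using only the Python standard library. The variable names are mine, and the 1-second cutoff for "slow" is an arbitrary line I picked for illustration, not anything from the real service.

```python
from statistics import mean, median

# The example from the text: 99 requests at 20 ms, one at 2 s.
latencies_ms = [20] * 99 + [2000]

avg = mean(latencies_ms)      # (99 * 20 + 2000) / 100
mid = median(latencies_ms)    # what the typical request saw
worst = max(latencies_ms)     # what the unlucky user saw

# Share of requests at or above an assumed 1-second "slow" cutoff.
slow_share = sum(1 for x in latencies_ms if x >= 1000) / len(latencies_ms)

print(f"mean={avg:.1f}ms median={mid:.0f}ms max={worst}ms slow={slow_share:.0%}")
```

The mean and the median both look healthy here. Only the maximum and the slow share admit that anybody waited.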

So I stopped looking at the mean and started plotting percentiles. The p50 was indeed lovely. The p99 was a cliff. That cliff turned out to be a connection pool that ran dry under bursts, so most requests sailed through and a steady unlucky minority queued for a connection. You could not see it in the average because the average, by construction, cannot show you that.
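To make the mechanism concrete, here is a toy model of a pool running dry. It is a deliberately crude sketch, not how my service or any real pool library behaves: every number in it is invented, the function name `burst_latencies` is hypothetical, and it assumes each request holds a connection for a fixed time and that requests in a burst arrive all at once.

```python
from statistics import mean

def burst_latencies(burst_size, pool_size, service_ms):
    """Latency of each request in a burst that arrives all at once.

    Toy model: every request holds a connection for service_ms, and
    requests beyond pool_size wait for an earlier wave to finish.
    """
    return [(i // pool_size + 1) * service_ms for i in range(burst_size)]

# Assumed traffic: 19 quiet ticks of 8 requests, then one burst of 25,
# all against a pool of 10 connections with 20 ms of work per request.
ticks = [8] * 19 + [25]
latencies_ms = []
for size in ticks:
    latencies_ms.extend(burst_latencies(size, pool_size=10, service_ms=20))

queued = [x for x in latencies_ms if x > 20]
print(f"requests={len(latencies_ms)} queued={len(queued)} "
      f"mean={mean(latencies_ms):.1f}ms max={max(latencies_ms)}ms")
```

In this made-up traffic only the requests that overflow the pool during the burst pay anything extra, which is why the mean barely moves while the slowest requests take a multiple of the normal time.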

These days the first thing I add to any dashboard is p50, p95 and p99 on the same axis, and I treat the gap between them as the actual health signal. If p50 and p99 track each other, the service is honestly fast or honestly slow. When they diverge, the average is lying to you, politely, the way averages do.
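For a sense of what that gap looks like as numbers, here is a small sketch that computes the three percentiles with the nearest-rank method and compares p99 to p50. The sample data is hypothetical, the `percentile` helper is my own rather than any monitoring tool's API, and real dashboards usually estimate percentiles from histograms instead of sorting raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it (p is an integer 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: mostly fast, a few slow, a handful very slow.
latencies_ms = [20] * 940 + [200] * 40 + [2000] * 20

p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
tail_ratio = p99 / p50  # how far the tail has pulled away from the median

print(f"p50={p50}ms p95={p95}ms p99={p99}ms p99/p50={tail_ratio:.0f}x")
```

How large a ratio counts as "diverging" is a judgement call that depends on the service, so I am not suggesting a threshold here. The point is only that the ratio is something you can compute and watch.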