the average that hid the outage

[Figure: a latency graph with a calm average line hiding a spiky tail]

A graph told me the service had an average response time of 40 milliseconds. The service was also, by any sane definition, on fire. Both statements were true at once, which is the entire problem with averages and the reason I have spent years gently nagging people to stop putting the mean on the dashboard.

Latency is not normally distributed. It has a floor, because nothing returns in negative time, and a long ragged tail, because anything can go slow: a cache miss, a GC pause, a slow disk, a retry, a noisy neighbour. So you get a great heap of fast responses and a thin smear of slow ones stretching off to the right. The mean takes that whole lopsided shape and squashes it into one number that sits somewhere nobody actually lives. Your typical request is faster than the average. Your worst requests are far, far slower. The average describes neither.

[Figure: a histogram showing a tight cluster of fast responses and a long slow tail]

What you want is percentiles. The p50, the median, tells you where the middle request really sits. The p99 tells you what your unlucky one-in-a-hundred experiences. On that 40ms-average service the p50 was about 12ms and the p99 was just over two seconds. Two seconds! One request in a hundred was waiting two seconds or longer, and at the volume we were running, "one in a hundred" was thousands of people an hour having a thoroughly bad time, completely invisible behind a reassuring average.
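To make the gap between the mean and the percentiles concrete, here is a minimal sketch in Python. The latency sample is invented to mimic the shape described above (it is not data from the real service), and `percentile` is a simple nearest-rank helper assumed for illustration, not the method any particular monitoring tool uses.

```python
import math
from statistics import fmean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Invented latencies in milliseconds, for illustration only:
# a big heap of fast responses, some middling ones, and a thin slow tail.
latencies_ms = [12.0] * 900 + [75.0] * 89 + [2050.0] * 11

summary = {
    "mean": fmean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}

for name, value in summary.items():
    print(f"{name}: {value:.0f} ms")
```

With this made-up sample the mean comes out at about 40 ms, while the p50 is 12 ms and the p99 is 2050 ms. Eleven slow requests out of a thousand are enough to produce that shape, and the mean gives no hint of them.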

The reason this matters so much in practice is that users do not experience your median, they experience their own requests, and a single page often makes many. Load a screen that fans out to twenty backend calls and the chance that at least one of them hits your p99 is not one percent. If the calls are roughly independent it is 1 - 0.99^20, about 18 percent, which is closer to one in five. Tail latency compounds. The slow path you dismissed as a rounding error becomes the common case for any user doing something non-trivial.
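The fan-out arithmetic is short enough to check by hand, but a small sketch shows how quickly it grows. This assumes each backend call is independent and has the same latency distribution, which real systems only approximate. The function name `chance_of_hitting_tail` is made up for this example.

```python
def chance_of_hitting_tail(calls, percentile=99.0):
    """Probability that at least one of `calls` independent requests
    is slower than the given latency percentile."""
    p_fast = percentile / 100       # chance a single call stays under the percentile
    return 1 - p_fast ** calls      # complement of "every call was fast"


for calls in (1, 5, 20, 100):
    print(f"{calls:>3} calls: {chance_of_hitting_tail(calls):.1%}")
```

For twenty calls this gives roughly 18 percent, and for a hundred calls it is above 60 percent. Correlated slowness (a shared GC pause, one overloaded dependency) changes the exact numbers, so treat this as a rough guide rather than a prediction.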

So, a few habits I now treat as non-negotiable. Put percentiles on the dashboard (p50, p95, p99) and retire the average to wherever old metrics go to die. Be suspicious of any latency figure quoted as a single number, and ask "which percentile" until someone tells you. And when you do optimise, optimise the tail, because shaving the median makes a nice graph and shaving the p99 makes users stop complaining.

The average is not lying exactly. It is answering a question nobody asked. "What is the mean of these numbers" is arithmetic. "How bad is it for the people having a bad time" is the question you actually care about, and only the percentiles will tell you.