the average is lying to you, look at p99

The dashboard said the service was healthy. Average response time 42ms, flat green line, nothing on fire. The support queue said otherwise: a steady trickle of people complaining that the app "sometimes just hangs". Both were true at once, and reconciling them is the whole reason percentiles exist.

The mean is the most comforting and least useful number you can put on a latency graph. It tells you the centre of mass of your distribution, which would be fine if latency were symmetric. It isn't. Latency is a long-tailed, right-skewed thing: a wall of fast requests on the left and a thin, nasty tail stretching off to the right. The tail is where the pain lives, and the average smears it into nothing.

what the numbers actually mean

A percentile is a promise about a fraction of your requests. p99 latency of 900ms means 99% of requests came back faster than 900ms and 1% were slower. That 1% sounds negligible until you do the arithmetic. At a thousand requests per second, p99 of 900ms is ten requests every single second taking nearly a second or worse. Over a day that is the better part of a million slow experiences, and a fair number of those land on the same handful of heavy users who happen to make a lot of requests.
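
If you want to check that arithmetic yourself, here is a minimal sketch. The traffic figure is the one from the paragraph above, and the variable names are mine, made up for illustration.

```python
# Figures from the paragraph above: 1,000 requests per second, and by
# definition 1% of requests are slower than the p99.
requests_per_second = 1_000
slow_fraction = 0.01

slow_per_second = requests_per_second * slow_fraction
slow_per_day = slow_per_second * 60 * 60 * 24

print(f"{slow_per_second:.0f} slow requests per second")  # 10
print(f"{slow_per_day:,.0f} slow requests per day")       # 864,000
```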

There is a famous observation from Google's Jeff Dean about this: if a user request fans out to a hundred backends, and each backend has a p99 of 10ms, then the user waits on the slowest of those hundred calls, and roughly 63% of your user requests will hit at least one of those p99 events, assuming the backends misbehave independently. Tail latency doesn't stay in the tail when you compose services. It propagates upward and gets worse with every hop.
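
The fan-out figure falls out of one line of probability. This is a sketch that assumes every backend call is independent, which real systems only approximate. The function name is hypothetical.

```python
def p_at_least_one_slow(fanout: int, quantile: float = 0.99) -> float:
    """Chance that at least one of `fanout` independent calls is slower
    than the given per-backend quantile."""
    return 1.0 - quantile ** fanout

for n in (1, 10, 100):
    print(n, round(p_at_least_one_slow(n), 3))
# 1 0.01
# 10 0.096
# 100 0.634
```

A 1% problem per backend becomes a majority experience once you fan out to a hundred of them.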

why you can't average percentiles

Here is the mistake I see most often, and one I made myself for years. You have p99 latency reported per minute, and you want the p99 for the hour, so you average the sixty per-minute p99 values. This is wrong, and not slightly wrong. Percentiles are not additive and they are not averageable. The p99 of a combined set is not the mean of the component p99s, because you have thrown away the distribution and kept only a summary of it.
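
To see how far off the average can be, here is a small sketch with an invented hour of traffic: fifty-nine quiet minutes and one busy minute in which half the requests stall. The data and the nearest-rank `percentile` helper are assumptions for illustration, not measurements from a real service.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Invented data: 59 quiet minutes (100 requests each, all 20ms) and one
# busy minute (10,000 requests, half of them stalled at 2000ms).
quiet_minutes = [[20] * 100 for _ in range(59)]
busy_minute = [20] * 5000 + [2000] * 5000
minutes = quiet_minutes + [busy_minute]

per_minute_p99 = [percentile(m, 99) for m in minutes]
averaged_p99 = sum(per_minute_p99) / len(per_minute_p99)

all_samples = [s for m in minutes for s in m]
true_p99 = percentile(all_samples, 99)

print(averaged_p99)  # 53.0
print(true_p99)      # 2000
```

The average gives every minute one vote regardless of how many requests it carried, so the minute that hurt the most users counts for a sixtieth of the answer.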

The only correct way to roll up percentiles over time is to keep the underlying distribution, which in practice means histograms. Prometheus does this well: a histogram metric buckets your observations, and histogram_quantile() computes the percentile across whatever time range you ask for, from the raw bucket counts. You aggregate the buckets, then compute the quantile, never the other way around.

histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

The accuracy of that number is only as good as your bucket boundaries. If your buckets jump straight from 100ms to 1s, your p99 estimate anywhere in that gap is a coarse interpolation and you should not quote it to three decimal places. Choose bucket edges that bracket the latencies you actually care about, then live with the resolution you picked.
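
Here is a rough sketch of the idea in Python: merge the bucket counts first, then estimate the quantile by interpolating linearly inside the bucket it falls into. It is a simplification written for illustration, not Prometheus's actual implementation, and the bucket bounds and counts are made up.

```python
import math

def merge(histograms):
    """Add up cumulative bucket counts from windows with identical bounds."""
    return {le: sum(h[le] for h in histograms) for le in histograms[0]}

def estimate_quantile(q, buckets):
    """Estimate a quantile from {upper_bound_seconds: cumulative_count}.

    Assumes observations are spread evenly inside each bucket, which is
    exactly why wide buckets give coarse answers.
    """
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0:
        return math.nan
    target = q * total
    lower_bound, lower_count = 0.0, 0
    for le in bounds:
        count = buckets[le]
        if count >= target:
            if math.isinf(le):
                return lower_bound  # cannot interpolate into an open-ended bucket
            fraction = (target - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count

# Two made-up one-minute windows with a gap between the 100ms and 1s buckets.
minute_a = {0.05: 900, 0.1: 980, 1.0: 1000, math.inf: 1000}
minute_b = {0.05: 800, 0.1: 950, 1.0: 1000, math.inf: 1000}

merged = merge([minute_a, minute_b])
p99 = estimate_quantile(0.99, merged)
print(f"p99 estimate: {p99:.3f}s")  # roughly 0.743s
```

That 0.743s is a guess drawn on a straight line between 100ms and 1s. The real p99 could sit anywhere in that gap, and the histogram has no way to know.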

the coordinated omission trap

There is a subtler way your latency numbers lie, and it's the one that catches load testing tools red-handed. It's called coordinated omission, and Gil Tene has spent years trying to get people to take it seriously. The short version: most measurement systems only record the latency of requests they actually sent, and they stop sending requests while the system is stalled.

Picture a load generator aiming for one request every 10ms. The server freezes for a second. A naive tool sends one request, waits a full second for it to come back, records that single 1000ms sample, and then carries on. But in that frozen second it should have sent a hundred requests, and every one of them would have experienced somewhere between 1000ms and 10ms of latency depending on when it arrived. Those ninety-nine missing samples were the worst experiences your users would have had, and your tool quietly omitted them because it was politely waiting too.

The effect is that your p99 looks fantastic during exactly the moments your system is behaving worst. The stall gets represented by a handful of samples instead of the flood it deserves, and the percentile maths, fed a censored dataset, reports a tail that simply isn't there. I have watched a benchmark proudly claim a p99.9 of 8ms for a service I could see freezing for whole seconds with my own eyes. Both the tool and the service were doing exactly what they were built to do. The measurement was just structurally dishonest.

The fix is to record latency against when a request should have been sent, not when it actually was, and to back-fill the samples the stall swallowed. Good tooling does this for you now, HdrHistogram and the wrk2 fork of wrk being the obvious examples, but you have to know the trap exists before you'll think to reach for them. Once you've seen it, you stop trusting any latency number that doesn't tell you how it was gathered.
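
The back-filling step can be sketched in a few lines. This is my own simplified take on the idea, assuming a fixed intended send interval; the function names are hypothetical and this is not the API of any of the tools above.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def backfill(recorded_ms, expected_interval_ms):
    """For every sample longer than the intended send interval, add the
    samples that the requests never sent during the stall would have seen."""
    corrected = []
    for value in recorded_ms:
        corrected.append(value)
        missing = value - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)
            missing -= expected_interval_ms
    return corrected

# Made-up run: one request every 10ms, 999 fast responses and one 1s stall.
recorded = [5] * 999 + [1000]
corrected = backfill(recorded, expected_interval_ms=10)

print(len(corrected) - len(recorded))  # 99 back-filled samples
print(percentile(recorded, 99))        # 5
print(percentile(corrected, 99))       # 900
```

The same run goes from a p99 of 5ms to a p99 of 900ms once the ninety-nine swallowed samples are put back.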

look at the shape, not the scalar

Any single number is a lossy compression of the distribution, so the most honest thing you can put on a wall is a heatmap: time on the x-axis, latency buckets on the y-axis, request density as colour. A bimodal distribution, fast cache hits and slow cache misses, shows up instantly as two bands. An average would have parked a meaningless line in the empty space between them, describing a latency that almost no request ever actually had.

When I finally looked at the heatmap for the "healthy" service, the story was obvious. The fast band sat around 30ms. A second, faint band hovered near two seconds and thickened every few minutes. That was a connection pool exhausting and requests queueing for a free connection. The 42ms average had blended the 30ms common case with the rare 2s stall into a number that described neither, and it reassured everyone while a slice of users sat watching a spinner.
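
You can reproduce the effect with a toy sample. The mix below is invented (I picked the proportions so the mean lands near 42ms) and the bucket edges are arbitrary. It builds one column of a heatmap and checks which bucket the mean falls into.

```python
from collections import Counter

# Invented mix: 994 requests at 30ms and 6 stalled at 2000ms.
samples_ms = [30] * 994 + [2000] * 6
mean_ms = sum(samples_ms) / len(samples_ms)

# Arbitrary bucket upper edges for one heatmap column.
bucket_edges_ms = [10, 20, 35, 50, 100, 250, 500, 1000, 2500]

def bucket_for(value_ms):
    """Return the smallest bucket upper edge that contains the value."""
    for edge in bucket_edges_ms:
        if value_ms <= edge:
            return edge
    return float("inf")

column = Counter(bucket_for(v) for v in samples_ms)

for edge in bucket_edges_ms:
    print(f"<= {edge:>4}ms  {column[edge]}")

print(f"mean: {mean_ms}ms")                     # mean: 41.82ms
print(f"requests in the mean's bucket: {column[bucket_for(mean_ms)]}")  # 0
```

Two buckets hold every request, and the bucket containing the mean holds none of them.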

The fix was boring: more connections and a sane timeout. The diagnosis was the hard part. So pick p99 and p99.9 as your headline latency SLOs, not the mean. Keep histograms so you can roll those percentiles up correctly. And put a heatmap somewhere you'll see it, because the shape of the distribution will tell you what's wrong long before any single scalar admits there's a problem at all.