Ramblings of an aging IT geek
performance

the average response time that hid a fire

Why mean latency lulled us into thinking a service was healthy while one in a hundred requests was timing out.

A latency graph on a monitoring dashboard

Our dashboard said the average response time was 80ms. Customers said the site was broken. Both were true, and the gap between them is the whole reason percentiles exist.

The average is a liar in a very specific way. It pretends to describe a typical request, but the mean only sits near the typical case when the distribution is symmetric, and latency distributions are not. They have a long right tail: most requests are quick, and a small number are catastrophically slow. Average those together and the slow ones vanish into a comfortable-looking number. Meanwhile the people experiencing the tail are having a genuinely bad time, and they're often your most active users, because the more requests you make the more likely one of them lands in the slow tail.

So we switched what we looked at. Not the mean, but p50, p95, and p99: the values below which 50, 95, and 99 per cent of requests fall.
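To make the definition concrete, here is a minimal sketch of a nearest-rank percentile next to the mean. The sample latencies are made up for illustration (they are not our production numbers), and `percentile` is a hypothetical helper, not something from a monitoring library.

```python
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    # Integer ceil(p * n / 100), so there are no floating-point surprises at the boundary.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Made-up sample: mostly fast requests, a few slow ones, two very slow ones.
latencies_ms = [40] * 93 + [600] * 5 + [4200] * 2

avg = mean(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={avg:.1f}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

On this toy data the mean lands at about 151ms, a number that describes none of the requests: 93 of them took 40ms and two took 4.2 seconds.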

A monitoring dashboard with percentile lines

The numbers told a completely different story. p50 was 40ms, faster than the average had suggested. p95 was 600ms. p99 was 4.2 seconds. That p99 was the fire. One request in a hundred was taking over four seconds, and for a page that makes a dozen backend calls that compounds: if the calls are independent, roughly one page load in nine contains at least one of them. The "average" had drowned that signal in a sea of fast requests.
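The fan-out effect is easy to put a number on. This is a sketch under one loud assumption: the backend calls are independent, so each has the same 1 per cent chance of landing in the slow tail. Real calls are often correlated, so treat the output as a rough guide. The function name is my own.

```python
def p_at_least_one_slow(calls, slow_fraction=0.01):
    """Chance that at least one of `calls` independent requests lands in the slow tail."""
    return 1 - (1 - slow_fraction) ** calls

for calls in (1, 12, 69, 100):
    print(f"{calls:>3} calls: {p_at_least_one_slow(calls):.1%} of page loads hit the tail")
```

For a dozen calls this comes out at a little over 11 per cent, and it keeps climbing with fan-out: under the same assumption it passes one half at 69 calls per page.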

A few things are worth getting right once you start caring about tails.

You can't average percentiles. If service A has a p99 of 100ms and service B has a p99 of 900ms, the p99 of their combined traffic is not 500ms; it depends on how many requests each one serves and on the shape of each tail. The same goes for stitching together per-minute p99s from a single service. Percentiles don't add and they don't average. To get a real number you need the underlying histograms, which is why proper monitoring stores latency as buckets, not as a precomputed average per minute. We were storing averages per minute. That was the original sin: by the time the data hit the dashboard, the tail had already been thrown away and no amount of clever querying could get it back.
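Here is a rough sketch of why buckets survive aggregation and percentiles don't. The bucket bounds, the two services and their traffic are all invented for the example, and the helper names are mine; real monitoring systems have their own bucket layouts and interpolate within buckets rather than reporting the upper bound.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in ms; one extra overflow bucket catches anything slower.
BOUNDS_MS = [50, 100, 300, 900, 5000]

def to_histogram(samples):
    """Count samples per bucket; a sample goes in the first bucket whose bound is >= it."""
    counts = [0] * (len(BOUNDS_MS) + 1)
    for s in samples:
        counts[bisect_left(BOUNDS_MS, s)] += 1
    return counts

def merge(a, b):
    """Histograms with identical bounds merge by adding counts bucket by bucket."""
    return [x + y for x, y in zip(a, b)]

def percentile_upper_bound(counts, p):
    """Upper bound of the bucket holding the p-th percentile (an estimate, not an exact value)."""
    rank = max(1, (p * sum(counts) + 99) // 100)  # integer ceil(p * n / 100)
    seen = 0
    for bound, count in zip(BOUNDS_MS + [float("inf")], counts):
        seen += count
        if seen >= rank:
            return bound

# Invented traffic: A is busy and quick, B is quiet with a nasty tail.
hist_a = to_histogram([40] * 985 + [90] * 15)
hist_b = to_histogram([40] * 90 + [900] * 10)

p99_a = percentile_upper_bound(hist_a, 99)
p99_b = percentile_upper_bound(hist_b, 99)
p99_merged = percentile_upper_bound(merge(hist_a, hist_b), 99)

print(f"A p99 <= {p99_a}ms, B p99 <= {p99_b}ms")
print(f"averaged p99s: {(p99_a + p99_b) / 2}ms, merged histogram p99 <= {p99_merged}ms")
```

In this toy case averaging the two p99s gives 500ms, while the merged histogram puts the combined p99 in the 100ms bucket, because A's traffic swamps B's. Shift the traffic mix and the answer moves, which is exactly the information a precomputed percentile or average no longer carries.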

The fix for the actual slowness, once we could see it, was almost dull. A missing index on one query path that only triggered for accounts above a certain size. Those accounts were rare, hence one in a hundred, and they were our biggest customers, hence the loud complaints. We'd been optimising the median for months and the median was already fine.

The lesson I keep relearning: measure the experience of your unluckiest users, not your typical ones. The average customer is having a lovely time. It's the p99 customer who files the ticket, and they're usually the one paying you the most. Watch the tail, store the histogram, and treat any single number that claims to summarise latency with the suspicion it deserves.