A mean response time of 40ms looks lovely on a slide. It is also nearly useless. Averages are wonderful at hiding the thing you actually care about, which is the request that took two seconds while a customer stared at a spinner and decided we were rubbish.
The maths is unforgiving. If 99 requests come back in 20ms and one takes 2000ms, your average is about 40ms and everyone nods. But one in a hundred people just had a bad time, and at any real traffic level that is a lot of people per minute. The average smeared that pain across the happy majority and made it disappear.
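If you want to check that sum yourself, here it is in a few lines of Python. The figures are the ones from the paragraph above; the variable names are mine and purely illustrative.

```python
# 99 fast requests at 20ms and one slow request at 2000ms,
# the same figures as in the example above.
latencies_ms = [20] * 99 + [2000]

# The mean that ends up on the slide.
mean_ms = sum(latencies_ms) / len(latencies_ms)

# The share of requests that were actually slow (1000ms is an
# arbitrary threshold chosen for this illustration).
slow_fraction = sum(1 for ms in latencies_ms if ms >= 1000) / len(latencies_ms)

print(f"mean: {mean_ms:.1f}ms")               # mean: 39.8ms
print(f"slow requests: {slow_fraction:.0%}")  # slow requests: 1%
```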
So I stopped quoting the mean. The p99 is the honest number: the value that 99% of requests come in under. Watch p50, p95 and p99 together and the shape tells you the story. When p50 is flat but p99 is climbing, you do not have a slow service, you have a service with a tail, and the tail is usually a lock, a cold cache, a GC pause, or one downstream dependency having a wobble.
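A rough sketch of what that looks like in code, using the nearest-rank definition of a percentile. The sample latencies are invented for illustration, and nearest-rank is only one of several definitions in use; your metrics tool may well interpolate and give slightly different numbers.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of all samples are less than or equal to it. p is an integer 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up sample: mostly fast, a few middling, a slow 2% tail.
latencies_ms = [20] * 940 + [80] * 40 + [2000] * 20

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(f"mean={mean_ms:.0f}ms p50={p50}ms p95={p95}ms p99={p99}ms")
# mean=62ms p50=20ms p95=80ms p99=2000ms
```

The median says everything is fine, the mean says things are a little slow, and only the p99 shows that some requests took two seconds.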
The trap is that you cannot average percentiles. You cannot take the p99 from each of five boxes and average them to get a fleet p99; that number means nothing. You need the underlying histogram, which is why I now feed everything into something that aggregates buckets rather than pre-computed quantiles. Once you can ask "what did the slowest 1% actually experience" and get a true answer, you stop arguing about whether the service is fast and start fixing the part that is slow.
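Here is a small sketch of why the averaging goes wrong. The bucket boundaries, the five boxes and their counts are all invented, and `histogram_quantile` is a hypothetical helper of mine that simply reports the upper bound of the bucket containing the target rank, which is cruder than what a real metrics system does.

```python
# Hypothetical bucket upper bounds in milliseconds, shared by every box.
BOUNDS_MS = [25, 50, 100, 250, 500, 1000, 2500]


def histogram_quantile(counts, p):
    """Upper bound of the bucket holding the p-th percentile (p is an int 1..100)."""
    total = sum(counts)
    rank = max(1, (p * total + 99) // 100)  # integer ceiling of p * total / 100
    running = 0
    for bound, count in zip(BOUNDS_MS, counts):
        running += count
        if running >= rank:
            return bound
    raise ValueError("empty histogram")


# Invented data: four healthy boxes and one box with a nasty tail.
healthy = [1000, 0, 0, 0, 0, 0, 0]
wobbly = [900, 0, 0, 0, 0, 0, 100]
per_box = [healthy, healthy, healthy, healthy, wobbly]

# The wrong way: average the five per-box p99 values.
box_p99s = [histogram_quantile(counts, 99) for counts in per_box]
averaged_p99 = sum(box_p99s) / len(box_p99s)

# The right way: add the buckets together, then take one p99 over the fleet.
merged = [sum(column) for column in zip(*per_box)]
fleet_p99 = histogram_quantile(merged, 99)

print(f"averaged p99: {averaged_p99:.0f}ms")  # averaged p99: 520ms
print(f"fleet p99:    {fleet_p99}ms")         # fleet p99:    2500ms
```

With these made-up numbers the four healthy boxes drag the averaged figure down to 520ms, while the merged histogram puts the fleet p99 in the 2500ms bucket.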