A dashboard told me our average response time was 40ms and everyone was delighted. Users, meanwhile, were complaining the service felt slow. Both were correct, which is the whole problem with averages.
An average flattens the tail. If 99 requests take 20ms and one takes 2 seconds, your mean is a perfectly cheerful 40ms (39.8ms, to be exact), and one request in a hundred was a genuinely bad time for whoever made it. It compounds, too: on a page that fires twenty independent backend calls, the chance that at least one of them lands in that "one in a hundred" tail is 1 - 0.99^20, about 18%, or roughly one page load in five. The mean says fine; the users say not fine; the users are right.
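If you want to check that arithmetic yourself, here is a minimal Python sketch. It uses the numbers from the example above and assumes the twenty backend calls are independent of each other, which real calls often are not; the variable names are just mine.

```python
from statistics import mean

# The example from the text: 99 fast requests and one slow one (milliseconds).
latencies_ms = [20] * 99 + [2000]

avg_ms = mean(latencies_ms)  # (99 * 20 + 2000) / 100

# Chance that a page firing 20 backend calls hits at least one
# 1-in-100 slow call, assuming the calls are independent:
# 1 - P(all 20 calls are fast).
calls_per_page = 20
p_slow_call = 1 / 100
p_slow_page = 1 - (1 - p_slow_call) ** calls_per_page

print(f"mean: {avg_ms:.1f} ms")
print(f"page loads with at least one slow call: {p_slow_page:.1%}")
```

The mean prints as 39.8 ms and the slow-page share as 18.2%, so "40ms" and "one in five" are both fair roundings.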
So I stopped looking at the average and started looking at percentiles. p50 (the median) tells you the typical experience. p99 is the latency that the slowest one percent of requests met or exceeded, so it's the best case for your unluckiest users, and it's the number that drives the complaints. Ours was over a second while the mean sat smugly at 40ms. The gap between p50 and p99 is the shape of your suffering, and the average hides it on purpose.
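To make the gap concrete, here is a rough sketch of computing percentiles by the nearest-rank method. The traffic is made up (983 fast requests and 17 slow ones, chosen so the mean lands near 40ms), and `percentile` is my own helper, not something from a monitoring library; real systems usually estimate percentiles from histograms rather than sorting raw samples.

```python
import math
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical traffic: 983 fast requests and 17 slow ones (milliseconds).
latencies_ms = [20] * 983 + [1200] * 17

print(f"mean: {mean(latencies_ms):.1f} ms")
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

On this made-up sample the mean comes out at about 40ms, p50 and p95 are both 20ms, and p99 is 1200ms: the same cheerful average, with a very different story in the tail.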
Plot the percentiles, not the mean: p50, p95, p99, and ideally p99.9 if you have the request volume (with fewer than a thousand requests in a window, p99.9 is effectively just the slowest request). The moment I did, the cause was obvious: a tail of slow requests hitting a cold cache path, completely invisible to the average that everyone had been so happy about.
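For the plotting itself, what you need is one set of percentiles per time window. Below is a sketch of that step using only the standard library. The `(timestamp, latency)` input shape, the 60-second window, and the `percentile_series` name are all assumptions for illustration; most metrics systems will compute this for you from histograms.

```python
from collections import defaultdict
from statistics import quantiles

def percentile_series(samples, window_s=60):
    """Group (timestamp_s, latency_ms) samples into fixed windows and
    return {window_start: (p50, p95, p99)}, ready to hand to a plotter."""
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        buckets[int(ts // window_s) * window_s].append(latency_ms)

    series = {}
    for start, values in sorted(buckets.items()):
        if len(values) < 2:
            continue  # quantiles() needs at least two data points
        # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
        cuts = quantiles(values, n=100, method="inclusive")
        series[start] = (cuts[49], cuts[94], cuts[98])
    return series

# Hypothetical data: a healthy minute, then a minute with a slow tail.
samples = [(i * 0.5, 20) for i in range(100)]
samples += [(60 + i * 0.5, 1500 if i >= 95 else 20) for i in range(100)]

for start, (p50, p95, p99) in percentile_series(samples).items():
    print(f"t={start:>3}s  p50={p50:.0f}  p95={p95:.0f}  p99={p99:.0f}")
```

In the second window p50 stays at 20ms while p99 jumps to 1500ms, which is exactly the kind of divergence a single mean line would smooth over. Note that `quantiles` interpolates between samples, so its numbers can differ slightly from a nearest-rank calculation.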