The alert was not loud. A box that handles a trickle of background work, a few requests a minute at most, was sitting at a steady forty percent CPU. Nothing was on fire. Latency was fine. But forty percent of a core, continuously, on a machine that should be asleep most of the time, is a smell. Something was busy, and nothing should have been.
top was useless in the way top usually is for this. It showed the process eating the CPU, which I already knew, and told me nothing about what inside that process was doing it. So I reached for perf, which is the tool I should reach for first and somehow never do.
perf top, the first honest answer
perf top samples what the CPU is actually executing, right now, and gives you a live ranked list of where the cycles are going. By default it covers the whole system; with -p it narrows to a single process. No instrumentation, no restart, no code change. You point it at a running process and it tells you the truth.
perf top -p $(pgrep -f myservice)
The very top line was a clock-reading call, sitting well above everything else. That is already a strong hint. A genuinely busy service spends its time in its own code or in real I/O. A service stuck spinning spends it in something cheap and repetitive, called far too often. Clock-reading functions near the top of an "idle" process usually mean a loop that keeps checking the time.
flame graphs, the whole picture
perf top told me what was hot. To see how it got there, I recorded a proper profile and turned it into a flame graph, which remains the single best way to understand where a program spends its time.
perf record -F 99 -p $(pgrep -f myservice) -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg
The flame graph made it obvious in a way the text never could. There was one wide plateau, a single stack taking nearly all the on-CPU time. It ran from my service into a client library, into that library's connection logic, and into a retry routine, and it stopped there. A flame graph merges identical stacks, so a tight loop does not appear as many separate towers. It appears as one column stretched almost edge to edge: the signature of the same path being called over and over with no pause between attempts.
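Why a loop shows up as one wide column is easier to see with the folding step written out. This is a minimal sketch of the idea behind stack collapsing, not the actual stackcollapse-perf.pl implementation: each sample is a stack, identical stacks are joined into one "root;...;leaf" line with a count, and the count becomes the width. The function names in the stacks and the 97/3 split are made up for illustration.

```python
from collections import Counter


def fold(samples):
    """Collapse sampled stacks into 'root;...;leaf' -> number of samples."""
    return Counter(";".join(stack) for stack in samples)


def widths(folded):
    """Fraction of all samples that each folded stack accounts for."""
    total = sum(folded.values())
    return {stack: count / total for stack, count in folded.items()}


# Hypothetical samples: 97 land in the retry path, 3 in ordinary work.
retry_stack = ["main", "worker_loop", "client_connect", "retry", "clock_gettime"]
work_stack = ["main", "worker_loop", "handle_job"]
samples = [retry_stack] * 97 + [work_stack] * 3

folded = fold(samples)
shares = widths(folded)

# Widest stack first, as it would dominate the rendered graph.
for stack, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{share:.0%} {stack}")
```

However many times the loop ran, every sample taken inside it folds into the same line, so the graph shows one frame whose width is its share of the samples.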
the actual bug
The cause was dull, which the good ones usually are. A background worker opened a connection to a dependency that was, at that moment, refusing connections. The library's retry logic was meant to back off between attempts. It did not. A configuration default I had not set meant the backoff was effectively zero, so on every failure it immediately tried again, failed again, and tried again, as fast as the CPU would let it. A hot retry loop hammering a closed door, millions of times a minute, doing no useful work whatsoever.
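The shape of the bug can be reproduced in a few lines. This is a sketch under assumptions, not the library's real code: always_refused stands in for the dead dependency, and retry_for is a hypothetical helper that retries until a time budget runs out and reports how many attempts it made. Note that the loop reads the clock on every pass, which is the kind of call that floated to the top of perf top.

```python
import time


def always_refused():
    """Stand-in for a dependency that refuses every connection."""
    raise ConnectionRefusedError("dependency is down")


def retry_for(connect, backoff_seconds, budget_seconds):
    """Retry connect() until the budget elapses; return the attempt count."""
    attempts = 0
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:  # clock read on every iteration
        attempts += 1
        try:
            connect()
            break
        except ConnectionRefusedError:
            # With backoff_seconds == 0 this returns almost immediately,
            # so the loop spins as fast as the CPU allows.
            time.sleep(backoff_seconds)
    return attempts


hot = retry_for(always_refused, backoff_seconds=0.0, budget_seconds=0.2)
calm = retry_for(always_refused, backoff_seconds=0.05, budget_seconds=0.2)
print(f"zero backoff: {hot} attempts, 50 ms backoff: {calm} attempts")
```

The exact counts depend on the machine, but the zero-backoff run makes orders of magnitude more attempts in the same window, and none of them accomplish anything.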
Two things fixed it. First, a real backoff so a refused connection waits before retrying, which is what should have happened from the start.
retry:
  initial_interval: 500ms
  max_interval: 30s
  multiplier: 2.0
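To make those three numbers concrete, here is a sketch of the wait schedule they imply, assuming the usual reading of such settings: start at the initial interval, multiply after each failure, and cap at the maximum. The backoff_delays function is my own illustration, not the library's API, and how the real library applies these values (for example whether it adds jitter) is not something the config alone tells you.

```python
def backoff_delays(initial=0.5, maximum=30.0, multiplier=2.0, attempts=8):
    """Seconds to wait before each of the first `attempts` retries."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, maximum))  # never wait longer than the cap
        delay *= multiplier                 # grow the wait after each failure
    return delays


print(backoff_delays())
# -> [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

Eight failed attempts now take over a minute and a half of wall-clock time, almost all of it spent asleep, instead of a fraction of a millisecond spent spinning.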
Second, a circuit breaker so that after a run of failures the worker stops trying entirely for a while, rather than treating a persistently dead dependency as something worth retrying thousands of times a second. With both in place the box's CPU usage dropped to near zero, which is where it should have been all along.
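A circuit breaker can be very small. The sketch below is a generic version of the pattern, not the one from my service: the class name, the threshold of three failures and the 30 second cooldown are all assumptions chosen for illustration. The clock is injectable so the behaviour can be shown without actually waiting.

```python
import time


class CircuitBreaker:
    """Stop calling a dependency after repeated failures, then probe again."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # Open: only allow a probe once the cooldown has passed.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed probe


# Demonstration with a fake clock (seconds held in a one-element list).
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
print(breaker.allow())  # False: three failures opened the circuit

now[0] = 31.0
print(breaker.allow())  # True: cooldown elapsed, one probe is allowed
```

The worker checks allow() before each attempt and reports the outcome afterwards. A failed probe re-opens the circuit for another cooldown, so a dependency that stays dead costs one attempt per cooldown rather than a continuous stream.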
why perf and not strace
People reach for strace when a process is misbehaving, and for some problems it is the right tool. This was not one of them. strace shows you system calls, with their arguments and return values, which is wonderful when the problem is "what is it asking the kernel for and what is it getting back". But it intercepts every call, which makes a process spinning thousands of times a second crawl, and it would have buried me in a torrent of identical failed connect() calls without ever telling me where in my own code they came from.
perf works differently. It samples. At a fixed rate, 99 times a second with the -F 99 used above, it interrupts the CPU, notes the current stack, and moves on, so the overhead is typically small rather than a tax that changes the behaviour you are trying to observe. That sampling is exactly what builds the flame graph. You are not watching individual events, you are building a statistical picture of where time goes, and for "what is hot" that picture is far more useful than a transcript. The rule I have settled on: if the question is "what is it asking the kernel for", reach for strace. If the question is "where is the time going", reach for perf. This was firmly the second kind.
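It helps to know roughly how many samples a recording should contain, so you can tell whether the picture is statistically meaningful. The arithmetic below is a rough estimate, assuming samples are taken only while the process is actually on a CPU and that the requested frequency is met; expected_samples is a hypothetical helper, not part of perf.

```python
def expected_samples(freq_hz, seconds, cpu_fraction):
    """Rough number of on-CPU samples for a process using part of one core."""
    return round(freq_hz * seconds * cpu_fraction)


# 99 Hz for 30 seconds, on a process using about 40% of one core.
print(expected_samples(99, 30, 0.40))  # -> 1188
```

Around a thousand samples is plenty to see a stack that takes nearly all the time. For a process that is only briefly busy, a longer recording or a higher frequency would be needed to get the same confidence.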
what I keep relearning
The lesson is not really about retries. It is that top tells you a process is busy and almost never tells you why, and that the why is usually one perf invocation away. A spinning loop produces no errors, no latency, no obvious failure. It just quietly costs you a core, and on a fleet of machines that is real money and real heat for literally nothing.
So when a box is busier than its workload can explain, do not stare at graphs and guess. Sample it. perf top for the quick answer, a flame graph for the shape, and nine times out of ten you will find something doing a perfectly pointless thing extremely efficiently. The fix is small. Finding it is the work, and perf does most of that for you if you let it.