The box was meant to be quiet. It serves a handful of static files and runs a couple of small daemons, the sort of machine you forget exists until a graph nudges you. The graph that nudged me showed one core sitting at a steady 100% while the other seven dozed. Load average just above 1.0, which on an eight-core machine is almost a rounding error, easy to ignore. But a pegged core on an idle host means something is in a hot loop, and a hot loop is rarely doing anything you asked for.
top told me what I half expected: nothing useful. CPU usage was spread thinly across a dozen processes, none of them obviously the culprit, because the time was being burned in the kernel rather than in any one process, and top's per-process view doesn't show you that clearly. So I reached for perf.
sudo perf top -g
perf top is top for functions. Instead of processes it shows you the symbols where the CPU is actually spending cycles, sampled live, kernel and userspace together. Within a second or two the top line was unambiguous: the machine was spending the bulk of its time in __do_softirq and the networking receive path. Not application code at all. The CPU was busy handling packets.
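A cross-check that doesn't need perf: /proc/stat keeps cumulative per-CPU time counters, and the seventh counter after the CPU name is softirq time. The sketch below is illustrative rather than a record of what was run on the day. softirq_share is a made-up helper name, and it assumes two snapshots of /proc/stat saved to files a second or so apart.

```shell
# softirq_share BEFORE AFTER
# Given two saved snapshots of /proc/stat, print each CPU's softirq share of
# the time that elapsed between them, as a truncated whole-number percentage.
# (Hypothetical helper, for illustration only.)
softirq_share() {
    awk '
        # Per-CPU lines: cpuN user nice system idle iowait irq softirq steal ...
        /^cpu[0-9]+ / {
            total = 0
            for (i = 2; i <= 9; i++) total += $i    # user through steal
            if (FNR == NR) {                        # first file: the baseline
                base_total[$1] = total
                base_soft[$1]  = $8
            } else if ($1 in base_total) {          # second file: the delta
                dt = total - base_total[$1]
                ds = $8 - base_soft[$1]
                printf "%s %d\n", $1, (dt > 0 ? 100 * ds / dt : 0)
            }
        }
    ' "$1" "$2"
}

# Usage (paths are placeholders):
#   cat /proc/stat > /tmp/stat.1; sleep 1; cat /proc/stat > /tmp/stat.2
#   softirq_share /tmp/stat.1 /tmp/stat.2
```

On a host in the state described here, one cpuN line would be expected to show a high number while the rest sit near zero.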
That reframed the whole thing. A core burning in softirq on the network receive path means a flood of packets arriving, all landing on the same queue and therefore the same CPU. So I stopped looking at processes and started looking at the wire:
sudo tcpdump -ni eth0 -c 2000 | \
awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head
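A variant of the same idea also tallies the destination port, which helps when the question is not just who is sending but what they are sending to. This is a sketch under assumptions: src_port_counts is a hypothetical helper name, it expects the text output of tcpdump -n on stdin, and it only understands IPv4 lines that carry a port (so ICMP, ARP, and IPv6 are skipped).

```shell
# src_port_counts: read `tcpdump -n` text on stdin and print
# "count source_ip destination_port", busiest pairs first.
# (Hypothetical helper, for illustration only.)
src_port_counts() {
    awk '
        # Expected shape: TIME IP SRC.PORT > DST.PORT: ...
        $2 == "IP" && $4 == ">" {
            src = $3; sub(/\.[0-9]+$/, "", src)     # drop the source port
            dst = $5; sub(/:$/, "", dst)            # drop the trailing colon
            n = split(dst, parts, /\./)
            if (n == 5) count[src " " parts[5]]++   # 4 octets plus a port
        }
        END { for (k in count) print count[k], k }
    ' | sort -rn | head
}

# Usage (interface name is a placeholder):
#   sudo tcpdump -ni eth0 -c 2000 | src_port_counts
```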
One source address accounted for nearly all of it. A misconfigured monitoring agent elsewhere on the network had got itself into a retry storm and was hammering a port that this box wasn't even listening on. Every packet still cost us an interrupt, a trip up the receive path, and a quiet RST on the way out. Thousands per second, all pinned to one core because the NIC's receive steering put that flow on a single queue.
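The "all pinned to one core" part can be checked from /proc/softirqs, which lists per-CPU counts for each softirq type, including NET_RX. A minimal sketch follows; net_rx_by_cpu is a hypothetical helper name, and the counts it prints are cumulative since boot, so a lopsided total is a hint rather than proof of what is happening right now.

```shell
# net_rx_by_cpu: read /proc/softirqs-format text on stdin and print
# "count CPUn" for the NET_RX row, highest count first.
# (Hypothetical helper, for illustration only.)
net_rx_by_cpu() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) cpu[i] = $i; next }   # header: CPU0 CPU1 ...
        $1 == "NET_RX:" {
            # Field 1 is the row label, so count i belongs to header column i - 1.
            for (i = 2; i <= NF; i++) print $i, cpu[i - 1]
        }
    ' | sort -rn
}

# Usage:
#   net_rx_by_cpu < /proc/softirqs
```

Running it twice a few seconds apart and comparing the numbers gives a better picture of which CPU is taking the receive work at the moment.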
The fix was on the other machine, not this one. I dropped the traffic at the firewall as a stopgap, the core went cold within seconds, and then I went and fixed the agent that was actually misbehaving.
The lesson I keep relearning: top tells you which process, perf top tells you which function, and sometimes the function is the kernel telling you the problem isn't on this box at all. If a core is hot and your own code isn't on the perf list, stop reading your application and start reading the network.