Ramblings of an aging IT geek
performance

the idle box that wasn't

A supposedly idle server burning a steady chunk of CPU, and how perf top found the kernel function doing it in under a minute.


A box that does almost nothing was sitting at a steady eight percent CPU, all day, every day. Eight percent is not a fire. It is worse: it is just enough to be annoying and not enough to make anyone investigate. So nobody had, for months.

top told me the usual lie, which is that nothing in particular was responsible. The CPU was spread thinly across kernel time with no obvious userspace culprit. When user time is low and system time is high and no single process owns it, you stop looking at processes and start looking at the kernel.
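A rough way to put numbers on that user-versus-system split is to sample the aggregate cpu line in /proc/stat twice and compare the deltas. What follows is a minimal Python sketch of that idea, not a tool from the original investigation. The function names are invented for illustration, and it assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
# Sketch: split CPU time into user/system/softirq shares from two
# samples of the aggregate "cpu" line in /proc/stat (Linux only).
# Function names here are illustrative, not from any library.

FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn 'cpu  1 2 3 ...' into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels expose fewer columns; treat missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_shares(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: b[name] - a[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * d / total for name, d in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as f:
        return f.readline()
```

Take two read_cpu_line() samples a second or so apart and pass them to cpu_shares(). A low user share next to a noticeably higher system, irq or softirq share is the pattern described above.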

perf top -g

Inside ten seconds the top frame was sat in __do_softirq territory, with a fat slice in the network receive path. That pointed at interrupts, so I checked /proc/interrupts and found one NIC queue taking the lot, pinned to a single core, hammering it with a small but relentless stream of packets.
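The counters in /proc/interrupts are cumulative since boot, so the useful view is the difference between two snapshots. The Python sketch below shows one way to find which interrupt line is busiest and which core is taking it. It is an illustration under assumptions: the helper names are made up, and the parser expects the usual layout of a CPU header row followed by "label: counts... description" rows.

```python
# Sketch: diff two snapshots of /proc/interrupts and rank IRQ lines
# by activity, noting which CPU handled most of each one.
# Helper names are illustrative, not from any library.


def parse_interrupts(text):
    """Return {label: (per_cpu_counts, description)} for one snapshot."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        tokens = rest.split()
        counts = []
        # Leading numeric tokens are the per-CPU counters.
        while tokens and len(counts) < ncpu and tokens[0].isdigit():
            counts.append(int(tokens.pop(0)))
        # Rows such as ERR or MIS carry fewer counters; pad with zeros.
        counts += [0] * (ncpu - len(counts))
        table[label.strip()] = (counts, " ".join(tokens))
    return table


def hottest(before, after, top=3):
    """Rank IRQ lines by interrupts received between the two snapshots.

    Each row is (total, label, description, busiest_cpu, share_on_that_cpu).
    """
    old, new = parse_interrupts(before), parse_interrupts(after)
    rows = []
    for label, (counts, desc) in new.items():
        prev = old.get(label, ([0] * len(counts), ""))[0]
        delta = [c - p for c, p in zip(counts, prev)]
        total = sum(delta)
        if total > 0:
            cpu = max(range(len(delta)), key=delta.__getitem__)
            rows.append((total, label, desc, cpu, delta[cpu] / total))
    rows.sort(reverse=True)
    return rows[:top]
```

Feed it two reads of /proc/interrupts taken a few seconds apart. A single row with a large total and a share close to 1.0 on one CPU is the "one queue, one core" shape described above.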

The packets turned out to come from a monitoring agent on another host that was polling, once a second, a port that no longer had a listener, so the box was spending its day politely sending RST replies. Eight percent of a core, to say "no" over and over to something that wasn't listening to the answer.
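For anyone who has not watched this from the client side: a TCP connect to a port with nothing listening gets a RST straight back, which the connecting program sees as "connection refused". The Python sketch below imitates a single poll of the kind such an agent might make. The probe function is hypothetical and is not the agent's actual code.

```python
# Sketch: one TCP "is it up?" poll, classifying the outcome.
# A RST from the target surfaces as ConnectionRefusedError.
import socket


def probe(host, port, timeout=1.0):
    """Return 'open', 'refused' (got a RST) or 'timeout' for one attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
```

An agent that logs "refused" every second and carries on regardless is exactly the sort of thing that can run unnoticed for months, because from its side each poll completes quickly and cleanly.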

I killed the orphaned poll, the graph dropped to a flat line, and the box went back to doing nothing properly. The lesson, again, is that perf top answers "what is the CPU actually doing right now" faster than any amount of staring at process lists, especially when the time is hiding in the kernel where top won't show you the shape of it.