Ramblings of an aging IT geek
performance

perf top on a box that was meant to be idle

An idle server with one core stuck at a steady twelve percent, and the two minutes with perf top that told me where the time was going.

A performance graph on a server console

A box that does almost nothing was sitting at a steady twelve percent on one core. Not enough to alert on, just enough to nag at me, because nothing scheduled on it should cost a permanent chunk of CPU. top showed the load spread thinly across a few processes, which is the least useful answer it can give you.

So I reached for perf top. It samples where the CPU actually is, kernel and userspace, and ranks the hot functions live. Within a few seconds the top entries were a softirq path and __netif_receive_skb, which told me the time was going into the kernel's network receive path, not into anything I had deployed.
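If you want a second opinion on a softirq-heavy profile, the kernel keeps per-CPU softirq counters in /proc/softirqs, and the NET_RX row is the one that goes with a receive path like this. What follows is a minimal sketch, assuming a Linux box with the usual /proc layout. The function names are made up for illustration, and this is not part of perf.

```python
import time


def parse_softirqs(text):
    """Parse the contents of /proc/softirqs into {name: [count per CPU]}."""
    lines = text.strip().splitlines()
    counts = {}
    for line in lines[1:]:  # the first line is the CPU0 CPU1 ... header
        name, _, rest = line.partition(":")
        counts[name.strip()] = [int(n) for n in rest.split()]
    return counts


def softirq_rates(before, after, seconds):
    """Per-CPU events per second for each softirq type between two snapshots."""
    return {
        name: [(b - a) / seconds for a, b in zip(before[name], after[name])]
        for name in after
        if name in before
    }


def read_softirqs(path="/proc/softirqs"):
    """Take one snapshot of the live counters."""
    with open(path) as f:
        return parse_softirqs(f.read())


def watch(interval=2.0):
    """Print softirq rates over one interval, busiest type first."""
    first = read_softirqs()
    time.sleep(interval)
    second = read_softirqs()
    rates = softirq_rates(first, second, interval)
    for name, per_cpu in sorted(rates.items(), key=lambda kv: -sum(kv[1])):
        print(f"{name:>10} " + " ".join(f"{r:10.1f}" for r in per_cpu))
```

Calling watch() on the box prints one row per softirq type. A NET_RX rate that stands out on a single CPU would be consistent with what perf top showed, though it says nothing about which process is causing the traffic.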

The cause was a monitoring agent polling far harder than I had intended, hammering a local endpoint a few hundred times a second because of a units mistake: milliseconds where I had meant seconds. The CPU time was real, just spent waking up to answer pointless requests.
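The arithmetic behind a mistake like this is worth spelling out, because a seconds-versus-milliseconds mix-up is always a factor of a thousand. A small sketch with a hypothetical interval of 5 (an illustrative number, not the value from my agent's config): read as seconds it is one poll every five seconds, read as milliseconds it is 200 polls a second.

```python
def polls_per_second(interval, unit):
    """Requests per second for a single poller, given its interval and unit."""
    units_per_second = {"s": 1, "ms": 1000}
    return units_per_second[unit] / interval


configured = 5  # hypothetical value, chosen only to illustrate the factor

intended = polls_per_second(configured, "s")   # what was meant
actual = polls_per_second(configured, "ms")    # what the agent did with it

print(f"intended: {intended:.1f}/s")
print(f"actual:   {actual:.0f}/s")
print(f"factor:   {actual / intended:.0f}x")
```

Naming the unit in the setting itself, or in the key that holds it, is the cheap defence here, since a bare number in a config file tells you nothing about which one the program expects.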

perf top did not fix anything. It just pointed at the right wall in under two minutes, which on a box where the symptom is "vaguely busy and I have no idea why" is most of the battle. I keep forgetting it exists until something like this, and then I remember why it earns its place.