Ramblings of an aging IT geek
performance

perf top on a box that should have been idle

Tracking down a steady 30% CPU on an idle homelab VM using perf top, and the unglamorous culprit behind it.

I have a small VM that does almost nothing. It runs a couple of cron jobs, holds a syncthing instance, and is otherwise asleep. So I was surprised when Grafana showed it sitting at a steady 30% CPU at three in the morning, with nothing scheduled, no users, and no obvious reason.

top told me the time was going to kworker threads and, unhelpfully, to nothing else I recognised. Load average sat around 1.2 on a two-core box. Not on fire, but not idle either, and "not idle" on a box that should be idle is exactly the sort of thing that nags at me until I find it.

reach for perf

When top won't tell you why a kernel thread is busy, perf top will. It samples the CPU and shows you which functions are actually burning cycles, kernel and userspace together.

sudo perf top -g

Straight away the top of the list was full of softirq and timer functions. __do_softirq, rcu_core, and a long tail of network receive paths. That pointed at interrupts rather than any user process, which is why top looked so innocent: the work wasn't attributed to anything I'd think to look at.
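If you want a number to go with that hunch, the aggregate cpu line in /proc/stat keeps a running softirq counter. What follows is a minimal sketch rather than anything from the original investigation: the function name softirq_share is made up for illustration, and it assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# softirq_share BEFORE AFTER
# Each argument is one aggregate "cpu ..." line copied from /proc/stat.
# Prints the percentage of elapsed CPU time spent in softirq between the two.
# (Hypothetical helper name; field order assumed to be the usual Linux layout.)
softirq_share() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " "); split(b, y, " ")
    # x[1] is the literal "cpu"; x[2]..x[9] are user nice system idle
    # iowait irq softirq steal. Guest columns are skipped because they
    # are already included in user/nice.
    for (i = 2; i <= 9 && i <= n; i++) total += y[i] - x[i]
    soft = y[8] - x[8]
    if (total <= 0) { print "0.0"; exit }
    printf "%.1f\n", 100 * soft / total
  }'
}

# Usage sketch (the 5 second window is an arbitrary choice):
#   before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
#   softirq_share "$before" "$after"
```

A high figure here on an otherwise quiet box would be consistent with what perf top shows, but it only tells you how much time went to softirq, not which softirq.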

following the interrupts

If it's softirq and network receive, the next stop is the interrupt counts.

watch -n1 'grep -E "eth|virtio" /proc/interrupts'

The virtio network interrupt was ticking over far faster than the near-zero traffic justified. Something was hammering the interface with tiny packets. iftop showed barely any throughput in bytes, but the packet-per-second count was high. Lots of packets, almost no data: classic chatter.
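The "lots of packets, almost no data" pattern can be put into numbers from the kernel's per-interface counters. This is a rough sketch under assumptions: the helper name rx_rate is hypothetical, and the interface name eth0 in the usage comment is a placeholder for whatever the VM actually calls its NIC.

```shell
# rx_rate PKTS_BEFORE BYTES_BEFORE PKTS_AFTER BYTES_AFTER SECONDS
# Prints receive packets per second and the average bytes per packet
# over the sampling window. (Hypothetical helper for illustration.)
rx_rate() {
  awk -v p0="$1" -v b0="$2" -v p1="$3" -v b1="$4" -v s="$5" 'BEGIN {
    if (s <= 0) exit 1                 # refuse a zero-length window
    dp = p1 - p0; db = b1 - b0
    avg = (dp > 0) ? db / dp : 0       # avoid dividing by zero packets
    printf "%d pps, %d bytes/packet\n", dp / s, avg
  }'
}

# Usage sketch (eth0 is an assumed interface name):
#   d=/sys/class/net/eth0/statistics
#   p0=$(cat "$d/rx_packets"); b0=$(cat "$d/rx_bytes"); sleep 5
#   rx_rate "$p0" "$b0" "$(cat "$d/rx_packets")" "$(cat "$d/rx_bytes")" 5
```

A high packet rate combined with an average size near the minimum frame size is the signature of chatter: connection setup and teardown rather than payload.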

A quick tcpdump and there it was. An old monitoring agent I'd forgotten about was polling a dead endpoint, retrying instantly on connection refused, hundreds of times a second. No backoff. It had been doing this for weeks. The connection failed in microseconds, so it just tried again, and again, and the cost was all in the kernel handling the connection churn rather than in the agent's own CPU column.
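The post doesn't record the exact tcpdump invocation, so the following is only one plausible way to spot this kind of churn: capture bare SYN packets and count them per destination. The helper name count_syns, the interface eth0 and the capture size are all assumptions, and the parser only handles the default IPv4 text output of tcpdump -nn.

```shell
# count_syns: read `tcpdump -nn` text output on stdin and print
# "<count> <dst_ip.port>" for every destination that was sent a bare SYN,
# busiest first. IPv6 lines (tagged "IP6") are ignored in this sketch.
count_syns() {
  awk '$2 == "IP" && $6 == "Flags" && $7 == "[S]," {
         dst = $5; sub(/:$/, "", dst)   # strip the trailing colon
         n[dst]++
       }
       END { for (d in n) print n[d], d }' | sort -rn
}

# Usage sketch (needs root; eth0 and the 2000-packet cap are assumptions):
#   sudo tcpdump -nn -i eth0 -c 2000 \
#     'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' | count_syns
```

One destination collecting hundreds of SYNs in a few seconds, each answered by a reset, is what a no-backoff retry loop against a dead endpoint would be expected to look like on the wire.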

sudo systemctl stop ancient-metrics-agent
sudo systemctl disable ancient-metrics-agent

CPU dropped to the floor immediately. The graph went flat and stayed flat.

the takeaway

Two things stuck with me. First, perf top is the tool I reach for far too late: when top shows the kernel busy and won't say why, it answers the question in about ten seconds. Second, a tight retry loop with no backoff is one of the quietest ways to waste a machine. It doesn't show up where you'd look, it doesn't throw errors anyone reads, and it'll happily sit there at 30% for a month until someone notices the graph. Always add a backoff. Even a crude one. Your future self, squinting at an idle box that isn't idle, will thank you.
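For what "even a crude one" might look like, here is a minimal sketch of capped exponential backoff in shell. It is not the agent's code (which I never saw fixed, only stopped); the function names, the 1 second base and the 60 second cap are illustrative assumptions.

```shell
# backoff_delay ATTEMPT [BASE] [CAP]
# Prints the delay in seconds before retry number ATTEMPT (counting from 0):
# BASE doubled ATTEMPT times, never more than CAP.
backoff_delay() {
  local attempt="$1" base="${2:-1}" cap="${3:-60}"
  local delay="$base" i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then delay="$cap"; fi
  echo "$delay"
}

# retry_with_backoff MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS tries have failed,
# sleeping 1s, 2s, 4s, ... (capped at 60s) between tries.
retry_with_backoff() {
  local max="$1" attempt=0
  shift
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max" ]; then return 1; fi
    sleep "$(backoff_delay "$((attempt - 1))")"
  done
}

# Usage sketch (the URL is a placeholder, not a real endpoint):
#   retry_with_backoff 8 curl -fsS http://metrics.example.invalid/health
```

With the cap at 60 seconds, a permanently dead endpoint costs about one connection attempt a minute instead of hundreds a second. Adding random jitter to the delay is a common refinement when many clients retry against the same target, but it is left out here to keep the sketch short.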