perf top on a box that shouldn't be busy

A monitoring graph on a server screen

A machine that should be idle, sitting at twenty percent CPU, is one of the more quietly maddening problems in ops. Nothing is broken. Nothing is alarming. Everything works. But that twenty percent is heat, and electricity, and headroom you've lost, and above all it's a question with no obvious answer: what on earth is it doing? This is the story of one such box and the afternoon it cost me, because the path from "something's odd" to "oh, of course" is the genuinely useful part.

the symptom

The box runs a handful of small services and should spend nearly all its life asleep. htop showed load hovering around 0.8 on a four-core machine with no user load to speak of. No single process stood out: a smear of small percentages across a dozen processes, none of them obviously the culprit. That smeared-out look is itself a clue. When one runaway process is eating a core you spot it instantly. When the cost is spread thin across many processes, you're usually looking at something systemic: lots of small repeated work rather than one big loop.

first look: where is the time going

The first question with any "busy but I don't know why" box is whether the time is going on user code or in the kernel. vmstat 1 answered it:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 612344  18220 901120    0    0     0     2 4100 9800  4 15 81  0

The interesting number isn't the CPU columns, it's cs: nearly ten thousand context switches a second on a box doing nothing. And in, the interrupt count, sat unusually high too. That's not a CPU-bound workload chewing through computation. That's a machine being woken up constantly to do tiny scraps of work and then going back to sleep, over and over. The system time being well above the user time pointed the same way: the cost was in the kernel, in scheduling and syscalls, not in anyone's application logic.

Source code on a terminal

perf top, finally

So I reached for perf top, which samples the CPU and shows you, live, which functions are actually burning cycles across the whole system, kernel included.

perf top -g

The top of the list was a parade of kernel scheduler and syscall-entry symbols: __schedule, finish_task_switch, do_epoll_wait and friends. None of that is a smoking gun on its own, every running Linux box spends time there, but the proportions were wrong for a machine that should be idle. The scheduler shouldn't be in your top five when nothing's meant to be happening. This confirmed the vmstat reading: the work was the act of constantly waking up and switching, not anything being computed.

following the wakeups

perf top told me what the CPU was doing but not who was asking it to. For that I wanted to know which processes were generating the syscalls. A quick strace -f -c -p on the busiest few PIDs from htop was revealing. One of them, a little internal health-check daemon I'd half forgotten writing, was making an enormous number of epoll_wait and clock_gettime calls, returning almost instantly each time, then going straight round again.

There it was. The daemon had a poll loop, and the timeout on that loop was meant to be a sensible number of seconds. Except the value was being read from a config field that, after a refactor some months earlier, no longer existed, so it was defaulting to zero. A zero timeout on a poll loop isn't a pause, it's a busy-wait. The thing was spinning as fast as the scheduler would let it, waking up, finding nothing to do, and immediately asking again, thousands of times a second, across every core it could touch.

the fix and the lesson

The fix was a one-line change to read the timeout from the field that actually exists, with a non-zero default so a future typo can't reintroduce a spin loop. The box dropped to where it should have been all along: load near zero, context switches down to the low hundreds, perf top showing the idle task at the top as nature intended.

The general shape of this is worth keeping. A busy-but-idle machine is almost always one of three things: a busy-wait or zero-timeout loop, something retrying a failing operation as fast as it can, or a misconfigured poll interval. All three look identical from the outside, a smear of system time and a flood of context switches, and all three are invisible to "which process is using the most CPU" because the cost is spread thin and lives in the kernel.

There's a second perf view worth knowing about for this class of problem, which is perf sched. Where perf top shows you which functions burn cycles, perf sched record followed by perf sched latency shows you the scheduling behaviour itself: how often each task is being woken, how long it runs before it yields, how much time it spends waiting. For a busy-wait you'll see a task being scheduled in and out at an absurd rate, running for microseconds at a time, which is the signature of a loop that yields and immediately wants the CPU back. I didn't need it this time because strace -c got me there first, but on a box where the spinning process isn't one of your own and the syscall counts are less obvious, it's the tool that points the finger.

The tools that find them are the boring ones. vmstat to see that the cost is context switches rather than computation. perf top to confirm the time is going on scheduling and syscall machinery rather than anyone's actual code. strace -c to find which process is generating the storm of syscalls, and perf sched when you need to see the scheduling pattern directly. Each tool answers one question and hands you to the next. The clever bit isn't any single command, it's reading the first measurement carefully enough to know which command to reach for second. A high context-switch count on an idle box told me to stop looking for a greedy process and start looking for a frantic one, and that change of question was the whole job.

One last note for anyone chasing this on their own kit. Resist the urge to fix it the moment you find the busy loop, and instead make sure you can explain every measurement first. I very nearly stopped at "the daemon is spinning, raise the timeout" without understanding why the timeout was zero, and had I done that I'd have papered over a config-parsing bug that was sitting there waiting to cause its next surprise somewhere else entirely. The wrong default was the real fault. The spinning was just the symptom that happened to be visible. Twenty percent CPU on an idle box was the machine doing me the favour of making a silent bug loud.