who is eating all the cpu on an idle server

A server load graph on a monitoring screen

A box that does almost nothing was sitting at 30% CPU and I wanted to know why. Nothing in top looked obviously guilty: a long tail of processes each using a percent or two, nothing pegging a core, no single villain. That spread-out kind of usage is the worst sort, because the thing burning cycles is hiding inside everything at once.

top tells you which process. It doesn't tell you what that process is actually doing with the CPU, and when the cost is smeared across the whole system, the process column stops being useful. For that you want perf top, which samples the running kernel and userspace and shows you the hottest functions across the entire machine, regardless of which process they belong to.

perf top -g

Run it and you get a live list of the functions consuming the most CPU, sorted and updating in real time; with -g, each entry can be expanded into its call graph. Within seconds the answer was sitting at the top: a pile of time in the network stack and in TLS handshake routines. Symbols like tls_process_client_hello showed up, along with a great deal of time in __libc_connect and socket setup. Not data transfer. Connection setup, over and over.
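
perf top is interactive, which makes its output awkward to save or paste into a ticket. A rough non-interactive equivalent, offered as a sketch rather than as what was run here, is to record a system-wide profile for a fixed window with perf record and print it with perf report. The function name profile_system, the ten-second default, and the perf.data path are illustrative assumptions; system-wide sampling generally needs root or a relaxed kernel.perf_event_paranoid setting.

```shell
# Hypothetical helper: record a system-wide profile for N seconds,
# then print a plain-text report of the hottest symbols.
profile_system() {
    duration="${1:-10}"      # sampling window in whole seconds (assumed default)
    out="${2:-perf.data}"    # where perf record writes its samples

    # Reject anything that is not a non-negative integer before touching perf.
    case "$duration" in
        ''|*[!0-9]*)
            echo "usage: profile_system [seconds] [output-file]" >&2
            return 2
            ;;
    esac

    # -a samples all CPUs, -g records call graphs; sleep bounds the window.
    perf record -a -g -o "$out" -- sleep "$duration" || return 1

    # --stdio prints text instead of opening the interactive browser.
    perf report -i "$out" --stdio --sort comm,dso,symbol
}
```

Something like profile_system 10 > hot.txt would then leave a report that can be read or compared later, at the cost of losing the live view.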

A terminal running perf top

That pointed straight at something opening connections far too often. A bit of ss -tnp and a tcpdump later, the culprit was a health check. Some monitoring config had been deployed with a one-second interval instead of the thirty I'd intended, and it was opening a fresh HTTPS connection every single second, doing a full TLS handshake, asking /healthz if it was alive, and tearing the connection down. Multiply that by a handful of monitored endpoints and the "idle" box was spending a third of its life negotiating TLS sessions it immediately threw away.
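
One way to make that kind of churn visible, sketched here under assumptions rather than taken from the original investigation, is to tally socket states per remote address. The helper name count_by_peer is hypothetical; it expects the usual column layout of ss -tan on stdin (state in column one, peer address:port in column five) and would need adjusting if your ss prints something different.

```shell
# Hypothetical helper: read `ss -tan` output on stdin and print
# "<count> <state> <peer-address>" lines, busiest first.
count_by_peer() {
    awk '
        NR > 1 {                        # skip the header row
            peer = $5
            sub(/:[^:]*$/, "", peer)    # drop the trailing :port
            seen[$1 " " peer]++
        }
        END {
            for (key in seen) print seen[key], key
        }
    ' | sort -rn
}
```

Running ss -tan | count_by_peer a few times, a second or so apart, should show whether one peer keeps producing fresh short-lived sockets. The -a matters because ss leaves out states such as TIME-WAIT by default, and that is often where torn-down connections linger on whichever side closed first.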

The fix was two characters in a config file: 1s became 30s. CPU dropped to where it should have been. But the lesson is in the diagnosis, not the fix. Process-level tools like top are the wrong altitude when the cost is spread thin and the work is something cheap done absurdly often. A TLS handshake is not expensive. A TLS handshake hundreds of times a minute is.
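
The arithmetic behind that is simple enough to sketch. The function name handshakes_per_minute and the endpoint count of five are illustrative assumptions (the real number is not stated beyond "a handful"), and the sketch assumes one fresh connection, hence one full handshake, per check.

```shell
# Hypothetical helper: full TLS handshakes per minute, assuming every
# check opens a new connection. Integer arithmetic, whole seconds only.
handshakes_per_minute() {
    interval_s="$1"   # seconds between checks
    endpoints="$2"    # number of monitored endpoints (assumed value)
    echo $(( 60 * endpoints / interval_s ))
}

handshakes_per_minute 1 5    # one-second interval
handshakes_per_minute 30 5   # thirty-second interval
```

With those made-up inputs the two calls print 300 and 10. Whatever the real endpoint count, moving from one second to thirty cuts the handshake rate by a factor of thirty.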

perf top is the tool I reach for whenever the usage doesn't match the workload and no single process owns the blame. It looks past the process boundary and tells you what the silicon is actually doing, which on that day was a great deal of cryptography in service of repeatedly asking a server whether it was still alive. It was. It was just busy answering.