Ramblings of an aging IT geek
← Ramblings of an aging IT geek
performance

the idle server that was burning a whole core

A supposedly idle box was sitting at fifteen percent CPU doing nothing, and perf top walked me straight to the culprit.

A flame-graph-style CPU profile on a monitoring dashboard

The box had no business being busy. It served a handful of low-traffic internal endpoints, a metrics scrape every fifteen seconds, and otherwise sat there. Yet top showed it idling at fifteen percent CPU, steady, all day, like a tap left half on. Fifteen percent of one machine is not a crisis. It's the kind of thing you note and move past. But "I don't know why" is a different feeling to "it's fine", and I couldn't shake it.

So I did what I should always do first and stopped guessing.

top tells you who, not why

The first instinct is top, and top is genuinely useful here: it told me the process. A single Go service, our internal admin API, sitting at a steady ~12% on its own. That narrows the field, but it answers the wrong question. I knew which process. I had no idea what it was doing to earn that. A process name is not a stack trace.

This is the gap perf fills. top samples processes; perf top samples the CPU's instruction pointer thousands of times a second and tells you which functions are actually executing. It's the difference between "the noise is coming from the kitchen" and "the tap is dripping into the sink".

perf top, and the function that shouldn't be there

perf top -p $(pgrep -f admin-api)

You want it scoped to the process, otherwise you're profiling the whole machine and the signal drowns. With the binary built with symbols, the output was immediate and unambiguous. Right at the top, eating the lion's share of samples, was a call into the time package and the runtime's notification machinery. Not business logic. Not request handling. Plumbing.

  18.42%  admin-api  [.] time.Now
  11.07%  admin-api  [.] runtime.notetsleep
   9.83%  admin-api  [.] runtime.findrunnable
   6.51%  admin-api  [.] time.(*Timer).Reset

When time.Now and timer machinery dominate a profile on a service that's doing nothing, you don't have a slow algorithm. You have something waking up far too often to do nothing. The CPU wasn't working hard. It was busy getting out of bed.

A code editor showing a tight polling loop

the busy-wait that called itself a poll

The shape of the profile pointed straight at a hot timer, so I went looking for the tightest loop in the codebase. It didn't take long. Somebody, and "somebody" here is a person I have met and possibly was, had written a background goroutine to watch for config changes. The intent was reasonable: notice when a file on disk changed and reload it. The implementation was a for loop with a sleep, and the sleep had been written as:

for {
    if changed := checkConfig(); changed {
        reload()
    }
    time.Sleep(time.Millisecond)
}

One millisecond. A thousand wakeups a second, each one stat-ing a file that changes maybe once a month, each one churning through the scheduler to do absolutely nothing useful. On its own a stat is cheap. A thousand a second, forever, on a machine that's otherwise idle, is precisely the fifteen percent that had been nagging at me. The runtime.findrunnable in the profile was the scheduler frantically looking for work to do every time this goroutine handed control back, finding none, and going round again.

The really annoying part is that this would never show up in any sensible test. It produces correct results. It reloads config when config changes. It passes review if the reviewer isn't squinting at the sleep duration. It just costs a steady tax forever, on every instance, and nobody connects the dots because fifteen percent idle is the sort of thing you learn to ignore.

There's a reason the profile pointed so cleanly at the cause. A busy-wait doesn't look like a hot algorithm in a flame graph, where you'd expect one fat function doing real arithmetic. It looks like the runtime itself doing housekeeping: timers, scheduler bookkeeping, the clock being read over and over. Once you've seen that signature, time.Now and findrunnable near the top of an idle service, you recognise it instantly the next time. The skill isn't reading the profiler, it's knowing what a pathological pattern looks like when it shows up in one.

the fix is mostly "stop doing that"

The cheap fix is a bigger sleep. Polling once a second instead of a thousand times a second is a factor of a thousand, and config that's stale by up to a second is fine for our purposes. That alone would have killed the problem dead.

for {
    if changed := checkConfig(); changed {
        reload()
    }
    time.Sleep(time.Second)
}

The proper fix is to not poll at all. fsnotify will let the kernel tell you when the file changes via inotify, and your goroutine sleeps genuinely idle until something actually happens. That's the version that shipped, because once you've been bitten by a polling loop you stop trusting polling loops.

After the change, CPU on the idle box dropped to roughly nothing, the steady fifteen percent gone. Across the fleet that's a meaningful chunk of capacity reclaimed for the cost of an afternoon, but honestly the capacity wasn't the point. The point was that I no longer had a number I couldn't explain.

what I actually took away

A few things stuck, and they're not new, which is rather the point.

  • top finds the process. perf top finds the function. When you're asking "why is this busy", you almost always want the second one, and reaching for it first saves you a round of bad guesses.
  • Build your production binaries with symbols. A profile full of hex addresses is a profile you can't read. The few megabytes are worth it the first time something goes wrong at 2am.
  • "It's only fifteen percent" is how this stuff hides. Small, steady, unexplained waste is more dangerous than a dramatic spike, because the spike gets investigated and the steady tax gets normalised.
  • Anything with time.Sleep in a tight loop deserves a hard look. If you're polling, ask whether the kernel could just tell you instead.

I keep relearning that profiling is faster than thinking. I could have spent the afternoon reading code and theorising about what an idle admin service might be doing. Instead I spent ninety seconds with perf top and it pointed at the answer before I'd finished forming a hypothesis. The tool was right and I was about to be wrong, which is the usual arrangement, and I've made my peace with it.