Ramblings of an aging IT geek
performance

ebpf, or finally being able to ask the kernel a question

How eBPF and the bcc/bpftrace tools let me watch what the kernel is actually doing on a live host without strace's overhead, and the handful of one-liners I keep reaching for.


For years, when a process on a production host was misbehaving in a way top and the logs couldn't explain, my options were grim. I could strace it and accept that the tracing overhead might be worse than the problem I was chasing, sometimes badly so on a busy process. I could guess. Or I could try to reproduce it somewhere I was allowed to break things, which for the interesting bugs is exactly where they refuse to appear. What I actually wanted was to ask the running kernel "what are you spending your time on, right now, on this box" and get an honest answer without perturbing it. eBPF is that, and the first time it worked I was genuinely delighted.

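The pitch, stripped of the hype: you can attach small programs, each checked by the kernel's verifier before it loads, to kernel events (syscalls, function entry and exit, tracepoints, disk and network paths) and aggregate what they see in the kernel, returning only summaries to userspace. strace, by contrast, stops the traced process on every syscall and hands each event to userspace one at a time. Because the counting happens in-kernel and you get histograms back rather than a line per event, the overhead is usually a small fraction of what strace costs. The cost still scales with how often the probed event fires, but for most questions you can leave it running on a busy production host and it barely registers.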

the tools I actually use

I don't write raw eBPF. Almost nobody needs to. The bcc toolkit and bpftrace give you a shelf of ready-made tools and a one-liner language on top, and that covers the vast majority of "what is this host doing" questions.

When something's slow and I suspect disk, biolatency gives me a histogram of block I/O latency, in-kernel, live:

# biolatency
     usecs         : count    distribution
       128 -> 255  : 1832     |***********                 |
       256 -> 511  : 4410     |****************************|
       512 -> 1023 : 947      |******                      |

That distribution is the whole answer to "is the disk slow, or is it the app". A tail out at high latencies tells a very different story from a tight cluster, and top could never have shown me either.
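For anyone wanting to repeat that comparison, here is a minimal sketch of how I might invoke it. It assumes you are root and that the bcc tools are installed under their upstream names (Debian and Ubuntu packages usually append -bpfcc, so biolatency-bpfcc). The interval, count and flags are picked for illustration, not as a recommendation.

```shell
# Positional args are interval and count: a fresh histogram every 5 seconds,
# three times, then exit. -m reports milliseconds instead of microseconds,
# -D keeps a separate histogram per disk.
BIOLAT="biolatency -m -D 5 3"

# Guard so the snippet does nothing harmful on a box without bcc or without root.
if command -v biolatency >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    $BIOLAT
else
    echo "needs root and the bcc tools; would run: $BIOLAT"
fi
```

Splitting by disk is the part I find most useful: a long tail on one device and a tight cluster on the others points at that device rather than at the application.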

When a host is doing mysterious filesystem work, opensnoop shows me every file being opened and by whom, which has ended more "why is it reading that" arguments than I can count. And the one I reach for most, execsnoop, prints every process the host spawns as it happens, which is how you catch the cron job, the rogue script, the thing forking ten thousand short-lived processes that never shows up in a point-in-time ps.
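Both tools print a line per event, so on a busy host it can help to narrow or time-box them. A sketch, assuming root, bcc tools under their upstream names, and a process called myapp (a made-up name standing in for whatever you are chasing):

```shell
# Hypothetical process name; opensnoop's -n matches process names containing it.
NAME=myapp

# Opens by that process only, stopping by itself after 10 seconds (-d).
OPEN_CMD="opensnoop -n $NAME -d 10"

# Every exec on the host with a wall-clock time column (-T).
# timeout(1) is used here to end the trace after 10 seconds.
EXEC_CMD="timeout 10 execsnoop -T"

if command -v opensnoop >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    $OPEN_CMD
    $EXEC_CMD
else
    echo "needs root and the bcc tools; would run:"
    echo "  $OPEN_CMD"
    echo "  $EXEC_CMD"
fi
```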


the one-liner that paid for itself

bpftrace is where it stops being a fixed toolkit and starts being a question-answering machine. The syntax is awk-shaped: a probe, an optional filter, an action. The one I keep in my back pocket counts syscalls by process name so I can see, on a live host, exactly which process is hammering the kernel:

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

Ctrl-C it after a few seconds and you get a tally of syscalls per command. The first time I ran that on a host with unexplained CPU in the kernel, it pointed straight at one process, and keying the same count by syscall showed it making millions of gettimeofday calls in a hot loop, something no amount of staring at application logs would ever have revealed, because from the app's point of view it was just running slowly.
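The per-command tally tells you who is busy, not which syscall they are making. A follow-up along these lines narrows it down. This is a sketch rather than a recipe: myapp is a hypothetical process name, the 5-second window is arbitrary, and args->id is the syscall number field on this tracepoint (recent bpftrace releases also accept args.id).

```shell
# Hypothetical process name; substitute whatever the per-command tally pointed at.
# Note that comm is truncated by the kernel to 15 characters.
TARGET=myapp

# probe / filter / action: on every syscall entry, if the process name matches,
# count by syscall number. The interval probe exits after 5 seconds, which
# prints the map without needing Ctrl-C.
PROG="tracepoint:raw_syscalls:sys_enter /comm == \"$TARGET\"/ { @[args->id] = count(); } interval:s:5 { exit(); }"

if command -v bpftrace >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    bpftrace -e "$PROG"
else
    echo "needs root and bpftrace; would run: bpftrace -e '$PROG'"
fi
```

The keys that come back are syscall numbers for the host's architecture, so they still need looking up. If you would rather get names directly, bcc's syscount does the translation for you.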

The mental shift eBPF gave me is the thing worth passing on. I used to treat the kernel as an opaque box that I reasoned about from the outside, by symptom, because the tools to look inside were too expensive to run where the problems actually lived. Now it's a system I can interrogate directly, on the live host, in production, while the problem is happening, at a cost low enough that I don't have to ask permission. That changes how you debug. You stop forming elaborate theories to test offline and start just asking the kernel what it's doing, and most of the time it simply tells you.