ebpf, and finally seeing what the kernel sees

A graph on a screen in front of a server rack

I had a box that was slow in a way nothing would explain. CPU was fine. Memory was fine. The application logs were clean. The latency graphs showed a fat tail that appeared and vanished with no correlation to anything I was measuring, which is the worst kind of problem, because it means I was measuring the wrong things.

The old reflexes get you only so far here. strace will happily tell you which syscalls a process is making, but it does so by stopping the process at every one, which on a busy server changes the answer so badly that you might as well be measuring a different program. perf is better and I reach for it often, but for the specific question I had, which was roughly "what is actually causing these stalls and how often", I wanted to ask the kernel directly without paying a fortune for the privilege.

This is where eBPF has quietly changed what's possible.

what it actually is

The short version: eBPF lets you load small, verified programs into the running kernel. They fire on the events you care about (syscalls, function entry and exit, tracepoints) and aggregate the results in kernel space. You get the data out as histograms or counts or stacks without copying every event to userspace and without stopping the process you're watching. The verifier refuses to load anything that could loop forever or read memory it shouldn't, so you're not one typo away from a panic on a production box. That last property is the reason I'm willing to run this on machines that matter.
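To make "aggregate in kernel space" a little more concrete, here is a toy model in plain Python. It is not eBPF and it loads nothing into the kernel. It only mimics the idea of bumping one counter per event in a fixed-size table, assuming power-of-two buckets like the ones the bcc histogram tools print. The function names are mine, made up for illustration, and are not part of bcc.

```python
# Toy model of in-kernel aggregation: plain Python, not eBPF.
# Assumption: power-of-two buckets, as in the histograms bcc tools print.
# bucket_index, bucket_range and aggregate are illustrative names, not bcc APIs.

def bucket_index(value):
    """Slot for a non-negative integer: 0 and 1 share slot 1, 2-3 is slot 2, 4-7 is slot 3."""
    return max(value, 1).bit_length()

def bucket_range(index):
    """Inclusive (low, high) range a slot covers, labelled the way the tools print it."""
    low, high = (1 << index) >> 1, (1 << index) - 1
    if low == high:  # the first slot also absorbs zero, so show it as 0 -> 1
        low -= 1
    return low, high

def aggregate(latencies_ms, slots=16):
    """One counter bump per event; memory stays fixed however many events arrive."""
    counts = [0] * slots
    for ms in latencies_ms:
        # Anything too large for the table is clamped into the last slot.
        counts[min(bucket_index(ms), slots - 1)] += 1
    return counts

if __name__ == "__main__":
    sample = [0, 1, 1, 3, 5, 200, 300]  # made-up latencies in milliseconds
    for idx, count in enumerate(aggregate(sample)):
        if count:
            low, high = bucket_range(idx)
            print(f"{low:>6} -> {high:<6} : {count}")
```

The point of the shape is that seven events or seven million cost the same sixteen counters, which is why the real thing can afford to watch every I/O on a busy box.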

In practice, on this kernel, I drove it through the bcc toolkit rather than writing raw bytecode, because I value my afternoons.

Tracing output scrolling past, the closest thing to seeing the kernel think

the tools that actually found it

The first thing I ran was biolatency. It prints a histogram of block I/O latency, and within a few seconds the shape of the problem was obvious:

$ sudo biolatency -m 5 1
Tracing block device I/O... Hit Ctrl-C to end.

     msecs               : count     distribution
         0 -> 1          : 4821     |****************************************|
         2 -> 3          : 612      |*****                                   |
         4 -> 7          : 88       |                                        |
       128 -> 255        : 17       |                                        |
       256 -> 511        : 9        |                                        |

Most I/O finished in under two milliseconds, exactly as you'd hope. But there was a second little island out at 128 to 511 milliseconds. Not many requests, but enough, and a quarter-second stall at the wrong moment is precisely the sort of thing that produces a fat latency tail upstream while every average you look at stays reassuringly green. Averages had been lying to me, as they always do.
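If you want to see how thoroughly an average hides that island, the histogram itself has enough in it for a back-of-envelope check. The sketch below takes the counts straight from the output above and assumes every request sits at the midpoint of its bucket. That is a crude estimate, since the real per-request latencies were never recorded, and the helper names are just illustrative.

```python
# Back-of-envelope check on the biolatency histogram.
# Assumption: every request sits at the midpoint of its bucket, which is a
# crude estimate; the real per-request latencies were never recorded.

BUCKETS = [  # (low_ms, high_ms, count), copied from the biolatency output
    (0, 1, 4821),
    (2, 3, 612),
    (4, 7, 88),
    (128, 255, 17),
    (256, 511, 9),
]

def estimated_mean_ms(buckets):
    """Midpoint-weighted mean latency in milliseconds."""
    total = sum(count for _, _, count in buckets)
    weighted = sum((low + high) / 2 * count for low, high, count in buckets)
    return weighted / total

def slow_fraction(buckets, threshold_ms):
    """Share of requests in buckets that start at or above threshold_ms."""
    total = sum(count for _, _, count in buckets)
    slow = sum(count for low, _, count in buckets if low >= threshold_ms)
    return slow / total

def percentile_bucket(buckets, q):
    """Return the (low, high) bucket containing the q-th quantile, 0 < q <= 1."""
    total = sum(count for _, _, count in buckets)
    seen = 0
    for low, high, count in buckets:
        seen += count
        if seen >= q * total:
            return low, high
    return buckets[-1][:2]

if __name__ == "__main__":
    print(f"estimated mean: {estimated_mean_ms(BUCKETS):.2f} ms")
    print(f"share at 128 ms or slower: {slow_fraction(BUCKETS, 128):.2%}")
    print(f"p99 bucket:   {percentile_bucket(BUCKETS, 0.99)}")
    print(f"p99.9 bucket: {percentile_bucket(BUCKETS, 0.999)}")
```

On these numbers the estimated mean comes out at about 2 ms, and even the 99th percentile lands in the 4 to 7 ms bucket. A little under half a percent of requests are in the slow island, and it only shows up once you look as far out as the 99.9th percentile.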

From there bcc gave me the rest of the story. ext4slower showed which files were behind the slow operations, which pointed at a log directory I'd long forgotten was on spinning rust rather than the SSD. cachestat confirmed the page cache hit rate was dropping right when the stalls appeared, because something was periodically reading a large file and evicting everything useful. The culprit, eventually, was a backup job I had configured and then completely forgotten about, scanning a tree on the slow disk and trashing the cache for everyone else.

A bcc script open in an editor, where the real work happens

why this matters beyond this one box

I could have found this another way, eventually. With enough patience and enough iostat and a lot of guessing, the answer was reachable. But the thing that struck me was how the questions changed. Instead of forming a hypothesis and then hunting for evidence to confirm or deny it, I could just ask the kernel what was happening and let the histogram tell me where to look next. The tooling stopped being a way to confirm my theory and became a way to not need one yet.

A few honest caveats. You want a reasonably modern kernel for the good stuff (bcc's own documentation puts the floor for much of what it uses at Linux 4.1), and which tools are available varies with kernel version, so some scripts that work on one box won't on another. The learning curve for writing your own programs rather than running the prepackaged ones is real. And it is entirely possible to load a tracing program that, whilst safe, is heavy enough to add measurable overhead, so measure the observer too.

None of that changes the conclusion. For years, "what is the kernel doing right now" was a question you answered with sampling, inference, and a degree of faith. Now you can largely just ask. After three days of staring at graphs that told me nothing, watching that histogram resolve the problem in seconds was the most fun I'd had at a terminal in a while.