ebpf, or how i stopped guessing and watched the kernel

For years my performance toolkit had a hole in the middle of it. I had metrics from above (request latency, error rate, CPU graphs), and I had strace and perf from below, which are powerful but blunt and expensive. What I didn't have was a way to ask the kernel a precise question about a running production process and get an answer without bringing the process to its knees. eBPF closed that hole, and once it did I stopped guessing.

The story that converted me was a service that was, by every dashboard, healthy. p50 latency fine, CPU fine, memory fine. But p99 would occasionally spike to several hundred milliseconds for no reason anyone could see. The metrics told me that it happened. They were useless at telling me why, because the why was happening inside the kernel, in a place my application-level tracing couldn't reach.

what eBPF actually is, briefly

eBPF lets you load a small, verified program into the running kernel and attach it to a hook: a kprobe on a kernel function, a tracepoint, a syscall entry or exit. The program runs in kernel context when that hook fires, can read arguments and timings, and writes results back to userspace through maps. The verifier guarantees your program can't loop forever or read arbitrary memory, which is why the kernel is willing to run your code at all. The upshot: you get the observability of patching the kernel, without patching the kernel or rebooting.

You rarely write the bytecode by hand. I use two front-ends. bpftrace for quick one-liners, awk-like and disposable. The bcc tools for the heavier, pre-built scripts that ship as a package and just work.

the boring tools first

Before writing anything custom I reached for the bcc tools that already exist, because nine times out of ten somebody has already written the thing you need. biolatency for block I/O latency as a histogram. runqlat for how long threads sit on the run queue before getting CPU. execsnoop for things being exec'd that you didn't expect.

runqlat was the first hint. I ran it for a single five-second interval during a spike:

$ sudo runqlat 5 1
     usecs        : count    distribution
         0 -> 1   : 2104     |****************************************|
         2 -> 3   : 410      |*******                                 |
       ...
    16384 -> 32767: 38       |                                        |

Most of the time threads were scheduled in microseconds, as you'd hope. But there was a tail out at tens of milliseconds. Something was making my threads wait for CPU at exactly the moments p99 went bad. That ruled out a whole class of theories. It wasn't the database, it wasn't the network. The process was ready to run and the scheduler wasn't running it promptly.

the one-liner that found it

Run-queue latency that bad usually means contention: something else is hammering the same CPUs. So I wanted to know what was running on those cores during a spike. A short bpftrace script attached to the sched_switch tracepoint, summing on-CPU time per process name, did it:

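sudo bpftrace -e '
tracepoint:sched:sched_switch {
  // Charge the outgoing thread only if we saw it get switched in.
  if (@start[args->prev_pid]) {
    @ns[args->prev_comm] = sum(nsecs - @start[args->prev_pid]);
    delete(@start[args->prev_pid]);
  }
  // Timestamp the incoming thread.
  @start[args->next_pid] = nsecs;
}
interval:s:10 { print(@ns); clear(@ns); }'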
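
If the bookkeeping in that script looks opaque, here is the same idea replayed in plain Python over a made-up event list. It is only a sketch: the tuple layout, the pids and the process names are all invented for illustration, and nothing here touches the kernel. It charges a thread only when its switch-in was actually observed, because a missing start timestamp would otherwise be treated as zero and produce an absurd total.

```python
from collections import defaultdict


def on_cpu_ns(events):
    """Replay sched_switch-style events and total on-CPU nanoseconds per name.

    Each event is (timestamp_ns, prev_pid, prev_comm, next_pid). That tuple
    layout is hypothetical, standing in for the tracepoint's fields.
    """
    start = {}                 # pid -> timestamp at which it was switched in
    totals = defaultdict(int)  # command name -> accumulated on-CPU ns
    for ts, prev_pid, prev_comm, next_pid in events:
        # Charge the outgoing thread only if we saw it get switched in.
        began = start.pop(prev_pid, None)
        if began is not None:
            totals[prev_comm] += ts - began
        # Timestamp the incoming thread.
        start[next_pid] = ts
    return dict(totals)


# Invented trace: the service runs, the agent bursts, the service resumes.
events = [
    (1_000, 7, "other", 100),         # service (pid 100) switched in
    (4_000, 100, "service", 200),     # service out, agent (pid 200) in
    (29_000, 200, "log-agent", 100),  # agent out, service back in
    (31_000, 100, "service", 7),      # service out again
]
print(on_cpu_ns(events))
```

In this toy trace the agent holds the CPU for 25,000 ns against the service's 5,000 ns, which is the shape of result that pointed me at the real offender.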

The culprit was a log-shipping agent. A sidecar nobody thought about, configured to compress and forward logs, and every so often it would wake up, grab a couple of cores flat-out for a compression burst, and starve the actual service of scheduler time for a few tens of milliseconds. On a quiet box you'd never notice. Under load, those tens of milliseconds landed on whichever unlucky request was mid-flight, and that request became the p99.

why this beat the alternatives

I could, in principle, have found this with perf and enough patience. But perf record over a long enough window to catch a rare spike produces an enormous trace, and the overhead itself can distort the very timing you're chasing. eBPF let me aggregate in the kernel and only ship summaries to userspace. The histogram was built in-kernel by runqlat; I never moved a million individual events across the boundary. That is the whole trick: do the reduction where the data is, before it costs you anything.
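
To make the reduction concrete, here is a small Python sketch of power-of-two bucketing, the same shape as the histogram above. It runs in userspace on synthetic numbers, so it says nothing about how runqlat is implemented internally; the function names and the sample data are my own invention. The point is only that a hundred thousand samples collapse into a handful of counters.

```python
import random


def log2_bucket(usecs):
    """Return the (low, high) power-of-two bucket holding a latency in usecs."""
    if usecs < 2:
        return (0, 1)
    low = 1 << (usecs.bit_length() - 1)  # largest power of two <= usecs
    return (low, 2 * low - 1)


def histogram(samples):
    """Fold individual samples into per-bucket counts."""
    counts = {}
    for s in samples:
        bucket = log2_bucket(s)
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


random.seed(0)
# Synthetic data: mostly fast wakeups plus a rare slow tail.
samples = [random.randint(0, 3) for _ in range(100_000)]
samples += [random.randint(16384, 32767) for _ in range(38)]

hist = histogram(samples)
for (low, high), count in sorted(hist.items()):
    print(f"{low:>8} -> {high:<8}: {count}")
```

However many samples go in, what comes out is one counter per bucket, and that small table is all that ever needs to cross into userspace.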

The other alternative was inference. Stare at graphs, form a theory, change something, see if the graph moves. I'd spent two days doing exactly that before reaching for eBPF, and I'd been confidently wrong about the cause the entire time. The kernel knew. I just hadn't asked it.

the fix and the lesson

The fix was unglamorous: pin the log agent to its own CPU set with taskset so it no longer shares cores with the request path, and raise its nice value so the scheduler favours the service. p99 flattened that afternoon.
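
I did this with command-line tools, but the same two steps can be sketched in Python on Linux. Treat this as an illustration under assumptions: the helper names, the choice to reserve the highest-numbered CPU and the nice value of 10 are all hypothetical, not what I ran in production.

```python
import os


def split_cpus(cpus, agent_count=1):
    """Reserve the highest-numbered CPUs for the agent; the rest stay with the service."""
    ordered = sorted(cpus)
    if not 0 < agent_count < len(ordered):
        raise ValueError("need at least one CPU on each side of the split")
    return set(ordered[:-agent_count]), set(ordered[-agent_count:])


def deprioritise_agent(agent_pid, agent_cpus, nice_value=10):
    """Pin one process to agent_cpus and lower its priority (Linux only).

    Affinity is applied to the given thread ID only, so a multithreaded
    agent would need this repeated for each of its threads.
    """
    os.sched_setaffinity(agent_pid, agent_cpus)
    os.setpriority(os.PRIO_PROCESS, agent_pid, nice_value)


# Example split on a hypothetical four-core box.
service_cpus, agent_cpus = split_cpus({0, 1, 2, 3}, agent_count=1)
print(service_cpus, agent_cpus)
```

Calling deprioritise_agent needs permission to modify the target process, which is why the example only computes the split and leaves the call to the reader.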

The lesson I keep is that the most expensive part of a performance problem is usually the time spent guessing. eBPF doesn't make you cleverer. It makes guessing unnecessary, because you can attach a probe to the exact function you're suspicious of and watch it answer. Once you've felt that, going back to inferring kernel behaviour from userspace metrics feels like reading shadows on a wall.

If you've not tried it, install the bcc tools on a test box and run execsnoop for a minute. Watch every process the machine spawns scroll past in real time, things you had no idea were running. That moment, seeing what the kernel sees, is the one that sticks.