Ramblings of an aging IT geek
← Ramblings of an aging IT geek
performance

watching the kernel work, finally, with ebpf

Using bpftrace and the bcc tools to chase a latency problem that strace and perf could not explain, and what it feels like to instrument a running kernel without restarting anything.

A latency graph on a monitor next to a rack, the line spiking in a way that ruins your afternoon

For most of my career, "seeing what the kernel is doing" meant one of two unsatisfying things. Either you bolted strace onto a process and slowed it down so much that the problem you were chasing changed shape, or you ran perf and stared at a flame graph that told you where the time went but not why. Both are useful. Neither lets you ask the kernel a specific question about a live, production-load process and get a specific answer back without restarting anything. eBPF does, and the first time it actually answered a question I had been failing to answer for two days, I sat back and laughed.

The problem was latency on a service that was, by every metric we had, doing nothing wrong. CPU was fine. Memory was fine. The application traces showed the request spending almost all its time in a single file read that should have been instant, because the file was small and surely cached. Should have been. The whole investigation lived in that word.

asking the kernel directly

strace confirmed the read was slow but distorted the timing so badly I could not trust the numbers. So I reached for bpftrace, which lets you attach a tiny program to a kernel function and aggregate what you see, with overhead low enough to run against real traffic. The first question was simple: how long are reads actually taking, as a distribution, not an average.

# histogram of read() latency in microseconds, system-wide
bpftrace -e '
  tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
  tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
  }'

An average would have lied to me. The histogram did not. Most reads were sub-microsecond, exactly as you would expect from page cache. But there was a second, small bump out in the millisecond range. A bimodal distribution: nearly everything instant, a thin tail of reads that were a thousand times slower. Averages smear a bimodal distribution into a single misleading middle. The histogram showed the two populations plainly, and the slow tail was where my latency was hiding.

A close-up of bpftrace output, a text histogram with two distinct humps

from "reads are slow sometimes" to "this read, this block"

Knowing some reads were slow was not enough. I needed to know which reads, and why those. This is where the bcc tools earn their keep. biolatency showed me block-device latency directly, confirming that the slow reads were going all the way to disk rather than being served from cache. cachestat confirmed the cache hit rate was not the lovely 99-point-something I had assumed.

$ /usr/share/bcc/tools/cachestat 1
    HITS   MISSES  DIRTIES  HITRATIO
  148210      372       11    99.75%
  149003     1820        9    98.79%
   91204    11038        6    89.20%   <-- here

There it was, in real time, on real load. Periodically the hit ratio sagged, and those were exactly the moments the latency tail appeared. The file I had assumed was permanently cached was being evicted under memory pressure from a completely unrelated batch job that ran on the same host and churned through a large dataset, flooding the page cache and pushing my small, hot, supposedly-always-resident file out. When that happened, the next read went to disk, took a millisecond instead of a microsecond, and showed up as a latency spike in a service that had no idea it was sharing a cache with a noisy neighbour.

To make the link airtight rather than circumstantial, I left both probes running side by side for a few minutes and watched them move together. Every time cachestat showed the hit ratio dip, the read-latency histogram grew its slow tail, and the batch job's working-set churn was visible in vmstat at the same moments. Three independent views of the same machine, all pointing at the same five-second windows, all caused by a job that the application I was debugging had never heard of. That is the kind of correlation you can defend in a postmortem, because it is not a hunch, it is three instruments agreeing.

None of my application instrumentation could have found this, because from the application's point of view it issued a read and the read was sometimes slow. The cause was entirely in the kernel's memory management, in the interaction between two processes that did not know about each other, and the only way to see it was to instrument the kernel while both were running under real load.

why this changes how I debug

The thing eBPF gives you that the old tools do not is the ability to ask a new question of a running system without redeploying it. With strace you pay a tax that changes the experiment. With logging you have to have predicted the question before the incident and added the log line in advance. With eBPF you write the probe in the middle of the incident, attach it to live traffic, get an answer in seconds, and detach it leaving nothing behind. The kernel keeps running the whole time, none the wiser.

The fix here was mundane once I understood it: the batch job got cgroup limits so it could not monopolise the page cache, and the latency tail went away. But I would never have got to that fix from the application side. I would have spent another two days adding timers to code that was not the problem.

There is a cultural shift in here as well as a technical one. For a long time the kernel was a black box you were expected to leave alone, and the working assumption was that if your problem was down there, you were probably wrong and it was probably your code. eBPF quietly removes that excuse. The kernel is now a thing you can interrogate as readily as your own application, with safety guarantees that mean a bad probe gets rejected by the verifier rather than panicking the box. That changes who gets to debug what. The boundary between "application problem" and "kernel problem" used to be a wall you handed the ticket over. Now it is a line you can simply step across with a one-liner.

If you have not used bpftrace or the bcc tools yet, the on-ramp is gentler than the reputation suggests. Start with the one-liners, point a histogram at a syscall you care about, and look at the distribution rather than the average. The first time the kernel answers a question you could not answer any other way, you will understand why people get evangelical about it. I am trying very hard not to be that person, and failing.