Watching the kernel without a debugger

A monitoring dashboard with latency histograms and kernel traces

For years my mental model of the kernel was a black box with a few labelled dials on the front: top for CPU, free for memory, iostat for disk. When those did not explain the problem I shrugged and blamed the application. I could see the inputs and the outputs and almost nothing in between. eBPF is the thing that finally opened the box, and the first time I used it in anger it told me, in about four seconds, something I had been guessing at for a week.

The problem was the usual sort. A service had occasional slow requests. Not many, not consistently, but enough that the p99 graph had a permanent low fever. The application traces said the time was being spent "in a syscall" and then went quiet, which is the tracing equivalent of a shrug. The slow part was below my code, in the kernel, and I had no way to see it.

What eBPF actually is

The short version: eBPF lets you load small, verified programs into the running kernel. They fire on events (a syscall entry, a function being called, a packet arriving) and collect data without crashing the machine. The verifier is the important word. It checks that your program cannot loop forever or read memory it should not, which is why this is safe to run on a production box in a way that, say, a hand-written kernel module is not. You are not patching the kernel. You are attaching little observers to it.

You almost never write the bytecode yourself. What made it usable for me is the bcc collection, a set of ready-made tools that wrap all of this up. On the boxes I care about it is one package away, and most of the useful tools are a single command with no arguments.
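For a sense of what those ready-made scripts wrap, here is a minimal sketch using bcc's Python bindings. It assumes the bindings are installed and that it runs as root on a kernel with eBPF support. The function names trace_clone and run_tracer are made up for illustration, and the choice of the clone syscall as the event is arbitrary.

```python
# Sketch only: needs the bcc Python bindings, root, and an eBPF-capable kernel.

# The observer itself is a small piece of restricted C.
# The name trace_clone is illustrative, not part of any tool.
BPF_PROGRAM = r"""
int trace_clone(void *ctx) {
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""


def run_tracer():
    """Attach the observer to the clone syscall and print events until Ctrl-C."""
    # Imported here so that merely loading this file does not require bcc.
    from bcc import BPF

    # Compile the C above to eBPF bytecode.
    b = BPF(text=BPF_PROGRAM)

    # Load the program into the kernel (the verifier checks it at load time)
    # and hook it to the entry of the clone syscall.
    b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="trace_clone")

    try:
        # Stream whatever the program writes to the kernel trace pipe.
        b.trace_print()
    except KeyboardInterrupt:
        pass
```

Calling run_tracer() as root should print a line each time a process is cloned, until you interrupt it. The packaged tools are built on the same pattern, with far more careful data collection than a print statement.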

A close-up of a bcc script's output, histograms drawn in ASCII

The tools that earned their keep

Here is the one that cracked my latency problem. biolatency measures block-device I/O latency and aggregates it into a histogram inside the kernel, so the overhead is very low:

# biolatency
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
       128 -> 255        : 312      |****                  |
       256 -> 511        : 1430     |********************  |
       512 -> 1023       : 588      |********              |
      1024 -> 2047       : 41       |                      |
      2048 -> 4095       : 9        |                      |
      8192 -> 16383      : 6        |                      |
     16384 -> 32767      : 4        |                      |

There it was. The bulk of I/O was sub-millisecond, exactly as expected, but there was a thin, persistent tail running from 8 milliseconds out past 16. An average would have hidden it completely; the mean was well under a millisecond. The histogram showed it plainly. That tail lined up with my slow requests, and it pointed me at the disk rather than the application, which is where I had been wasting my week.
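To put rough numbers on that, the bucket counts from the output can be fed through a few lines of Python. This is a back-of-the-envelope sketch: it assumes every I/O in a bucket sits at the bucket's midpoint, which is an approximation, and the helper names are my own.

```python
# (low, high, count) rows copied from the biolatency output, in microseconds.
# The 4096 -> 8191 bucket does not appear in the output and is treated as empty.
BUCKETS = [
    (128, 255, 312),
    (256, 511, 1430),
    (512, 1023, 588),
    (1024, 2047, 41),
    (2048, 4095, 9),
    (8192, 16383, 6),
    (16384, 32767, 4),
]


def estimate_mean_usecs(buckets):
    """Approximate the mean latency, placing each I/O at its bucket midpoint."""
    total = sum(count for _, _, count in buckets)
    weighted = sum(((low + high) / 2) * count for low, high, count in buckets)
    return weighted / total


def tail_fraction(buckets, threshold_usecs):
    """Fraction of I/Os in buckets that start at or above the threshold."""
    total = sum(count for _, _, count in buckets)
    tail = sum(count for low, _, count in buckets if low >= threshold_usecs)
    return tail / total


if __name__ == "__main__":
    print(f"estimated mean: {estimate_mean_usecs(BUCKETS):.0f} usecs")
    print(f"share at 8 ms or slower: {tail_fraction(BUCKETS, 8192):.2%}")
```

Under that midpoint assumption the mean comes out a little over half a millisecond, while the I/Os at 8 milliseconds or slower are about 0.4 percent of the total: invisible in the average, obvious in the histogram.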

A few others I now reach for without thinking:

execsnoop shows every process the system spawns, which catches the cron job or shell-out you forgot was running.
opensnoop shows every file being opened, which is how I once found a service opening the same config file thousands of times a second.
tcplife shows TCP sessions with their duration and bytes moved, which turns "the network feels slow" into actual numbers.
funccount counts how often a given kernel or library function is called, so you can confirm or kill a theory about where the work is going.
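Raw snoop output scrolls past quickly, so it helps to tally it. The sketch below counts repeated opens per command and path. It assumes opensnoop's default column layout (PID, COMM, FD, ERR, PATH); the sample lines, the service name and the paths are invented for illustration, not captured from a real system.

```python
from collections import Counter

# Hypothetical sample in opensnoop's default column layout.
SAMPLE = """\
PID    COMM               FD ERR PATH
4312   myservice           7   0 /etc/myservice/config.yaml
4312   myservice           7   0 /etc/myservice/config.yaml
901    cron                5   0 /etc/crontab
4312   myservice           7   0 /etc/myservice/config.yaml
"""


def count_opens(lines):
    """Count opens per (command, path), skipping the header and blank lines."""
    counts = Counter()
    for line in lines:
        # Split into at most five fields so a path containing spaces stays whole.
        fields = line.strip().split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        _pid, comm, _fd, _err, path = fields
        counts[(comm, path)] += 1
    return counts


if __name__ == "__main__":
    for (comm, path), n in count_opens(SAMPLE.splitlines()).most_common(3):
        print(f"{n:>6}  {comm}  {path}")
```

A command name that itself contains spaces would confuse this simple split, so treat it as a quick filter rather than a robust parser.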

The thread running through all of them is the same: they answer questions you previously could only guess at, and they answer them on the live system without a restart, a recompile, or a maintenance window.

A second terminal showing execsnoop tracing processes as they spawn

Why histograms, not averages

The biolatency output is worth dwelling on, because it captures why this approach changed how I work. Traditional tooling gives you summary numbers: average latency, total IOPS, percent utilised. Those numbers smear over exactly the behaviour that hurts. A device can be 99 percent fast and 1 percent catastrophic and report a lovely average. The histogram refuses to lie to you that way. It shows you the shape, and the shape is where the pain lives.
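The "99 percent fast, 1 percent catastrophic" case is easy to reproduce with made-up numbers. This sketch buckets synthetic latencies into power-of-two ranges, in the same spirit as the tool's output though not identical to it, and prints the mean alongside. The sample values are invented purely to show the effect.

```python
from collections import Counter


def log2_bucket(value):
    """Return the (low, high) power-of-two range containing a non-negative int."""
    if value < 1:
        return (0, 0)
    low = 1 << (value.bit_length() - 1)
    return (low, (low << 1) - 1)


def log2_histogram(samples):
    """Group samples into power-of-two buckets, sorted by bucket."""
    counts = Counter(log2_bucket(int(s)) for s in samples)
    return sorted(counts.items())


def render(hist, width=40):
    """Draw the histogram as text, scaling bars to the fullest bucket."""
    peak = max(count for _, count in hist)
    lines = []
    for (low, high), count in hist:
        bar = "*" * (count * width // peak)
        lines.append(f"{low:>8} -> {high:<8} : {count:<6} |{bar:<{width}}|")
    return "\n".join(lines)


# Synthetic latencies in microseconds: 99% fast, 1% very slow.
SAMPLES = [300] * 990 + [50_000] * 10

if __name__ == "__main__":
    print(f"mean: {sum(SAMPLES) / len(SAMPLES):.0f} usecs")
    print(render(log2_histogram(SAMPLES)))
```

The mean of these samples is 797 microseconds, which looks healthy. The histogram shows two separate populations, one of them more than a hundred times slower than the other.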

That is really the whole pitch for me. eBPF did not give me information I could not, in principle, have got some other way. People have been tracing kernels for a long time with heavier, riskier tools. What it gave me was the ability to ask a precise question of a production system and get a precise answer in seconds, without taking an outage to do it. The cost of curiosity dropped to almost nothing.

A note on caution

It is not free of footguns. The tracing has overhead, small but real, and on very high-frequency events it adds up. The tools need root and a kernel new enough to support what they attach to, which on older boxes means checking before you assume. So on a system under real load I start with the cheaper tools and the shorter sampling windows and work up.
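The "check before you assume" step can be scripted. This is a minimal preflight sketch, not a definitive test: the minimum version of 4.9 is an assumption chosen for illustration, since the real requirement depends on which tool you run and what it attaches to, and the function names are my own.

```python
import os
import platform
import re


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2)))


def preflight(min_kernel=(4, 9), release=None, euid=None):
    """Return a list of reasons the tracing tools may not work here.

    min_kernel is an assumed threshold, not an official requirement.
    release and euid can be passed in for testing; by default they are
    read from the running system.
    """
    release = platform.release() if release is None else release
    euid = os.geteuid() if euid is None else euid

    problems = []
    if euid != 0:
        problems.append("not running as root")
    if parse_kernel_release(release) < min_kernel:
        problems.append(f"kernel {release} is older than {min_kernel[0]}.{min_kernel[1]}")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("looks ok" if not issues else "; ".join(issues))
```

Distributions sometimes backport features into older kernel versions, so a failed version check is a hint to investigate rather than a verdict.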

But the trade is overwhelmingly worth it. I went from a black box with a few dials to something I can actually interrogate. The slow-request mystery that had survived a week of application profiling fell over in an afternoon once I could see what the kernel saw, and the answer, a slow disk tail, was sitting in plain view the whole time, below the floor of every tool I had been using. I just had not had a way to look down there. Now I do, and I find it hard to imagine going back.