I've spent years profiling things from the outside. Wrap the suspect code in a timer, log the duration, stare at percentiles, infer what the kernel must be doing from the shape of the latency. It works, sort of, in the way that diagnosing an engine fault by listening to it from the driver's seat works. This week I finally sat down with eBPF properly, and the difference is that you stop inferring and start watching.
The short version: eBPF lets you attach small, verified programs to events inside a running kernel and pull out data, with no module to compile, no kernel rebuild, and no reboot. You can hook a kernel function, a tracepoint, a syscall, even a function in your own userspace binary, and run a little program every time it fires. The kernel verifies your program can't loop forever or read where it shouldn't before it'll let it run, which is what makes this safe enough to do on a box that's actually serving traffic.
Why this is different from what came before
You could always do some of this. strace shows you syscalls, but it is built on ptrace and stops the target at every syscall, so it's heavy and you would never point it at a busy production process. perf is excellent for sampling. SystemTap existed and could do a lot of this, but it compiled a kernel module each time and felt like something you'd only run on a box you were prepared to lose.
eBPF is the first time this kind of deep tracing has felt safe to run on something that matters. The verifier is the key. Because it checks that your probe is bounded and well-behaved before loading it, the blast radius is small. And because the probe runs inside the kernel and can aggregate there instead of shipping every event out to userspace, the overhead is usually low. That changes the question from "can I afford to look?" to just "what do I want to look at?"
Getting started with bcc
You don't write eBPF bytecode by hand. The iovisor bcc project gives you a Python front end where your kernel-side probe is C in a string and the userspace side that reads the results is Python. More importantly it ships a pile of ready-made tools that already answer most of the questions you'll have, and reading their source is the best tutorial there is.
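On a recent enough kernel, install the tools and just start running them as root, since loading a probe needs privileges. Depending on how you installed bcc the names may differ: the Debian and Ubuntu packages add a -bpfcc suffix (opensnoop-bpfcc), and a source install puts them under /usr/share/bcc/tools. A few that earned their place in my first afternoon: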
# which processes are opening which files
opensnoop
# the distribution of block I/O latency, as a histogram
biolatency
# every new process exec, with arguments
execsnoop
# slow ext4 operations over a threshold
ext4slower 10
biolatency was the one that made me sit up. Instead of an average, which always lies, it prints an actual histogram of disk latency in power-of-two buckets, in microseconds by default. The first time I ran it I expected a tidy cluster and got a clear bimodal shape: most operations fast, a distinct second hump out in the tens of milliseconds. That second hump was the tail that had been hurting us, made visible in a way no average ever would.
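To make the point about averages concrete, here is a minimal sketch in plain Python. It is not biolatency's code; it only imitates the power-of-two bucket layout, and the sample latencies are made up to resemble the shape I saw, not measured data. The function names are my own.

```python
from collections import Counter


def log2_bucket(value_us: int) -> int:
    """Bucket index for a latency in microseconds: bucket 0 holds 0..1,
    bucket b >= 1 holds 2**b .. 2**(b+1) - 1."""
    return max(value_us, 1).bit_length() - 1


def log2_histogram(samples_us):
    """Count how many samples fall into each power-of-two bucket."""
    return Counter(log2_bucket(v) for v in samples_us)


def format_histogram(hist, width=40):
    """Render the buckets between the lowest and highest occupied one."""
    peak = max(hist.values())
    lines = []
    for b in range(min(hist), max(hist) + 1):
        lo = 0 if b == 0 else 1 << b
        hi = (1 << (b + 1)) - 1
        count = hist.get(b, 0)
        bar = "*" * (count * width // peak)
        lines.append(f"{lo:>8} -> {hi:<8} : {count:<6} |{bar}")
    return "\n".join(lines)


# Made-up bimodal data: 900 fast operations at 200 us, 100 slow ones at 20 ms.
samples = [200] * 900 + [20_000] * 100
mean = sum(samples) / len(samples)
hist = log2_histogram(samples)

print(f"mean = {mean:.0f} us")
print(format_histogram(hist))
```

With this data the mean comes out at 2180 microseconds, which falls in the 2048 -> 4095 bucket, where there are no samples at all. The average describes a latency that never happened; the histogram shows the two humps.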
The problem it actually solved
I had a service that was mostly fine and occasionally, unpredictably, slow. The application metrics said the slow requests spent their time "in the database call", which is true and useless, because it doesn't say whether that's the query, the network, the disk, or something queueing behind a lock.
So I traced it from underneath. biolatency showed the disk had a fat tail. ext4slower then named the specific filesystem operations crossing ten milliseconds and which process owned them, and they lined up with a background job I'd half forgotten about that periodically rewrote a large file. The job and the service shared a disk. The job's bursts of writes were stalling the service's reads, and from inside the application that just looked like "the database was slow". No amount of staring at application logs would have told me that, because the cause was a layer below where the application can see.
The fix was dull, as good fixes are: move the noisy background job to its own volume and rate-limit its writes. The point isn't the fix, it's that I found the cause in an evening instead of guessing for a fortnight.
Writing your own, a little
Once the ready-made tools have whetted your appetite, writing a custom probe is less frightening than it looks. A toy that counts syscalls per process is genuinely a few lines: attach to the raw_syscalls:sys_enter tracepoint, key a hash map by PID, increment, and print the map on exit. The bcc examples directory has a dozen of these to crib from. You'll spend most of your time learning what the verifier won't let you do, which is a slightly humbling but ultimately reassuring conversation to have with a kernel.
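For the syscall-counting toy, a minimal sketch might look like the following. Treat it as a starting point under assumptions, not a finished tool: it needs bcc installed, root, and a kernel that supports tracepoint probes. The map name counts and the helpers top_counts and run are my own choices, and it stops after a fixed number of seconds (or on Ctrl-C) before printing.

```python
import time

# Kernel-side probe: C in a string, compiled by bcc when the program is loaded.
BPF_PROGRAM = r"""
BPF_HASH(counts, u32, u64);   // process ID -> number of syscalls seen

TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    // The upper 32 bits of this value hold the process ID (tgid).
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""


def top_counts(pairs, limit=10):
    """Sort (pid, count) pairs by count, highest first, and keep `limit`."""
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)[:limit]


def run(seconds=5):
    """Load the probe, let it count for a while, then print the busiest PIDs."""
    # Imported here so the rest of the file loads on a machine without bcc.
    from bcc import BPF

    b = BPF(text=BPF_PROGRAM)  # compiles, verifies, and attaches the probe
    try:
        time.sleep(seconds)
    except KeyboardInterrupt:
        pass

    # Userspace side: read the map the kernel-side probe has been filling.
    pairs = [(k.value, v.value) for k, v in b["counts"].items()]
    for pid, n in top_counts(pairs):
        print(f"{pid:>8} {n}")
```

Call run() as root to try it. The counting happens entirely in the kernel; Python only reads the finished map, which is why a probe on a path this hot stays cheap.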
A few caveats from week one
It is not all free. The probes do cost something, more on very hot paths, so measure the measurement on anything truly latency-sensitive. Kernel version matters a great deal: tracepoints and helpers that exist on a current kernel may be missing on an older one, and a probe that hooks a specific internal function can break when that function is renamed between versions. There is also a floor: bcc's own documentation says much of what it uses needs Linux 4.1 or later, and some tools need newer still. None of that diminishes it. It's just the usual tax for living close to the metal.
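One way to measure the measurement is to time the same syscall-heavy operation with and without a probe attached and compare the tail, not the mean. The sketch below is an illustration under assumptions: os.stat(".") stands in for whatever your service really does, and the helper names are my own. Run it once as a baseline, attach your probe, run it again, and feed both sets of timings to overhead_pct.

```python
import math
import os
import time


def time_calls(fn, n=1000):
    """Call fn n times and return each call's duration in microseconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter_ns()
        fn()
        durations.append((time.perf_counter_ns() - start) / 1000)
    return durations


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def overhead_pct(baseline, probed, q=99):
    """Relative change, in percent, of the q-th percentile between two runs."""
    base = percentile(baseline, q)
    return (percentile(probed, q) - base) * 100 / base


# Stand-in workload: one stat() syscall per call.
baseline = time_calls(lambda: os.stat("."), n=2000)
print(f"p50 = {percentile(baseline, 50):.1f} us, "
      f"p99 = {percentile(baseline, 99):.1f} us")
```

Timings this small are noisy, so repeat each run a few times before trusting a difference.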
Worth it?
Unreservedly. eBPF has changed the kind of question I'm willing to ask about a running system. The old reflex was to reason about the kernel as a black box and look inside only when desperate, because the tools for doing so were either too blunt or too dangerous for production. Now the box is translucent, the tools are safe enough to run on a live host, and I can answer questions in an evening that I'd previously have argued about for days.
If you do anything that touches performance on Linux, spend a wet afternoon with the bcc tools. Run biolatency and execsnoop against something real and watch what the kernel has quietly been doing the whole time, just out of sight.