Ramblings of an aging IT geek
← Ramblings of an aging IT geek
performance

ebpf, or how i stopped guessing about a slow box

Using bcc and eBPF to trace a mysteriously slow server in production without strace, perf overhead or a maintenance window.

A monitoring graph on a server screen

We had a box that was slow in a way nobody could explain. CPU sat at maybe 30 per cent, memory was fine, disks weren't saturated by any metric we trusted, and yet requests that should have taken five milliseconds were occasionally taking two hundred. The averages looked healthy. The tail was on fire. And the usual tools were either too blunt or too dangerous to point at a production process under load.

strace was out: attaching it to a busy process can slow it down by an order of magnitude, and a box that's already misbehaving is not the place to add a tax like that. perf was better but still felt heavy, and sampling profilers tell you where time goes on average, which is exactly the question I wasn't asking. I wanted to know about the rare slow request, the one in a thousand, and averages bury those by design.

This is where eBPF earned its keep. If you've not met it, eBPF lets you load small, verified programs into the kernel that run on events: a syscall, a function entry, a tracepoint, a disk I/O completion. The verifier guarantees they can't crash the kernel or loop forever, so you can run them on a live production box without the cold sweat that comes with kernel modules. The bcc toolkit wraps all this in Python so you don't have to write the bytecode by hand.

Starting with the off-cpu question

The first useful question was: when a request is slow, what is it actually waiting on? Not burning CPU, waiting. That's off-CPU time, and it's invisible to a normal profiler because the thread isn't running. bcc ships offcputime, which traces scheduler events and records the stack each time a thread goes off-CPU and how long it stayed there.

I ran it for thirty seconds:

# offcputime-bpfcc -p $(pgrep -n ourservice) 30

The output is folded stacks with microsecond totals. Most of it was the expected epoll wait, threads parked waiting for work, which is fine. But buried in there was a stack going through a logging call into a futex, blocking for tens of milliseconds at a time. A lock. Under our logger.

A close-up of code on a terminal

Confirming it with the actual numbers

A stack trace is a lead, not a verdict. So I reached for funclatency to measure how long that logging path actually took, as a histogram rather than an average:

# funclatency-bpfcc -p $(pgrep -n ourservice) -u 'c:write' 30

The histogram had two humps. The vast majority of writes finished in single-digit microseconds. A small but real population sat up at tens of milliseconds. That bimodal shape is the signature of contention: most of the time you get the lock immediately, occasionally you queue behind someone holding it, and the occasionally is what was eating our tail latency.

The culprit turned out to be embarrassingly ordinary. Our logger was configured to flush synchronously on every line, and under load multiple threads were serialising behind a single mutex and an fsync-happy file handle. On a quiet box you'd never notice. Under our peak traffic, the queue formed, and every thread that wanted to log a line had to wait its turn behind a disk write. The 200ms requests were simply the ones unlucky enough to log whilst the flush was in flight.

Why this mattered beyond the fix

The fix was boring: buffer the log writes and flush on a timer off the hot path. Tail latency dropped back into single-digit milliseconds within minutes of the deploy, and the bimodal histogram collapsed into one hump.

What I want to remember is the method, not the bug. For years my debugging on a slow production box was a mixture of guessing, reading code, and adding log lines that themselves changed the timing. eBPF let me ask the kernel direct questions, on the live system, at the moment it was misbehaving, with overhead low enough that I didn't have to schedule a window or hold my breath. No recompile, no restart, no instrumentation baked into the app.

A few things I'd pass on if you're starting:

  • Reach for offcputime when something is slow but not busy. Waiting is invisible to CPU profilers.
  • Trust histograms over averages. A bimodal distribution is contention until proven otherwise.
  • The verifier is your friend, not an obstacle. It refusing to load your program usually means your program was about to do something daft.

There's a learning curve, and bcc's Python wrappers paper over a lot of genuine complexity underneath. But the payoff is being able to see what the kernel sees, on a box that's actively on fire, without making the fire worse. After years of inference, that directness still feels slightly magical.