the cost of a syscall, measured

"A syscall is expensive" is one of those things everyone repeats and almost nobody has numbers for. So I sat down with a microbenchmark and an afternoon to put an actual figure on it, because "expensive" relative to what is the whole question. Expensive next to a function call, yes. Expensive next to a disk read, absolutely not. The interesting territory is in between, and that's where most performance decisions actually live.

the setup

The benchmark is deliberately boring: a tight loop calling the cheapest syscall I can find, measured against a tight loop doing the equivalent work without crossing into the kernel. The classic choice is getpid(), except older versions of glibc cached the result in userspace (the cache was removed in glibc 2.25), so on an older C library you measure nothing. To force a real round trip into the kernel and back I call getppid, which isn't cached, through the raw syscall() interface:

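#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    const long N = 200000000L;  /* illustrative count: "a few hundred million" */
    for (long i = 0; i < N; i++) {
        syscall(SYS_getppid);   /* raw syscall, so every iteration enters the kernel */
    }
    return 0;
}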

Run that a few hundred million times, divide the elapsed time by the iteration count, and you get a per-call cost. The trick with any microbenchmark like this is to make sure the compiler can't optimise the loop away and that you're warm: caches primed, CPU frequency settled, no thermal throttling halfway through skewing the average.
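
To make the "run it and divide" step concrete, here is a minimal sketch of a timing harness. The names now_ns and syscall_cost_ns are made up for illustration, and the warm-up length (a tenth of the measured run) is an arbitrary assumption rather than a tuned value. It is a sketch of the idea, not the exact harness behind the figures in this post.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average cost, in nanoseconds, of one raw getppid round trip over n calls.
   The name and the warm-up ratio are illustrative assumptions. */
double syscall_cost_ns(long n) {
    assert(n > 0);
    volatile long sink = 0;  /* volatile store keeps each result "used" */

    /* Warm-up pass: primes caches and gives the CPU clock time to ramp up. */
    for (long i = 0; i < n / 10; i++)
        sink = syscall(SYS_getppid);

    uint64_t start = now_ns();
    for (long i = 0; i < n; i++)
        sink = syscall(SYS_getppid);
    uint64_t end = now_ns();

    (void)sink;
    return (double)(end - start) / (double)n;
}
```

Call syscall_cost_ns from main with a large n and print the result. The loop overhead and the libc wrapper are included in the figure, which is fine at this scale but worth remembering when comparing against someone else's number.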

A couple of practical points before the numbers, because they're the difference between a measurement you can trust and a number you made up. Pin the process to a single core with taskset so it isn't migrated mid-run. Disable frequency scaling, or at least confirm the governor has settled at a fixed clock, otherwise your first million iterations run at one frequency and the rest at another. And run it several times: if the figures wander by more than a few percent between runs, something in the environment is moving and your number isn't real yet. None of this is exotic, but skip it and you'll measure the scheduler and the thermal envelope rather than the syscall.
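
If you would rather pin from inside the program than wrap it in taskset, something like the following sketch should do it on Linux. The function name pin_to_one_cpu is hypothetical, and picking the first CPU in the current affinity mask is my assumption; any allowed core would serve.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to one CPU it is already allowed to run on.
   Returns the chosen CPU number on success, -1 on failure.
   (Hypothetical helper; choosing the first allowed CPU is arbitrary.) */
int pin_to_one_cpu(void) {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            /* pid 0 means "the calling thread" */
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}
```

Call it once at the top of main, before the warm-up loop. Starting from the existing mask rather than hard-coding CPU 0 matters in containers, where core 0 may not be available to the process at all.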

the numbers

On the machine I tested, a modern-ish x86-64 box running a recent kernel, a no-op syscall round trip came in at roughly 300 to 500 nanoseconds. A plain function call, by contrast, is sub-nanosecond once it's in the instruction cache. So the headline ratio is something like two to three orders of magnitude. That sounds enormous, and in a tight inner loop it is. In almost any other context it's noise.

The number that surprised people who'd memorised an older figure is that 300 to 500ns. A decade ago you'd have quoted closer to 100ns for the same thing. The cost went up, and it went up on purpose. The Spectre and Meltdown mitigations from 2018 onwards added work to the kernel entry and exit path: page table isolation, speculation barriers, the lot. Crossing the user/kernel boundary got measurably more expensive almost overnight, and depending on your CPU and which mitigations are enabled, the gap between a "mitigations on" and "mitigations off" kernel can be a factor of two or more on syscall-heavy workloads. You can see this directly: boot with mitigations=off and re-run the benchmark, and the per-call cost drops noticeably. Don't actually run production like that, but it's a clean way to see where the time goes.
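
Before comparing runs it helps to know which mitigations the running kernel actually has enabled. Linux reports this under /sys/devices/system/cpu/vulnerabilities/, one file per issue. A small sketch that prints one entry follows; the helper name print_mitigation is hypothetical, and the files only exist on kernels new enough to expose them.

```c
#include <stdio.h>
#include <string.h>

/* Print the kernel's reported status for one vulnerability entry.
   Returns 0 if the sysfs file was read, -1 if it is not available.
   (Hypothetical helper for illustration.) */
int print_mitigation(const char *name) {
    char path[256];
    char line[512];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/vulnerabilities/%s", name);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    line[strcspn(line, "\n")] = '\0';  /* strip the trailing newline */
    printf("%-12s %s\n", name, line);
    return 0;
}
```

Calling it with "meltdown", "spectre_v1" and "spectre_v2" covers the entries most relevant here. Recording that output next to each benchmark result makes it much easier to explain later why two machines disagree.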

So the honest answer to "what does a syscall cost" is: a few hundred nanoseconds on current hardware, more than it used to, and the increase is the price of not leaking your memory to a malicious web page.

when it actually matters

Here's the part that matters more than the raw figure. A few hundred nanoseconds is irrelevant if you do it occasionally. It becomes everything if you do it millions of times a second. The question is never "is a syscall slow", it's "how many am I making, and could I make fewer".

The classic offender is one-byte-at-a-time I/O. A naive loop calling read() for each byte is paying that syscall cost per byte, and you'll see it dominate the profile immediately. The fix is as old as Unix: buffer. Read a few kilobytes per syscall and amortise the boundary crossing across all of them. A 4KB buffer turns 4096 syscalls into one. That's not a clever optimisation, it's just not being wasteful, and it's the single biggest lever in most I/O-bound code.
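
The arithmetic is easy to check by counting. The sketch below drains a file descriptor with a buffer of a given size and reports how many read() calls returned data; count_reads is a name I made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Read fd to end-of-file using a buffer of bufsize bytes.
   Returns the number of read() calls that returned data, or -1 on error.
   The final read() that reports end-of-file is not counted. */
long count_reads(int fd, size_t bufsize) {
    assert(bufsize > 0);
    char *buf = malloc(bufsize);
    if (!buf)
        return -1;

    long calls = 0;
    for (;;) {
        ssize_t got = read(fd, buf, bufsize);
        if (got < 0) {
            if (errno == EINTR)
                continue;  /* interrupted by a signal: retry */
            free(buf);
            return -1;
        }
        if (got == 0)
            break;  /* end of file */
        calls++;
    }
    free(buf);
    return calls;
}
```

With 4096 bytes waiting on the descriptor, a bufsize of 1 makes 4096 data-returning calls and a bufsize of 4096 makes one. Multiply the difference by a few hundred nanoseconds and the profile explains itself.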

When buffering isn't enough, the modern answer on Linux is io_uring. The whole premise is amortising or eliminating the syscall boundary for I/O. You submit a batch of operations through a shared ring buffer that both userspace and the kernel can see, and in the right mode (submission-queue polling, enabled with IORING_SETUP_SQPOLL) a kernel thread polls that ring so you can submit work without making a syscall at all. For a server pushing hundreds of thousands of operations a second, removing the per-operation kernel crossing is a real, measurable win, and it's the kind of win that shows up as a step change in a flame graph rather than a rounding error.

But, and this is the part I'd press hardest, io_uring is genuinely more complex than read() and write(). Completion handling, lifetime management of the submitted buffers, a steeper debugging story when something goes wrong. It has also had its share of security issues, to the point where some hardened environments disable it outright. So you reach for it when you've measured a real syscall bottleneck, not because a blog post (this one included) made it sound exciting.

the takeaway

Measure before you reach for the heavy machinery. A syscall costs a few hundred nanoseconds today, up from the figure you might have in your head, thanks to side-channel mitigations you almost certainly want to keep enabled. That cost is meaningless until you're making a great many of them, at which point the first and best fix is almost always to make fewer: buffer your I/O, batch your work, and only then consider something like io_uring. The number is worth knowing precisely so you can tell, with confidence, when it doesn't matter, which is most of the time.