what a syscall actually costs you, with numbers

"A syscall costs a few hundred nanoseconds" is one of those numbers everyone repeats and almost nobody has measured on their own hardware this decade. I got tired of guessing, so I sat down and measured it properly, because the gap between the folklore figure and the figure you actually pay turns out to matter a great deal for how you write I/O-bound code.

The short version: on the box in front of me, the cheapest syscall I could find costs roughly 60ns when conditions are kind, and a realistic one costs many times that, up into the low microseconds, once you account for the mitigations and cache effects that real workloads can't avoid. That's not a cost you can wave away. At a million calls a second, even 500ns per call is half a core gone before you've done any actual work.

Measuring the floor

You want the cheapest syscall you can find, so the number reflects the transition itself rather than whatever the kernel does once it's there. getpid() is the traditional choice, but older glibc versions (before 2.25) cached the result in user space, so a naive loop could end up timing a library call rather than a kernel entry. The safe move is to go around the library and invoke the syscall directly. The honest floor is something like getppid via the raw syscall() interface, in a tight loop, ideally with the loop overhead subtracted.

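#define _GNU_SOURCE          /* exposes syscall() under strict -std= modes */
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    const long N = 50000000;  /* enough iterations to swamp timer resolution */
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < N; i++) {
        /* Raw syscall: bypasses any libc wrapper, so every iteration enters the kernel. */
        syscall(SYS_getppid);
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    /* Elapsed wall time in nanoseconds; loop overhead is still included here. */
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("%.1f ns/call\n", ns / N);
    return 0;
}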

Compile with optimisation, pin it to a core with taskset, and run it a few times to let the frequency governor settle. On my machine that lands around 60ns when nothing is fighting it. Which sounds wonderful, and is also a lie, because almost nothing real costs that little.

A word on why pinning matters: if the scheduler bounces your loop between cores, you pay for the migration and you measure a smeared average across CPUs that may not even be running at the same frequency. Pin it, disable turbo if you want stable numbers, and run it long enough that the timer resolution stops being a meaningful fraction of the result. Fifty million iterations is comfortably enough. None of this is fussy science, it's just the difference between a number you can defend and a number you read off once and quoted in a meeting.

Where the number goes wrong

The 60ns figure is the kernel-entry mechanism with a warm cache and no real work. The moment you do something that touches user memory or actual kernel state, the picture changes:

Speculative-execution mitigations. Depending on your CPU and kernel boot flags, the entry and exit paths flush or fence things. Spectre and Meltdown mitigations alone can multiply the entry cost. Check /sys/devices/system/cpu/vulnerabilities/ to see what your kernel has switched on; the difference between mitigations=off and a hardened production box is not subtle.
Cache pollution. A real syscall like read() runs a chunk of kernel code that evicts your hot user-space data from L1 and L2. You don't see that cost in the syscall timing. You see it afterwards, as your own code runs slower because its working set got knocked out. This is the cost people forget, and it's the one that actually hurts in a real loop.
The work itself. read() on a small buffer, which has to look up the file, check permissions, and copy bytes from kernel space into your buffer, is comfortably into the hundreds of nanoseconds before you've moved any useful quantity of data.

Put those together and a "cheap" read() of a few bytes in a real program, with mitigations on, sits in the low-microsecond range surprisingly often. That's twenty to fifty times the 60ns floor, and several times the folklore number.

It's worth labouring the cache point because it's the one that catches experienced people out. When you benchmark a syscall in a tight loop, the kernel's hot paths stay resident and warm, so you measure a best case the kernel never actually delivers to a real workload. In production your code and the kernel's code are fighting over the same caches, and every trip across the boundary is a small act of vandalism against your own working set. The cost shows up displaced in time and in a different function, which is exactly why profilers often fail to pin it on the syscall: the syscall looks cheap, and the slowdown surfaces three functions later as "mysteriously slow user code". I've watched people optimise the wrong function for an afternoon because of this.

What this means in practice

The lesson isn't "syscalls are slow, panic". It's that the per-call overhead is large enough that the number of calls dominates, not the bytes moved. So you batch.

This is the entire reason a single read() of 64KB demolishes 64K reads of one byte each, far beyond what the copy cost alone explains. It's why buffered I/O exists. It's why writev() and sendmmsg() and friends are worth the awkward API: amortising one transition across many buffers is the win. And it's the design pressure behind io_uring, which lets you queue a batch of operations and cross the boundary once for the lot, rather than once per operation.

A few rules I now hold to:

Count syscalls, not just bytes. strace -c on a hot path is often more revealing than a profiler, because it shows you the call volume you've been ignoring.
A buffer of a few KB in front of any small-write loop is almost always worth it, and costs you nothing in complexity.
Before you reach for io_uring, check you've actually got a syscall-rate problem and not a copy-bandwidth problem. They have different fixes, and the fashionable one isn't always yours.

The boring conclusion

The number I'll carry around now isn't a single figure, it's a range and a reason. The mechanism is tens of nanoseconds; what you actually pay, once mitigations and cache effects are in, is the low microseconds. Wide enough that "how many times do I cross the boundary" is the question that matters, and narrow enough that you can reason about it.

Measure it on your own hardware before you trust any of the above, mine included. The whole point of this exercise was that the cached folklore number was wrong for my box, and it's probably wrong for yours too, just differently.