Ramblings of an aging IT geek

what a syscall actually costs, with numbers

Measuring the real cost of a syscall with a tight benchmark, and why batching with writev and io_uring exists at all.

People throw around "syscalls are expensive" as received wisdom, and I realised I'd never actually measured one. So I did. The short answer on my Linux box, a recent kernel on a Ryzen, is that a trivial syscall costs in the region of a few hundred nanoseconds, and that number explains an enormous amount about why high-performance I/O code looks the way it does.

A syscall is a controlled trip from user space into the kernel and back. The CPU has to switch privilege level, the kernel saves and later restores your registers, and on modern hardware there is extra accounting on entry and exit that wasn't there a decade ago. None of it is free, and that entry-and-exit overhead is roughly fixed per call, regardless of how much work the call goes on to do. That last part is the whole story.

measuring it

The cleanest thing to measure is a syscall that does almost nothing, so the overhead is the dominant cost. getpid() is the usual choice, with one caveat: glibc versions before 2.25 cached the PID in user space, so the call might never reach the kernel. Current glibc no longer caches it, but asking for the raw syscall removes any doubt:

#define _GNU_SOURCE            // exposes syscall() in <unistd.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <time.h>
#include <stdio.h>

int main(void) {
    const long iters = 10000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++) {
        syscall(SYS_getpid);   // raw syscall: always enters the kernel
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    // elapsed wall time in nanoseconds, including a little loop overhead
    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per syscall\n", ns / iters);
    return 0;
}

On this machine that prints somewhere around 350ns per call. Your number will differ, and it will differ a lot depending on mitigations. With the Spectre and Meltdown mitigations enabled, which is the default and the sane choice, the entry and exit cost is meaningfully higher than it was on pre-2018 kernels. Boot with mitigations off and you'll see a faster number, and you'll also be running a machine I wouldn't want holding anything sensitive.

why it matters

350ns sounds like nothing. It is nothing, once. The problem is when you do it in a loop. Imagine writing a megabyte to a socket one byte at a time, a write() per byte. That's a million syscalls, 350 milliseconds of pure overhead before you count any actual work, to move a megabyte that the hardware could shift in microseconds. This is the canonical reason buffered I/O exists. fwrite and friends accumulate your bytes in a user-space buffer and make one syscall per buffer-full instead of one per write. The buffer isn't there to save memory, it's there to amortise the trip into the kernel.

Once you see syscall cost as the thing you're amortising, a whole family of APIs suddenly makes sense:

  • writev and readv let you hand the kernel several buffers in a single call instead of one call per buffer. Scatter-gather, one trip.
  • sendmmsg and recvmmsg do the same for network messages: many datagrams, one syscall.
  • io_uring takes it furthest. You queue operations in a submission ring shared with the kernel and collect results from a completion ring, so one io_uring_enter call can submit a whole batch, and with submission-queue polling (IORING_SETUP_SQPOLL) the kernel can pick up new entries without any syscall at all. The entire design exists because the per-syscall cost, multiplied by millions of operations a second, became the bottleneck for high-throughput servers.

That progression is the same idea applied harder each time. The cost of crossing the user/kernel boundary is fixed, so the way to make I/O fast is to cross it less often, doing more per crossing.

the practical takeaway

None of this means you should fear syscalls. For ordinary code, making syscalls is completely fine and trying to avoid them is a waste of your life. A web handler that does a few reads and writes per request is nowhere near the regime where this matters, and buffered I/O already handles the common case for you. The number only becomes a problem at scale: tight loops, high-frequency I/O, the hot path of something doing millions of operations a second.

But knowing the actual figure changes how you reason. When someone says a syscall is "expensive", you can now say: about 350 nanoseconds on this box with the default mitigations on, fixed regardless of payload, and that is precisely why I batch. It turns a vague piece of folklore into a number you can multiply by your call count and decide whether you care. Most of the time you won't. When you do, you'll already know which API to reach for. Measure your own box, though, because the mitigations make it vary more than you'd expect.