what does a syscall actually cost?

A colleague said "syscalls are cheap, don't worry about it" in a review last week, and I realised I had been repeating the same line for years without ever putting a number on it. So I sat down and measured. The short version: a null syscall on my box costs about 500 nanoseconds, and that is roughly five times what it would have been before the page-table isolation mitigations landed. Cheap is relative.

This matters because the advice you inherited from a 2014 blog post may simply be wrong now. If you read "a syscall is a couple of hundred cycles" somewhere and built your mental model on it, that model predates KPTI, retpolines and IBRS. The floor moved.

the setup

The machine is a Ryzen 9 3900X running a 5.11 kernel on Arch, nothing exotic. I wanted the cheapest possible syscall so the number reflects the transition itself rather than the work done inside the kernel. getpid() is the classic choice, but glibc used to cache the result in userspace (the cache was dropped in glibc 2.25), so depending on the library version the wrapper may never actually trap. The honest options are calls the C library cannot short-circuit.

I went with the raw syscall(SYS_getpid) form, which sidesteps any wrapper-level cache, and cross-checked against close(-1), which does nothing useful (it fails with EBADF) but always enters the kernel.

#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

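/* Monotonic clock reading in nanoseconds. */
static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void) {
    const long N = 50000000;
    uint64_t start = now_ns();
    for (long i = 0; i < N; i++) {
        /* Raw syscall: goes straight to the kernel, skipping the getpid() wrapper. */
        syscall(SYS_getpid);
    }
    uint64_t end = now_ns();
    /* Average over N calls; the two clock reads are noise at this scale. */
    printf("%.1f ns/call\n", (double)(end - start) / N);
    return 0;
}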

The getpid loop runs fifty million iterations, so the two clock reads wash out and the per-iteration loop overhead is negligible next to the syscall. I pinned it to a single core with taskset -c 2 and set the governor to performance so the frequency wasn't bouncing around mid-run. Without that last step the numbers were all over the place, which is its own lesson.
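taskset is the command-line route. The same pinning can be done from inside the benchmark, which saves forgetting it on a rerun. This is a sketch, not what I ran: pin_to_cpu is a made-up name, and it relies on the Linux-specific sched_setaffinity call, which needs _GNU_SOURCE defined before any header is included. The governor is a system-wide setting and is not touched here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* exposes sched_setaffinity and the CPU_* macros in glibc */
#endif
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 with errno set on failure (for example when
 * the CPU does not exist or is outside the allowed set). */
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof set, &set);
}
```

A call to pin_to_cpu(2) at the top of main would stand in for taskset -c 2. Check the return value, because a benchmark that silently failed to pin is worse than one that never tried.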

the numbers

With mitigations on, the default state of any sane machine today:

508.3 ns/call

Then I rebooted with mitigations=off on the kernel command line, which disables KPTI, the Spectre variant 2 retpoline business, and the rest. Same binary, same core, same governor:

96.7 ns/call

So the security mitigations are costing roughly 410 nanoseconds per syscall on this hardware. That is not a rounding error. At the 3900X's 3.8 GHz base clock that is more than 1,500 cycles of pure tax, paid every time you cross the boundary.
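The conversion from nanoseconds to cycles is worth writing down once, because it is the step people skip. One GHz is one cycle per nanosecond, so it is a single multiplication. The function name ns_to_cycles is made up, and the 3.8 GHz I plug in below is the 3900X's advertised base clock, an assumption on my part: I did not record what the core was actually running at during the test.

```c
/* Hypothetical helper: convert a duration in nanoseconds to CPU cycles
 * at a clock speed given in GHz (1 GHz = 1 cycle per nanosecond). */
double ns_to_cycles(double ns, double ghz) {
    return ns * ghz;
}
```

With the two measurements above, ns_to_cycles(508.3 - 96.7, 3.8) comes to about 1,564 cycles. A core boosting above base would push that higher.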

To be clear: I am not suggesting you turn mitigations off. On a shared host or anything touching untrusted data that would be reckless. I turned them off for ten minutes to isolate one variable, then turned them back on. But it is worth knowing the size of the thing you are paying for.

what this means in practice

The interesting question is how many times you cross that boundary. A read() of one byte at a time through a loop pays the toll on every byte. The same data read in 64KB chunks pays it once per 65,536 bytes, so a gigabyte costs about sixteen thousand crossings instead of about a billion. This is why buffered I/O exists and why io_uring has people excited: it batches submissions so you amortise the transition across many operations.
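Counting the crossings directly makes the point without a stopwatch. The sketch below drains a file descriptor with read() calls of a chosen size and reports how many calls it took. count_reads is a made-up name, and the sketch assumes a regular file, where read() returns a full buffer until the data runs out. It does not retry on EINTR.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: read fd to end-of-file in chunks of at most bufsize
 * bytes and return the number of read() calls made, including the final
 * call that returns 0 to signal end-of-file. Returns -1 on failure. */
long count_reads(int fd, size_t bufsize) {
    char *buf = malloc(bufsize);
    if (buf == NULL) return -1;

    long calls = 0;
    for (;;) {
        ssize_t n = read(fd, buf, bufsize);
        calls++;                       /* every read() is one kernel crossing */
        if (n < 0) { free(buf); return -1; }
        if (n == 0) break;             /* end of file */
    }
    free(buf);
    return calls;
}
```

On a 1,000-byte file this makes 1,001 calls with a one-byte buffer and 11 with a 100-byte buffer. Multiply each count by the per-call cost and the case for bigger buffers makes itself.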

I redid the classic experiment, copying a 1GB file by hand, varying only the buffer size:

1 byte per read/write: minutes. Genuinely minutes. Almost all of it syscall overhead.
4KB: a couple of seconds, dominated by actual I/O.
1MB: marginally faster than 4KB, you're well past the point of caring.

The curve flattens hard once the buffer is large enough that the per-call cost disappears into the noise of the work being done. Past about a page or two you are optimising something that no longer matters.
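For anyone who wants to rerun the experiment, the copy loop is roughly the following. Treat it as a sketch rather than the exact program I timed: copy_fd is a made-up name, the caller is assumed to have opened both descriptors already, and there is no EINTR handling.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from in_fd to out_fd through a
 * buffer of bufsize bytes. Returns the number of bytes copied, or -1 on
 * error. Each pass costs one read() plus at least one write(). */
long long copy_fd(int in_fd, int out_fd, size_t bufsize) {
    char *buf = malloc(bufsize);
    if (buf == NULL) return -1;

    long long total = 0;
    for (;;) {
        ssize_t n = read(in_fd, buf, bufsize);
        if (n < 0) { free(buf); return -1; }
        if (n == 0) break;  /* end of input */

        /* write() may accept fewer bytes than asked, so loop until done. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w <= 0) { free(buf); return -1; }
            off += w;
        }
        total += n;
    }
    free(buf);
    return total;
}
```

Timing copy_fd with bufsize set to 1, 4096 and 1048576 on the same input is the whole experiment. Only the buffer size changes between runs.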

So the old advice isn't wrong so much as incomplete. A syscall is cheap if you make few of them. It is ruinously expensive if you make a hundred million of them in a tight loop, and modern security mitigations have made the second case worse than it used to be. The fix has always been the same: do more work per crossing.

The number I'll carry around from now on is "half a microsecond." Not because it's precise on your hardware, it won't be, but because it's the right order of magnitude to reason with, and it's five times bigger than the figure I'd been quoting from memory.