what a syscall actually costs you

A performance graph rendered on a terminal

"Avoid syscalls in the hot path" is advice you hear constantly and rarely see measured. So I measured it. The question I wanted answered was concrete: how many nanoseconds does the cheapest possible syscall cost on this machine, right now, after a year of kernel mitigations.

I picked getpid() because it does almost nothing. No I/O, no contention, no real kernel work. Whatever it costs is essentially the price of crossing the user/kernel boundary and coming back. I called it in a tight loop a few million times and divided.

#include <unistd.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec a, b;
    const long N = 10000000;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < N; i++) getpid();
    clock_gettime(CLOCK_MONOTONIC, &b);
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("%.1f ns/call\n", ns / N);
    return 0;
}

One catch: glibc used to cache the PID, so this would have measured nothing at all. Since glibc 2.25 it doesn't, so the call really does hit the kernel each time. Good, that's what I wanted.

A snippet of timing code on screen

On my desktop, a fairly ordinary Skylake, I get a touch over 300ns per call. That's the headline. It used to be closer to 100ns. The difference is Spectre and Meltdown: KPTI swaps page tables on every kernel entry, and the retpoline and other mitigations add their own tax. We paid for those bugs in 2018 and we're still paying, one boundary crossing at a time.

Put 300ns in context. It's roughly a thousand cycles you're not spending on your actual work. For a request that does one syscall, irrelevant. For a tight loop reading a socket a byte at a time, it's the whole game. This is the entire reason io_uring and friends exist, and the older reason readv/writev and buffering exist: not because the syscall does much, but because the trip itself isn't free, and got measurably less free last year.

The practical takeaway hasn't changed, only got sharper. Batch your I/O. Buffer. Read big and parse in user space rather than reading small and crossing the boundary repeatedly. And when someone hand-waves about syscall cost, run the loop yourself. The number on your box, after your kernel's mitigations, is the only one that matters.