Ramblings of an aging IT geek
performance

how much does a syscall actually cost?

Measuring the real per-call cost of a syscall on modern Linux, why it is more than it used to be, and how batching syscalls changed one service's throughput.

"A syscall is cheap" is one of those things everyone repeats and nobody measures. I had a service that was spending an embarrassing amount of wall-clock time in the kernel, and before I went rewriting anything I wanted an actual number for what one crossing of the user/kernel boundary costs on the hardware I was running on. The answer turned out to be more interesting than I expected, and quite a bit larger than the folklore figure.

The folklore is "a hundred nanoseconds or so". That was true once. It is not really true now, and the reason is the pile of CPU mitigations we have all been carrying since 2018.

measuring the floor

The cheapest syscall you can make is one that does almost nothing in the kernel. getpid() is the traditional choice, except that older glibc (before 2.25) cached the result in user space, so the safe way is to go around the library wrapper and issue the raw syscall. A tight loop making that call twenty million times, with the elapsed time divided by the iteration count, gives you a usable floor.

#define _GNU_SOURCE          /* exposes syscall() even under -std=c11 */
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    const long iters = 20000000;
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);    /* raw syscall, bypasses the glibc wrapper */
    clock_gettime(CLOCK_MONOTONIC, &b);
    /* elapsed ns; the two clock_gettime calls are noise at this iteration count */
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("%.1f ns/syscall\n", ns / iters);
    return 0;
}

On the box I tested, a fairly ordinary Xeon a couple of generations old, this came out around 450 ns with the default mitigations enabled. That is not "the call is basically free". At a million calls a second, 450 ns each is 0.45 s of CPU per second, which is nearly half a core spent on nothing but crossing the boundary.

To see how much of that is mitigations rather than the call itself, you can boot with mitigations=off on the kernel command line and re-run. I would not run a production box that way, but as a measurement it is illuminating: the same loop dropped to roughly 110 ns. So on this hardware, about three quarters of the cost of a trivial syscall (340 of 450 ns) was Spectre and friends, not the boundary crossing.
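
If rebooting is not an option, the kernel will at least report which mitigations it has switched on. Here is a minimal sketch, assuming a reasonably recent kernel that exposes /sys/devices/system/cpu/vulnerabilities. The helper name print_mitigations is my own, not anything from a library.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Print one "name  status" line per file in the given directory.
 * Returns the number of entries printed, or -1 if the directory cannot
 * be opened (old kernel, or sysfs not mounted). */
int print_mitigations(const char *dir) {
    DIR *d = opendir(dir);
    if (!d)
        return -1;

    int count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;                        /* skip "." and ".." */

        char path[512], line[256];
        snprintf(path, sizeof path, "%s/%s", dir, e->d_name);

        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(line, sizeof line, f)) {
            line[strcspn(line, "\n")] = '\0';   /* drop the trailing newline */
            printf("%-28s %s\n", e->d_name, line);
            count++;
        }
        fclose(f);
    }
    closedir(d);
    return count;
}
```

Call it as print_mitigations("/sys/devices/system/cpu/vulnerabilities") from main. Each file holds a one-line status for one vulnerability, so comparing the output of two machines is a quick way to see why their per-call numbers differ.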

the floor is not the ceiling

That 450 ns is the floor. A syscall that actually does something, say a read() on a socket, carries the boundary cost, plus whatever work the kernel does, plus the cache and TLB damage from having been in kernel mode at all. The last part is the sneaky one. After a syscall returns, some of your hot user-space data may have been evicted, so the next few user-space instructions stall on memory that used to be in L1. That cost does not show up in a microbenchmark that does nothing but syscall in a loop, because the loop has no working set to evict. It shows up in real code, and it is why microbenchmarks make syscalls look cheaper than they are in situ.
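
One way to get closer to the in-situ number is to give the benchmark a working set and time the loop with and without the syscall. The sketch below is an assumption-laden version of that idea, not what I ran for the numbers above: the 24 KiB working set and 64-byte line size are guesses you should adjust for your CPU, and ns_per_iter is a name I made up. With a call as trivial as getpid the gap over the bare-loop figure may be small; swap in the syscall your service actually makes to see a realistic effect.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

enum { WORKING_SET = 24 * 1024 };   /* bytes; assumed to fit in L1d */
enum { LINE = 64 };                 /* assumed cache line size */

/* volatile forces a real load for every read, so the compiler cannot
 * fold the loop away */
static volatile unsigned char buf[WORKING_SET];
static volatile uint64_t sink;

static double now_ns(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1e9 + t.tv_nsec;
}

/* Average ns per iteration of: read one byte from every cache line of
 * the working set, then optionally make one raw getpid syscall. */
double ns_per_iter(long iters, int do_syscall) {
    double start = now_ns();
    for (long i = 0; i < iters; i++) {
        uint64_t sum = 0;
        for (size_t j = 0; j < WORKING_SET; j += LINE)
            sum += buf[j];
        sink = sum;
        if (do_syscall)
            syscall(SYS_getpid);
    }
    return (now_ns() - start) / iters;
}
```

The figure of interest is ns_per_iter(n, 1) - ns_per_iter(n, 0): what one syscall costs when there is hot user-space data on either side of it, rather than in an empty loop.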

what i did about it

The service in question was reading from a socket one message at a time and writing one response at a time. Two syscalls per message, at maybe 800k messages a second across the fleet. The fix was not clever, just recvmmsg and sendmmsg, the vectored variants that handle a batch of messages per call.

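#define _GNU_SOURCE   /* recvmmsg() is Linux-specific; define before the includes */
#include <sys/socket.h>

struct mmsghdr msgs[64];
/* fill msgs ... */
int n = recvmmsg(fd, msgs, 64, MSG_DONTWAIT, NULL);  /* n = messages received, or -1 */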

Batching 64 messages per call turned two syscalls per message into roughly two syscalls per 64 messages. Fleet-wide, that is 1.6 million calls a second down to about 25,000 when the batches are full. The per-message syscall overhead fell by about that factor, and CPU on the hot path dropped by a third under load. Throughput went up because we were no longer burning cores on boundary crossings. None of the application logic changed.
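
For anyone who has not used these calls, the part hidden behind "fill msgs" is mostly pointing each slot at its own buffer. Below is a minimal sketch of the receive side, assuming a datagram socket and a 2048-byte upper bound on message size. Both are my assumptions for illustration, and recv_batch, BATCH and MAX_MSG_LEN are hypothetical names rather than the service's real code.

```c
#define _GNU_SOURCE            /* recvmmsg() and struct mmsghdr are Linux-specific */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

enum { BATCH = 64, MAX_MSG_LEN = 2048 };   /* MAX_MSG_LEN is an assumed bound */

/* Receive up to BATCH datagrams from fd in one syscall.
 * bufs[i] receives the payload of message i and lens[i] its length.
 * Returns the number of messages received, or -1 with errno set
 * (EAGAIN when nothing is queued, because of MSG_DONTWAIT). */
int recv_batch(int fd, char bufs[BATCH][MAX_MSG_LEN], size_t lens[BATCH]) {
    struct mmsghdr msgs[BATCH];
    struct iovec iov[BATCH];

    memset(msgs, 0, sizeof msgs);
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];           /* one buffer per slot */
        iov[i].iov_len = MAX_MSG_LEN;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    int n = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
    for (int i = 0; i < n; i++)
        lens[i] = msgs[i].msg_len;           /* bytes received in slot i */
    return n;
}
```

The caller owns the buffers (at 128 KiB they are better off static or heap-allocated than on the stack) and loops over the first n slots after each call. The send side with sendmmsg has the same shape: fill iov_base and iov_len for each response, make one call, and check how many messages the kernel accepted.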

the takeaways

Measure on your own hardware, because the mitigation tax depends entirely on the CPU and which workarounds are active. Treat the microbenchmark number as a floor and assume real cost is higher because of cache effects you cannot easily see. And before you reach for io_uring or anything exotic, check whether the boring vectored syscalls already solve it. They usually do, and they are a much smaller change to reason about.

The wider point: "a syscall is cheap" stopped being a useful default around 2018. If your hot path crosses the boundary per item, that is now a number worth knowing rather than a thing worth hand-waving.