the cost of a syscall, measured

People throw the phrase "syscalls are expensive" around like it settles an argument. It doesn't, because nobody quotes a number, and the number has moved a lot in the last five years. So I sat down and measured it, on the actual hardware I care about, rather than repeating a figure I half-remembered from a conference talk.

The setup is deliberately boring. A loop that calls getpid() a few hundred million times, with the result used so the compiler can't elide it, timed with clock_gettime(CLOCK_MONOTONIC). getpid() is the cheapest syscall worth measuring: it does almost no kernel work, so what you're timing is the transition itself, not the thing on the other side.

for (long i = 0; i < N; i++) {
    sum += syscall(SYS_getpid);
}

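Note the syscall() wrapper. glibc used to cache the PID in userspace (the cache was removed in glibc 2.25), so on an older libc a direct getpid() call measures a function call and nothing else. Going through syscall() forces the real transition every time, whatever libc you happen to have.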
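For anyone who wants to reproduce this, here is a minimal sketch of the whole measurement, assuming Linux with glibc and a GCC or Clang toolchain. The helper names elapsed_ns and ns_per_getpid are mine, chosen for illustration, and this is a stripped-down harness, not a careful benchmark (no warm-up, no CPU pinning, no repeated runs).

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC readings. */
static double elapsed_ns(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) * 1e9
         + (double)(end->tv_nsec - start->tv_nsec);
}

/* Average cost in nanoseconds of one raw getpid syscall over n iterations.
 * Returns a negative value if n is not positive or the clock cannot be read. */
double ns_per_getpid(long n)
{
    struct timespec start, end;
    volatile long sum = 0; /* volatile so the accumulated result counts as used */

    if (n <= 0 || clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    for (long i = 0; i < n; i++)
        sum += syscall(SYS_getpid); /* bypasses any libc-level shortcut */

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    return elapsed_ns(&start, &end) / (double)n;
}
```

Call ns_per_getpid() from main with a few hundred million iterations, print the result, and build with optimisation on (cc -O2 or similar). The figure includes the loop overhead and the syscall() wrapper itself, which is small next to the transition but not zero.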

The headline number on this box, a fairly ordinary recent x86-64 server with the usual Spectre and Meltdown mitigations enabled, is around 350 to 450 nanoseconds per syscall. That sounds tiny until you put it next to a function call, which is single-digit nanoseconds, or an L1 cache hit, which is around one. The transition is two to three orders of magnitude more expensive than staying in userspace. That gap is the whole reason io_uring exists.

Then I rebooted with mitigations=off purely to see the delta, and the same loop dropped to roughly 70 to 90 nanoseconds. So roughly four fifths of the cost on this machine, somewhere between 75 and 85 percent depending on which ends of the two ranges you pair up, is page-table isolation and the speculation barriers, not the architectural mechanism. I am not suggesting you turn mitigations off on anything that matters. I am suggesting that when someone says syscalls got slow, they are usually describing a security tax, not a law of physics.
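The share attributable to mitigations is simple arithmetic on the two measurements, sketched here so the range is explicit. The function name mitigation_share is mine, and the inputs in the comments are just the endpoints of the ranges quoted above, not additional measurements.

```c
/* Fraction of the mitigated per-syscall cost that goes away with mitigations off.
 * Both arguments are nanoseconds per syscall; ns_mitigated must be positive. */
double mitigation_share(double ns_mitigated, double ns_unmitigated)
{
    return 1.0 - ns_unmitigated / ns_mitigated;
}

/* Pairing the range endpoints:
 *   mitigation_share(350.0, 70.0)  -> 0.80
 *   mitigation_share(450.0, 90.0)  -> 0.80
 *   mitigation_share(350.0, 90.0)  -> about 0.74 (least favourable pairing)
 *   mitigation_share(450.0, 70.0)  -> about 0.84 (most favourable pairing)
 */
```

Pairing low with low and high with high gives 80 percent both times. The cross pairings stretch that to roughly 74 to 84 percent, which is about as much precision as numbers this coarse can support.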

A few things fall out of this once you have the number in front of you:

A workload doing 10,000 syscalls a second is spending a few milliseconds per second in transitions. Irrelevant. Stop optimising it.
A tight loop doing one syscall per 4KB of I/O at gigabytes per second is spending real time at the boundary, and batching with io_uring or larger buffers is the actual fix.
The advice "reduce syscalls" is only useful relative to how many you make. Profile first, because the difference between those two cases is one to two orders of magnitude in time spent at the boundary.
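The two cases above are easy to put numbers on. This is a back-of-the-envelope sketch, assuming 400 ns per syscall (the middle of the measured range) and 1 GB/s taken as 1e9 bytes per second; both function names are mine.

```c
/* Fraction of one CPU core spent in user/kernel transitions. */
double syscall_core_fraction(double syscalls_per_sec, double ns_per_syscall)
{
    return syscalls_per_sec * ns_per_syscall / 1e9;
}

/* Syscalls per second needed to move bytes_per_sec with one call per buffer. */
double syscalls_per_sec_for_io(double bytes_per_sec, double buffer_bytes)
{
    return bytes_per_sec / buffer_bytes;
}

/* Worked examples at an assumed 400 ns per syscall:
 *   10,000 syscalls/s            -> 0.004 of a core (4 ms per second)
 *   1e9 bytes/s in 4096 B calls  -> about 244,000 syscalls/s, about 0.098 of a core
 *   1e9 bytes/s in 65536 B calls -> about 15,300 syscalls/s, about 0.006 of a core
 */
```

At these assumptions the 4KB loop spends roughly a tenth of a core at the boundary for every gigabyte per second it moves, and moving to 64KB buffers puts it back in the same territory as the workload nobody should be optimising.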

The broader point is that "expensive" is not a property, it is a comparison. A syscall is expensive next to a function call and cheap next to a disk seek. Measure on the box you ship on, write the number down, and then you get to have the argument with evidence instead of folklore.