what a syscall actually costs you

Someone on the team claimed a syscall was "basically free". I disagreed, then realised I had no number to hand. So I measured it, because an argument without a number is just two people being loud at each other.

The classic cheap syscall is getpid(). It does almost nothing: jump into the kernel, read a value, jump back. If anything is going to show you the bare floor cost of the user-to-kernel transition, it's that. Here's the loop I used, nothing clever:

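#define _GNU_SOURCE   /* makes sure syscall() is declared under strict -std= modes */
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    /* Raw syscall on every iteration, bypassing the libc getpid() wrapper. */
    for (long i = 0; i < 100000000L; i++)
        syscall(SYS_getpid);
    return 0;
}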

I went through syscall() directly rather than the libc getpid() wrapper on purpose. glibc before 2.25 caches the pid, so there getpid() mostly returns from a variable and never touches the kernel at all. That cache is great in production and useless when the whole point is to time the kernel crossing.

On a fairly ordinary Haswell box running a 4.4 kernel, that hundred-million-iteration loop comes out around 0.3 seconds of wall time, so roughly 3 nanoseconds a call once you account for the loop overhead. That sounds tiny, and on its own it is. The trouble is that 3ns is the floor, and almost nothing you actually do is at the floor.
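
If you want a per-call figure from inside the program rather than from wall time and division, something like the following is a minimal sketch. It is not the exact harness I ran: ns_per_getpid and now_ns are names made up for illustration, and subtracting an empty loop is only a rough way to account for loop overhead. It assumes Linux and a monotonic clock.

```c
#define _GNU_SOURCE
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Current monotonic clock reading, in nanoseconds. */
static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Hypothetical helper: average nanoseconds per raw getpid syscall,
 * with the cost of an equally long empty loop subtracted. */
double ns_per_getpid(long iters) {
    double t0 = now_ns();
    /* volatile counter so the compiler cannot delete the empty loop */
    for (volatile long i = 0; i < iters; i++)
        ;
    double t1 = now_ns();
    for (volatile long i = 0; i < iters; i++)
        syscall(SYS_getpid);
    double t2 = now_ns();

    /* (syscall loop time) minus (empty loop time), per iteration */
    return ((t2 - t1) - (t1 - t0)) / (double)iters;
}
```

Call ns_per_getpid(100000000L) from your own main and print the result. Run it a few times and take the minimum, since anything else running on the box will only push the number up.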

[Figure: perf output annotating the hot loop]

The interesting part showed up under perf stat. The naked syscall is fast, but the moment you do real work (a read() that misses the page cache, a write() that flushes, anything that schedules), you are paying for the context switch, the cache pollution and the pipeline flush, not the 3ns crossing. I saw the same getpid loop jump to well over 100ns a call simply by adding heavy cache pressure around it, and this was before anyone was talking about page table isolation. The crossing is cheap; the side effects are not.

Which is the actual point, and it's an old one: batch your syscalls. The reason writev, sendmmsg and friends exist is that the per-call cost, modest as it is, multiplies. A logging path doing one write() per line at a million lines a second is spending real time in the kernel for no good reason. Buffer it, flush it in chunks, and the syscall count drops by two or three orders of magnitude, depending on how many lines fit in the buffer.
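
To make the buffering concrete, here is a minimal sketch of a line buffer that collects log lines in user space and issues one write() per full buffer. Everything in it is an assumption for illustration: struct logbuf, the logbuf_* functions and the 64 KiB capacity are invented names and numbers, not anything from a real logging library. It also skips things a production logger needs, such as flushing on a timer or at exit.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define LOGBUF_CAP (64 * 1024)   /* illustrative size, not tuned */

struct logbuf {
    int    fd;
    size_t len;                  /* bytes currently buffered */
    long   writes;               /* write() calls issued so far */
    char   data[LOGBUF_CAP];
};

/* Open (or create) the log file in append mode and reset the buffer. */
int logbuf_open(struct logbuf *b, const char *path) {
    b->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    b->len = 0;
    b->writes = 0;
    return b->fd < 0 ? -1 : 0;
}

/* Push all buffered bytes to the fd, looping over short writes. */
int logbuf_flush(struct logbuf *b) {
    size_t off = 0;
    while (off < b->len) {
        ssize_t n = write(b->fd, b->data + off, b->len - off);
        b->writes++;
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted: retry the write */
            /* keep the unwritten tail so a later flush can retry it */
            memmove(b->data, b->data + off, b->len - off);
            b->len -= off;
            return -1;
        }
        off += (size_t)n;
    }
    b->len = 0;
    return 0;
}

/* Append one line; only touches the kernel when the buffer is full. */
int logbuf_append(struct logbuf *b, const char *line, size_t n) {
    if (n > LOGBUF_CAP)
        return -1;               /* sketch: oversized lines are rejected */
    if (n > LOGBUF_CAP - b->len && logbuf_flush(b) < 0)
        return -1;
    memcpy(b->data + b->len, line, n);
    b->len += n;
    return 0;
}
```

With 100-byte lines, 655 of them fit in the buffer before a flush, so a thousand lines plus a final logbuf_flush() cost two write() calls instead of a thousand, as long as the fd accepts each write in full. The struct is over 64 KiB, so allocate it statically or on the heap rather than on a small stack.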

So: not free, but not the thing to optimise first either. Measure the workload, not the primitive. My colleague was right that a single getpid won't hurt you. I was right that "syscalls are free" is the sort of belief that eventually shows up as 30% system time in a flame graph and a very confused afternoon.