the cost of a syscall, measured

People throw around "syscalls are expensive" as if it settles an argument. It usually doesn't, because nobody can tell you how expensive, on what, compared to what. So I sat down with a box and a rdtsc loop and put an actual number on it, because a number is something you can argue with.

The setup is deliberately boring. One core pinned, governor set to performance, a tight loop calling getpid() a few million times. getpid() is the canonical "does nothing useful" syscall: it enters the kernel, reads a value, returns. Whatever it costs is roughly the floor for crossing the boundary at all.
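
The pinning step can also be done from inside the benchmark rather than from the shell. A minimal sketch, assuming Linux with glibc; pin_to_cpu is a name made up for this example, and nothing here says this is how the box above was actually pinned:

```c
#define _GNU_SOURCE   /* needed for cpu_set_t and sched_setaffinity in glibc */
#include <sched.h>

/* Restrict the calling thread to a single CPU so the timing loop
 * cannot migrate between cores mid-measurement.
 * Returns 0 on success, -1 on failure (errno is set by the kernel call). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = calling thread */
}
```

Call pin_to_cpu(n) once before the loop, with n being a CPU the process is allowed to run on. Pinning stops migration; it does not stop other work from being scheduled on that core, so a quiet machine still matters.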

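/* rdtscp() is assumed to be a thin wrapper, e.g. around __rdtscp() from <x86intrin.h>; N is a few million. */
unsigned long long start = rdtscp();
for (long i = 0; i < N; i++)
    getpid();   /* glibc before 2.25 cached the PID in userspace; use syscall(SYS_getpid) there */
unsigned long long end = rdtscp();
printf("%.1f cycles/call\n", (double)(end - start) / N);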
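
For anyone who wants to reproduce this, here is one way the fragment could be fleshed out into something that compiles. It is a sketch under assumptions, not the exact harness behind the numbers in this post: it assumes x86-64 Linux with GCC or Clang, it issues the raw syscall so that libc-level PID caching cannot hide the kernel entry, and the function names are invented for the example.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <x86intrin.h>   /* __rdtscp */

/* Average TSC ticks per raw getpid syscall over n calls. */
static double ticks_per_getpid(long n)
{
    unsigned int aux;                  /* receives IA32_TSC_AUX; value unused */
    uint64_t start = __rdtscp(&aux);
    for (long i = 0; i < n; i++)
        syscall(SYS_getpid);           /* raw syscall, bypasses any libc caching */
    uint64_t end = __rdtscp(&aux);
    return (double)(end - start) / (double)n;
}

/* Lowest of `runs` measurements, taken after one short warm-up pass.
 * The minimum is the run least disturbed by interrupts and other noise. */
static double best_ticks_per_getpid(int runs, long n)
{
    (void)ticks_per_getpid(n / 10 + 1);   /* warm caches and branch predictors */
    double best = ticks_per_getpid(n);
    for (int r = 1; r < runs; r++) {
        double t = ticks_per_getpid(n);
        if (t < best)
            best = t;
    }
    return best;
}
```

Something like best_ticks_per_getpid(5, 5000000) printed with "%.1f cycles/call" gives a figure comparable to the loop above. One caveat: the TSC on modern x86 parts ticks at a fixed reference rate, so the result equals core cycles only when the core is running at that rate.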

On this machine, an unremarkable Skylake-era server, a bare getpid() lands somewhere around 70 cycles when nothing is in the way. That is the comfortable, pre-2018 world. The trouble is that almost nobody runs in that world any more, because Spectre and Meltdown happened and the mitigations are not free.

Turn the mitigations on, which is to say leave the machine in its default secure state, and the same call climbs to several hundred cycles. Page-table isolation (PTI) alone adds a page-table switch on the way in and another on the way out, with the TLB cost that goes with it, and on this box it roughly triples the cost. You can see it directly by booting with pti=off on the kernel command line and re-running, though obviously don't do that on anything you care about.
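
Before comparing runs it helps to confirm what the running kernel thinks it is mitigating. Linux 4.15 and later report this under /sys/devices/system/cpu/vulnerabilities/. The sketch below just reads two of those files; the helper names are made up for the example, and the files may be absent in some containers or on older kernels, which the code treats as "not available".

```c
#include <stdio.h>
#include <string.h>

/* Copy the first line of `path` into buf with the newline stripped.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
static int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (!ok)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Print the kernel's own report for Meltdown (PTI) and Spectre v2. */
static void print_mitigation_status(void)
{
    static const char *files[] = {
        "/sys/devices/system/cpu/vulnerabilities/meltdown",
        "/sys/devices/system/cpu/vulnerabilities/spectre_v2",
    };
    char line[256];
    for (size_t i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
        if (read_first_line(files[i], line, sizeof(line)) == 0)
            printf("%s: %s\n", files[i], line);
        else
            printf("%s: not available\n", files[i]);
    }
}
```

With PTI active the meltdown file typically says so in its "Mitigation:" line; after booting with pti=off on an affected CPU it should report the machine as vulnerable instead.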

So the honest answer to "how expensive is a syscall" is: a few hundred cycles, give or take, which at a few GHz is on the order of a hundred nanoseconds, and a meaningful chunk of that is the security tax we all agreed to pay in 2018. For a request handler doing one syscall, irrelevant. For a hot loop doing one per byte, ruinous.
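
To put a rough figure on "ruinous", the arithmetic is just clock rate divided by cycles per call, times bytes moved per call. The function below is a back-of-the-envelope sketch with a made-up name; it ignores every cost except the boundary crossing, and the inputs used in the note after it are illustrative assumptions, not measurements from this box.

```c
/* Upper bound on bytes per second when each syscall costs
 * `cycles_per_call` cycles on a core clocked at `hz` and moves
 * `bytes_per_call` bytes. All other work is ignored. */
static double syscall_bound_bytes_per_sec(double hz,
                                          double cycles_per_call,
                                          double bytes_per_call)
{
    return hz / cycles_per_call * bytes_per_call;
}
```

Assuming 3 GHz and 300 cycles per call, one byte per call caps out at 10 MB/s. The same call moving 64 KiB at a time has a ceiling in the hundreds of GB/s, which is another way of saying the syscall has stopped being the bottleneck.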

Which is the actual point. The cost was never really about the syscall in isolation; it's about how many you make. This is why io_uring is interesting, why sendfile exists, why batching with recvmmsg beats a recv per packet. The win isn't a faster syscall, it's fewer of them. Measure your own box before you optimise: the numbers move a lot between microarchitectures and kernel versions, but the shape of the lesson holds. Amortise the boundary crossing or stop crossing it.
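
The simplest place to see the amortisation is plain read() with different buffer sizes. The sketch below is not io_uring or recvmmsg, just a counter: it reads a fixed number of bytes and reports how many read() calls that took. count_reads is a name invented for the example, and /dev/zero is used as the source on the assumption that it normally satisfies each request in full on Linux.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read `total` bytes from `path` through a buffer of `chunk` bytes.
 * Returns the number of read() calls made, or -1 on any error. */
static long count_reads(const char *path, size_t total, size_t chunk)
{
    if (chunk == 0)
        return -1;
    char *buf = malloc(chunk);
    if (!buf)
        return -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    long calls = 0;
    size_t done = 0;
    while (done < total) {
        size_t left = total - done;
        size_t want = left < chunk ? left : chunk;
        ssize_t got = read(fd, buf, want);
        if (got <= 0) {            /* error or unexpected end of file */
            calls = -1;
            break;
        }
        done += (size_t)got;
        calls++;
    }

    close(fd);
    free(buf);
    return calls;
}
```

Reading 1 MiB from /dev/zero one byte at a time is 1,048,576 boundary crossings; with a 64 KiB buffer it is 16. Multiply each count by whatever per-call cost you measured on your own machine and the difference speaks for itself.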