"It's slow because of the syscalls" is the kind of thing engineers say with confidence and no numbers. I've said it myself. So I sat down to actually measure what a system call costs, because a guess you've repeated often enough starts to feel like a fact, and I wanted to replace the feeling with a figure.
The short version: on the machine I tested, a trivial syscall costs a few hundred nanoseconds round trip. The surprising part isn't the number; it's how that number behaves once you start making millions of the things.
The benchmark
I picked the cheapest syscall I could think of, one that does almost nothing in the kernel, so I'd be measuring the crossing itself rather than the work. getpid() is ideal: it returns a number the kernel already knows, with no I/O, no locking, no allocation. Call it in a tight loop a few million times and divide.
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        const long n = 10000000;  /* ten million calls */
        struct timespec a, b;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (long i = 0; i < n; i++) {
            getpid();  /* result ignored; only the crossing matters */
        }
        clock_gettime(CLOCK_MONOTONIC, &b);

        /* Elapsed wall time in nanoseconds, averaged over n calls. */
        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("%.1f ns/call\n", ns / n);
        return 0;
    }
One trap: glibc used to cache the pid, so getpid() wouldn't actually enter the kernel at all and you'd measure nothing. That caching was removed in glibc 2.25, so a current glibc does make the real call, but if your numbers come out suspiciously low, that's the first thing to check. To be sure I was measuring the kernel crossing, I also ran a version calling the syscall directly via syscall(SYS_getpid), which bypasses any library cleverness. The two agreed.
The number
On the box I used, the direct syscall came out around 300 nanoseconds per call. That's the cost of switching from user mode into the kernel and back, with the trivial work in between rounding to nothing. Three hundred nanoseconds sounds like rounding error, and for a single call it is. You'd never notice one.
The point is what happens when you stop making one and start making millions.
At 300ns each, ten million syscalls is three seconds of pure overhead, doing no useful work, just crossing the boundary back and forth. And ten million is not a large number. A program reading a file one byte at a time issues one read per byte. A megabyte file is a million syscalls before you've done anything with the data. That's where the "it's the syscalls" intuition comes from, and measuring it makes the intuition concrete instead of folkloric.
Why buffering wins
This is the whole argument for buffered I/O in one sentence: it trades a million cheap-looking syscalls for a handful. Read a file in 64KB chunks instead of single bytes and you've cut the syscall count by a factor of sixty-five thousand, so that one-megabyte file now takes about sixteen reads instead of a million. The per-call cost didn't change. You just stopped paying it so often.
I demonstrated it crudely by reading a 100MB file two ways, one byte per read and 64KB per read. The byte-at-a-time version spent its entire life in syscall overhead and took the better part of a minute. The buffered version finished in well under a second. Same data, same disk, same kernel. The only difference was how many times we crossed the boundary.
This is also why batching shows up everywhere once you start looking. writev instead of many write calls. sendmmsg to push multiple packets in one go. io_uring, which is just landing in the kernel as I write this (it arrived in Linux 5.1) and takes the idea to its logical end by letting you queue many operations and submit them with a single crossing. They're all the same trick at heart: the syscall is cheap individually and ruinous in bulk, so make fewer of them.
The honest conclusion
A syscall is not expensive. Three hundred nanoseconds is nothing. What's expensive is treating something that costs three hundred nanoseconds as if it were free and then doing it ten million times. The fix is almost never to make the call faster. You can't: it's the kernel's number, not yours. The fix is to make fewer calls: buffer, batch, and amortise the crossing over as much real work as you can.
So I've retired "it's the syscalls" as a vague accusation and replaced it with a question I can answer: how many syscalls is this code making, and can I make it issue fewer? That's a thing you can measure with strace -c in about thirty seconds, and the count, more than the per-call cost, is nearly always where the answer lives.