"Syscalls are expensive" is one of those things everyone repeats and almost nobody has a number for. I got tired of waving my hands, so I measured one.
The trick is to pick a syscall that does almost nothing in the kernel, so what you're timing is the crossing itself rather than the work. getppid() is the classic choice. On the kernel side it just reads the parent's PID out of the process's bookkeeping and returns. glibc doesn't cache it (unlike getpid(), which older glibc versions did cache), so every call really does enter the kernel.
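#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    long n = 50000000;  /* enough iterations to swamp the two timer calls */
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < n; i++)
        getppid();      /* result discarded; only the kernel round trip matters */
    clock_gettime(CLOCK_MONOTONIC, &b);

    /* total elapsed nanoseconds, then averaged over the iteration count */
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("%.1f ns/call\n", ns / n);
    return 0;
}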
On a fairly ordinary x86-64 Linux box this came out around 80ns per call for me. Your number will differ: CPU, kernel version, and whether Spectre- and Meltdown-style mitigations (kernel page-table isolation in particular) are enabled all move it.
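If you'd rather not take the claim about glibc on trust, one option on Linux is to skip the getppid() wrapper and go through the generic syscall() entry point instead. This is a sketch under that assumption, not a portable recipe: raw_getppid_ns is a name I made up, and SYS_getppid and syscall() are Linux/glibc specifics.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average nanoseconds per getppid syscall over n iterations, issued
   through syscall() rather than the getppid() library wrapper.
   Hypothetical helper; Linux-specific. */
double raw_getppid_ns(long n) {
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < n; i++)
        syscall(SYS_getppid);   /* no libc-level caching is possible here */
    clock_gettime(CLOCK_MONOTONIC, &b);

    return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / n;
}
```

If this lands in the same ballpark as the loop above, the wrapper wasn't hiding anything from you.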
So what does 80ns mean? On its own, nothing. It's a rounding error next to a disk read or a network round trip. The trouble is hot loops. If you're making a syscall per byte instead of per buffer, that 80ns gets multiplied by something with a lot of zeroes, and suddenly a third of your wall-clock time is spent crossing into the kernel and back for no good reason.
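To put a number on "a lot of zeroes", here's the back-of-envelope arithmetic as a function. It's a rough sketch: crossing_seconds is a hypothetical helper, and it only counts the crossings, not the copying or the actual I/O.

```c
/* Seconds spent on kernel crossings alone when moving total_bytes
   with one syscall per chunk_bytes. ns_per_call is whatever you
   measured on your own machine. Hypothetical helper. */
double crossing_seconds(double total_bytes, double chunk_bytes, double ns_per_call) {
    double calls = total_bytes / chunk_bytes;   /* ignores a partial last chunk */
    return calls * ns_per_call * 1e-9;          /* ns -> s */
}
```

Plugging in my 80ns: 1 GiB moved one byte per call is 1,073,741,824 crossings, or about 86 seconds of pure overhead. The same 1 GiB in 64 KiB chunks is 16,384 crossings, or about 1.3 milliseconds.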
That's the actual lesson, and it's older than I am: batch your I/O. Read into a big buffer and parse from memory. Write through a buffered writer and flush deliberately. The point of measuring the syscall wasn't to optimise the syscall, it was to give myself a concrete sense of when a syscall in a loop has stopped being free and started being the bill.
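Here's roughly what "read into a big buffer and parse from memory" looks like with plain read(). Treat it as an illustration rather than a benchmark: count_newlines and its chunk parameter are names I made up, the 64 KiB buffer size is an arbitrary assumption, and it doesn't retry on EINTR.

```c
#include <stddef.h>
#include <unistd.h>

/* Counts '\n' bytes readable from fd, asking the kernel for at most
   `chunk` bytes per read(). Stores the number of read() calls made
   (including the final one that reports EOF) in *calls.
   Returns the newline count, or -1 on a read error. */
long count_newlines(int fd, size_t chunk, long *calls) {
    char buf[65536];
    long lines = 0;

    if (chunk == 0 || chunk > sizeof buf)
        chunk = sizeof buf;     /* clamp to the buffer we actually have */
    *calls = 0;

    for (;;) {
        ssize_t got = read(fd, buf, chunk);
        (*calls)++;             /* every read() is one kernel crossing */
        if (got < 0)
            return -1;
        if (got == 0)
            break;              /* EOF */
        for (ssize_t i = 0; i < got; i++)   /* parse from memory, no syscalls */
            if (buf[i] == '\n')
                lines++;
    }
    return lines;
}
```

Call it with chunk set to 1 and you get the syscall-per-byte pattern; call it with 65536 and you get the batched one. The newline count is the same either way, but the value left in *calls is the part that gets multiplied by your ns/call figure.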
If you take one thing away, make it this: don't trust the folklore, and don't trust me either. Run the loop on your own hardware. The number is cheap to get and it'll stop you arguing about it in code review.