"Syscalls are expensive" is one of those things everyone repeats and almost nobody quantifies. I wanted an actual number for my own hardware rather than a hand-wave, so I measured it. The headline: a trivial syscall on this machine costs a few hundred nanoseconds. That sounds tiny until you remember how many of them a busy program makes.
The simplest measurement is a tight loop calling something that does almost nothing in the kernel. getppid() is the classic choice because it can't be served from userspace by the vDSO, unlike gettimeofday, so you actually pay the full transition.
    for (long i = 0; i < N; i++)
        syscall(SYS_getppid);
Divide the wall time by N and you get the round-trip cost of entering the kernel, doing nearly nothing, and coming back. On this box it lands around 300ns per call. For comparison, an ordinary function call costs on the order of a nanosecond or less, so you're paying somewhere between a few hundred and a thousand times more to cross the boundary.
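For anyone who wants to repeat the measurement, here is a minimal sketch of how the loop above could be wrapped in a timer. It assumes Linux with glibc; the helper names elapsed_ns and getppid_cost_ns are mine, and CLOCK_MONOTONIC is just one reasonable choice of clock, not necessarily what produced the 300ns figure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds between two timespec readings (hypothetical helper). */
static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) * 1e9 + (double)(b.tv_nsec - a.tv_nsec);
}

/* Average wall-clock cost of one getppid round trip, in nanoseconds.
   n is the iteration count, the N in the loop above. */
static double getppid_cost_ns(long n)
{
    struct timespec start, end;

    assert(n > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < n; i++)
        syscall(SYS_getppid);   /* raw syscall, no libc caching in the way */
    clock_gettime(CLOCK_MONOTONIC, &end);

    return elapsed_ns(start, end) / (double)n;
}
```

Calling getppid_cost_ns with a few million iterations from main and printing the result gives the per-call figure. The two clock_gettime calls add a little overhead of their own, but spread over millions of iterations it disappears into the noise. Expect the number to move between machines, kernel versions and mitigation settings.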
Why so much? The transition itself has a cost, but the larger hit is what it does to your caches and pipelines. Switching privilege levels disrupts the CPU's branch predictor and pollutes caches with kernel code and data. The mitigations for Spectre and Meltdown, disclosed in 2018, made this meaningfully worse: kernel page-table isolation (KPTI) swaps page tables on the way in and out, and the various barriers stop the CPU speculating across the boundary. The cost isn't just the instructions you can see; it's the speculative work the processor can no longer do.
This is why the advice is always "batch your syscalls", and why the numbers bear it out. Reading a file one byte at a time with a raw read() per byte is catastrophic, not because reading a byte is hard but because you pay 300ns of overhead for each one. Buffered I/O reads a big chunk in one syscall and serves the bytes from userspace. The syscall cost amortises across thousands of bytes and effectively vanishes.
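To put a number on the difference, here is a rough sketch that reads a file with a chosen chunk size and counts how many read() calls that takes. The function name count_reads and its parameters are made up for illustration, and the 64 KiB buffer is an arbitrary size, not a recommendation.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Read the whole file at `path` with read() calls of at most `chunk` bytes.
   Returns the number of read() calls made, including the final one that
   reports end-of-file, or -1 on error. Illustrative helper only. */
static long count_reads(const char *path, size_t chunk)
{
    char buf[65536];
    long calls = 0;
    ssize_t n;

    assert(chunk > 0 && chunk <= sizeof buf);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    do {
        n = read(fd, buf, chunk);   /* one kernel transition per call */
        calls++;
    } while (n > 0);

    close(fd);
    return n < 0 ? -1 : calls;
}
```

On a 64 KiB regular file, a chunk of 1 needs 65,537 calls (one per byte plus the end-of-file read), while a chunk of 65536 normally needs 2. At 300ns each, the first is roughly 20ms of pure transition overhead and the second is well under a microsecond.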
The same logic drives the more modern interfaces. epoll exists so you can wait on thousands of file descriptors with one syscall instead of blocking on each one in turn, and without handing the kernel the whole list again on every call as select and poll require. io_uring, which I've been watching with interest, takes it further by letting you submit and reap whole batches of operations through shared ring buffers, so a busy server can do real work with barely any kernel transitions at all. The point of all of it is the same: the per-call number is small, but it's a tax, and the way to avoid a tax is to trigger it less often.
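As a small illustration of the epoll half of that, the sketch below registers a handful of pipes and then asks, with a single epoll_wait call, how many of them are readable. The function ready_in_one_wait and the MAX_PIPES limit are hypothetical and exist only for this demo; a real server would keep the epoll descriptor around and loop on it.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_PIPES 64   /* arbitrary cap for the demo */

/* Create `npipes` pipes, make the first `nready` of them readable, and
   return how many descriptors one epoll_wait() call reports as ready.
   Returns -1 on error. Illustrative helper only. */
static int ready_in_one_wait(int npipes, int nready)
{
    int fds[MAX_PIPES][2];
    struct epoll_event events[MAX_PIPES];
    int result = -1, made = 0;

    assert(npipes > 0 && npipes <= MAX_PIPES);
    assert(nready >= 0 && nready <= npipes);

    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    /* Registration costs one syscall per descriptor, but only once. */
    for (; made < npipes; made++) {
        if (pipe(fds[made]) < 0)
            goto out;
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[made][0] };
        if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[made][0], &ev) < 0) {
            close(fds[made][0]);
            close(fds[made][1]);
            goto out;
        }
    }

    for (int i = 0; i < nready; i++)
        if (write(fds[i][1], "x", 1) != 1)
            goto out;

    /* One syscall covers every registered descriptor. Timeout 0: don't block. */
    result = epoll_wait(ep, events, MAX_PIPES, 0);

out:
    for (int i = 0; i < made; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    close(ep);
    return result;
}
```

With 8 pipes and 3 of them written to, the single epoll_wait call reports 3 ready descriptors. The wait costs one kernel transition whether 8 descriptors are registered or 8,000.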
The practical takeaway for me was unglamorous. Before reaching for anything clever, the first question on a syscall-heavy workload is simply "how many am I making, and can I make fewer?" strace -c answers the first half in seconds, and more often than not the second half is just a buffer I forgot to add.