System calls on x86 are fast; other architectures behave differently. And the syscall time itself isn't the only thing that matters: a syscall can also potentially yield execution to another task.
I thought they were fast because x86 has multiple register files, enough for kernel space and user space to each have their own, so that entry/exit to system calls doesn't require flushing registers to L1 (in the common case).
If that's true, then a test where a single process spins into and out of a single syscall will have very different performance characteristics from a test with more processes than processor cores, because context switches flush the TLB.
Somebody who knows actual things about x86 and so forth please tell me if I'm spouting 90s-era comp sci architecture textbook stuff that no longer applies.
They're fast because x86 has a decently fast privilege-change mechanism for system calls, and Linux works fairly hard to avoid doing unnecessary work to handle them. In the simplest case, registers are saved, a function is called, registers are restored, and the kernel switches back to user mode.
The asm code is fairly straightforward in Linux these days. I'm proud of it. :)