io_uring, and the joy of not making a syscall per read

I finally sat down with io_uring this week, several years after everyone said I should, and the short version is that the model is genuinely lovely and the operational reality is more interesting than the benchmarks let on.

The thing it fixes is the syscall tax. The old way, with epoll, is to wait for readiness, then make one syscall to actually do the read, another for the write, and so on. Per operation, per connection, that adds up. io_uring flips it: you fill a submission queue with the operations you want, the kernel does them, and you pick the results up off a completion queue. You can batch a pile of work and hand it over in a single io_uring_enter or, with submission-queue polling (IORING_SETUP_SQPOLL), often with no syscall at all. The submission and completion queues are ring buffers in memory shared between you and the kernel, hence the name.
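To make the "ring" part concrete, here is a toy model of one such queue, assuming a single producer and a single consumer in the same thread. It is not the kernel's layout: toy_ring, toy_ring_push and toy_ring_pop are names I made up, and the real rings also need memory barriers because the two sides run concurrently. What it does show is the head/tail trick: free-running 32-bit counters, masked into a power-of-two array.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Hypothetical toy ring, for illustration only. */
struct toy_ring {
    uint32_t head;                      /* next slot the consumer reads */
    uint32_t tail;                      /* next slot the producer writes */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
static bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
    if (r->tail - r->head == RING_ENTRIES)
        return false;
    r->slots[r->tail & RING_MASK] = value;
    r->tail++;                          /* publish only after the slot is written */
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool toy_ring_pop(struct toy_ring *r, uint64_t *value)
{
    if (r->head == r->tail)
        return false;
    *value = r->slots[r->head & RING_MASK];
    r->head++;                          /* hand the slot back to the producer */
    return true;
}
```

Because the counters are unsigned and the size is a power of two, tail - head stays correct even after the counters wrap. In io_uring terms, you are the producer on the submission ring and the consumer on the completion ring, and the kernel plays the opposite role on each.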

The first thing I noticed wasn't speed, it was the shape of the code. With liburing the loop reads naturally: prepare the operations, submit, reap completions, repeat.

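/* Queue one read and tag it with our per-connection state. */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);  /* NULL if the queue is full */
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_sqe_set_data(sqe, conn);
io_uring_submit(&ring);  /* hand everything queued so far to the kernel */

/* Block until a completion arrives, then recover the connection. */
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
struct conn *c = io_uring_cqe_get_data(cqe);
/* cqe->res is the byte count, or -errno */
io_uring_cqe_seen(&ring, cqe);  /* mark the entry consumed so the slot can be reused */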

What I liked is that the same machinery handles disk and network. epoll never really did files; you'd bolt on a thread pool for anything that might block on storage. Here, a buffered read and a socket recv go through the same ring. That uniformity is the bit that feels like progress, more than any throughput figure.

Where the rings live

Now the caveats, because there are always caveats. The interface has moved fast across kernel versions, and the operation you read about in a blog post may not exist on the kernel you actually run in production. I tested on a recent kernel and got a different feature set than the slightly older one on a box I care about. Pin your assumptions to a kernel version, not to a blog.
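One cheap way to pin that assumption is to check the running kernel at startup and refuse to take the io_uring path below a version you have actually tested. The sketch below does that with uname(2). The helper names are mine, and the threshold is whatever your own testing established, not something I'm asserting here. It is also a coarse check: distribution kernels backport features, so asking the kernel which opcodes it supports (liburing has io_uring_get_probe and io_uring_opcode_supported for that) is the more precise tool.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Parse "major.minor" from a release string such as "6.1.0-18-amd64". */
static bool parse_kernel_release(const char *release, int *major, int *minor)
{
    return sscanf(release, "%d.%d", major, minor) == 2;
}

/* True if (major, minor) is at least (want_major, want_minor). */
static bool kernel_at_least(int major, int minor, int want_major, int want_minor)
{
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}

/* Ask the running kernel; false if the release cannot be read or parsed. */
static bool running_kernel_at_least(int want_major, int want_minor)
{
    struct utsname u;
    int major, minor;

    if (uname(&u) != 0 || !parse_kernel_release(u.release, &major, &minor))
        return false;
    return kernel_at_least(major, minor, want_major, want_minor);
}
```

At startup, a call like running_kernel_at_least(6, 1) would gate the io_uring code path and fall back to epoll otherwise. The 6.1 there is only a placeholder for the version you validated.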

The other thing is that io_uring has been a steady source of security advisories, enough that several large operators have simply disabled it via kernel.io_uring_disabled rather than audit every code path. That sysctl exists for a reason. If you're deploying this somewhere serious, that's a conversation to have with whoever owns your threat model before you reach for the performance.
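If you want your program to find out what the operator decided rather than fail mysteriously at setup time, the sysctl can be read from /proc. This is a minimal sketch with function names of my own invention. The meaning of each value is my reading of the kernel's sysctl documentation, and the file only exists on kernels new enough to have the knob, so a missing file tells you nothing either way.

```c
#include <stdio.h>

#define IO_URING_DISABLED_PATH "/proc/sys/kernel/io_uring_disabled"

/* Returns the sysctl value, or -1 if the file is absent or unreadable. */
static int read_io_uring_disabled(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Human-readable meaning of each value, per my reading of the kernel docs. */
static const char *describe_io_uring_disabled(int value)
{
    switch (value) {
    case 0:  return "enabled for all processes";
    case 1:  return "disabled for unprivileged processes outside io_uring_group";
    case 2:  return "disabled for all processes";
    default: return "unknown (sysctl missing or unreadable)";
    }
}
```

Calling describe_io_uring_disabled(read_io_uring_disabled(IO_URING_DISABLED_PATH)) at startup gives you one log line that explains why ring setup is about to fail with EPERM, which beats discovering it from a stack trace.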

So, first impressions: the programming model is the best thing to happen to Linux async I/O in a long while, and I enjoyed writing against it far more than I expected. I'm not rushing it onto anything I'd get paged for, but for a side project where I control the kernel and the blast radius, it's a delight. I'll be back with real numbers once I've stopped admiring the API long enough to measure it.