io_uring, first impressions

I spent the back half of December poking at io_uring, which has been the interesting thing happening in the Linux I/O layer for a little while now and which I had been quietly ignoring because the existing tools mostly worked. That they mostly worked is exactly the problem. "Mostly" is doing a lot of heavy lifting in that sentence, and io_uring is an attempt to remove the asterisks.

The short version of my first impressions: it is genuinely clever, the model is a real departure rather than a polish on the old one, the ergonomics are sharp enough to cut yourself on if you write to the raw interface, and the early numbers are good enough that I understand the excitement. The longer version is below.

why it exists at all

Linux has never had a clean story for asynchronous I/O. The honest history is a bit embarrassing. epoll is excellent for readiness on sockets but it is a readiness model: it tells you a file descriptor is ready, and then you still make the actual read or write syscall yourself, which can still block in awkward cases. The older POSIX AIO and the kernel aio interface were meant to cover real asynchronous file I/O and never really delivered. aio only worked properly for O_DIRECT, had sharp limitations, and almost nobody reached for it happily.

So you ended up with a split brain. Network I/O went through epoll and was reasonably good. File I/O went through a thread pool, because the simplest way to make a blocking read not block your event loop is to do it on a thread you do not care about blocking. That works, but a thread pool for I/O is a workaround wearing a costume, and it costs you context switches, memory, and complexity you would rather not own.

The other quiet cost is syscalls themselves. Every operation is a transition into the kernel and back. At low rates that is free. At hundreds of thousands of operations a second, with the mitigations for the speculative-execution bugs of the last couple of years making those transitions more expensive than they used to be, the syscall overhead stops being a rounding error and starts being a line on the flame graph.

the ring model

io_uring replaces "ask the kernel to do one thing, wait, ask again" with two shared ring buffers that live in memory mapped by both the application and the kernel. There is a submission queue and a completion queue. You write submission queue entries describing the work you want done, you tell the kernel they are there, and the kernel posts completion queue entries when the work is finished. The data structures are shared, so in the good case you are not copying anything across the boundary, you are writing into memory the kernel can already see.

The shape that surprised me most is batching. You can fill the submission ring with many entries and submit them with a single io_uring_enter call. One syscall, many operations. And with submission-queue polling (the IORING_SETUP_SQPOLL setup flag) you can avoid the syscall on submission almost entirely, because a kernel thread watches the ring for you. The whole design is built around amortising or eliminating the boundary crossing that the old model paid on every single operation.

It is also general. It is not a sockets API or a files API, it is an asynchronous syscall API, and the set of operations it supports has been growing steadily across kernel releases. Read, write, accept, connect, send, receive, fsync, and more besides, all expressed as ring entries. That generality is the genuinely new idea. The same mechanism covers the file I/O and network I/O that used to need two different models.

using it without losing a weekend

I will be honest, the raw interface is not friendly. The setup involves mmap-ing the rings, managing the head and tail indices yourself, getting the memory barriers right, and reasoning about submission and completion as separate flows. It is the sort of code where an off-by-one in an index calculation does not crash, it corrupts, which is the worst kind of bug to chase.
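To give a flavour of that bookkeeping, here is a toy single-producer, single-consumer ring in plain C11. It is not the kernel's layout and not the real io_uring interface. The names (toy_ring, toy_ring_push, toy_ring_pop, RING_ENTRIES) are invented for this sketch. It only shows the two habits the raw interface demands: masking indices instead of using them directly, and pairing release stores with acquire loads.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

/* Toy single-producer/single-consumer ring; NOT the kernel's layout. */
struct toy_ring {
    _Atomic uint32_t head;              /* advanced by the consumer */
    _Atomic uint32_t tail;              /* advanced by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
static bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)    /* unsigned wraparound is intended */
        return false;

    r->slots[tail & RING_MASK] = value; /* mask; never index with raw tail */
    /* Release: the slot write must be visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool toy_ring_pop(struct toy_ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;

    *out = r->slots[head & RING_MASK];
    /* Release: finish reading the slot before handing it back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Forget the mask, or publish the tail before the slot is written, and nothing crashes. The consumer just reads a stale or wrong entry, which is the corruption failure mode described above.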

The answer, which I wish I had reached for sooner, is liburing. It wraps the ring bookkeeping in a sane API and lets you think about operations instead of indices.

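#include <liburing.h>

struct io_uring ring;
io_uring_queue_init(QUEUE_DEPTH, &ring, 0);  // returns 0, or a negative errno

// returns NULL if the submission queue is full
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_submit(&ring);  // returns the number of entries submitted, or a negative errno

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
// cqe->res holds the result, or a negative errno
io_uring_cqe_seen(&ring, cqe);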

That is the whole loop, more or less. Get a submission entry, describe the operation, submit, wait for the completion, mark it seen. Everything I wanted to do followed that pattern, and the library kept me well away from the barrier-and-index code that the raw interface would have made me write.

A practical note on actually running this: it moves fast and it is kernel-version-sensitive. Operations and flags that exist in a recent kernel simply are not there in an older one, and "not there" can mean a quiet -EINVAL rather than a clear failure. Check what your running kernel actually supports rather than what the latest documentation describes, because the gap between the two is real and the documentation is ahead of most distributions.

the early numbers

I am not going to publish a benchmark table off a few evenings of fiddling, because a clean benchmark of this is a project in itself and I do not trust my numbers enough to make claims with them. What I will say is that the direction was unmistakable. On a small file-read workload, the version using io_uring with a reasonable queue depth made meaningfully fewer syscalls than the thread-pool version doing the equivalent work, and the syscall count is the thing the design is built to attack. Fewer crossings, less per-operation overhead, and the batching meant the number of syscalls tracked the number of batches rather than tracking the number of reads one for one, as it does with one syscall per read.
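The arithmetic behind that is simple enough to write down. This is a back-of-envelope model, not a measurement. The helper submit_calls is hypothetical. It counts only submission syscalls and assumes every batch is full except possibly the last.

```c
#include <stdint.h>

/* Submission syscalls needed for `ops` operations when up to `batch`
 * entries go in per call. batch == 1 models one syscall per read. */
static uint64_t submit_calls(uint64_t ops, uint64_t batch)
{
    if (batch == 0)
        return 0;                       /* treated as invalid input */
    return (ops + batch - 1) / batch;   /* ceiling division */
}
```

For 100,000 reads, a batch size of 1 gives 100,000 submission calls and a batch size of 32 gives 3,125. The count still grows with the workload, but at one thirty-second of the slope.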

The caveat I keep coming back to: this rewards the right workload. If you are doing a handful of I/O operations a second, none of this matters and the old code is simpler. The win is at high operation rates, where the per-operation overhead is the dominant cost and removing it changes the shape of the graph. Reach for it because you have measured a syscall or context-switch problem, not because it is new and interesting, however much it is both.

where I have landed

io_uring is the first thing in a while that has made me think the Linux I/O story might actually become coherent rather than a collection of special cases. It is early, the interface is unforgiving without liburing, and the version sensitivity means it is not a thing you sprinkle on casually. But the model is right, the trajectory is good, and for the kind of high-throughput service where I/O overhead is the bottleneck it is clearly where things are heading.

I will keep playing with it. For anything in production right now I would want a lot more measurement and a kernel I trust under me first. As a first impression, though: this is the good kind of clever, the kind that removes a class of workaround rather than adding a new one to learn.