io_uring is the async i/o interface linux always needed


Linux has never had a good async I/O story, and everyone who's tried to write one knows it. aio exists, technically, but it only works for unbuffered direct I/O, it blocks in cases it promises not to, and the consensus for years has been to quietly avoid it. So when Jens Axboe started posting the io_uring patches to the list over the past few weeks, I built a kernel off the tree and had a poke, because this is the first design in a long time that looks like it actually solves the problem rather than papering over it.

This is firmly experimental right now. It's a patchset, not something in a release I'd run anywhere serious, and the interface may well shift before it lands properly. So treat all of this as first impressions from someone who built it out of curiosity, not a deployment guide.

the idea: two ring buffers and no syscall per op

The core of it is a pair of shared ring buffers between userspace and the kernel: a submission queue and a completion queue. You write your I/O requests into the submission ring, the kernel picks them up, does the work, and posts results into the completion ring. The buffers are mmap'd and shared, so the common path involves no copying of the request structures across the boundary and, crucially, no syscall per operation.
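To make the flow concrete, here's a toy model of the two rings in plain C. This is not the io_uring ABI: the struct names, fields and functions (toy_sqe, toy_cqe, toy_submit, toy_kernel_drain, toy_reap) are all made up for illustration, the "kernel" is just a function in the same process, and nothing is mmap'd. It only shows the shape: requests go in one ring, results come out the other, and a user_data tag lets you match them up.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ENTRIES 8u                 /* ring size, hypothetical */
#define TOY_MASK    (TOY_ENTRIES - 1u) /* index mask, needs a power-of-two size */
static_assert((TOY_ENTRIES & TOY_MASK) == 0, "TOY_ENTRIES must be a power of two");

/* Hypothetical request and result records, not the real kernel structs. */
struct toy_sqe { int fd; uint32_t len; uint64_t user_data; };
struct toy_cqe { uint64_t user_data; int32_t res; };

struct toy_ring {
    struct toy_sqe sq[TOY_ENTRIES];
    struct toy_cqe cq[TOY_ENTRIES];
    uint32_t sq_head, sq_tail; /* consumer advances head, producer advances tail */
    uint32_t cq_head, cq_tail;
};

/* Userspace side: queue one request. Returns 0 on success, -1 if the ring is full. */
int toy_submit(struct toy_ring *r, int fd, uint32_t len, uint64_t user_data)
{
    if (r->sq_tail - r->sq_head == TOY_ENTRIES)
        return -1;
    struct toy_sqe *e = &r->sq[r->sq_tail & TOY_MASK];
    e->fd = fd;
    e->len = len;
    e->user_data = user_data;
    r->sq_tail++;
    return 0;
}

/* Stand-in for the kernel: consume queued requests and post one completion each.
 * Pretends every read returns the full requested length. Returns the count handled. */
unsigned toy_kernel_drain(struct toy_ring *r)
{
    unsigned done = 0;
    while (r->sq_head != r->sq_tail && r->cq_tail - r->cq_head != TOY_ENTRIES) {
        const struct toy_sqe *e = &r->sq[r->sq_head & TOY_MASK];
        struct toy_cqe *c = &r->cq[r->cq_tail & TOY_MASK];
        c->user_data = e->user_data;
        c->res = (int32_t)e->len;
        r->sq_head++;
        r->cq_tail++;
        done++;
    }
    return done;
}

/* Userspace side: pop one completion. Returns 1 if one was copied to *out, 0 if empty. */
int toy_reap(struct toy_ring *r, struct toy_cqe *out)
{
    if (r->cq_head == r->cq_tail)
        return 0;
    *out = r->cq[r->cq_head & TOY_MASK];
    r->cq_head++;
    return 1;
}
```

Starting from a zero-initialised struct toy_ring, you'd call toy_submit a few times, toy_kernel_drain once, then toy_reap until it returns 0. The indices are free-running counters masked on access, which is why the size has to be a power of two.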

That last bit is the whole point. The old model is a syscall per I/O, and in a post-Spectre, post-Meltdown world the cost of a syscall has gone up noticeably thanks to all the mitigation work. io_uring lets you batch: queue up a stack of reads, then make one io_uring_enter call to submit them all. With the polled submission mode you can even avoid that syscall in the steady state: a kernel thread polls the submission ring and picks work up on its own.
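The arithmetic behind the batching is trivial, but it's worth writing down. A rough sketch, assuming completions are read straight off the shared ring without extra calls; the function names are mine, not anything from the patchset:

```c
#include <assert.h>

/* Kernel transitions for the old model: one read() per operation. */
unsigned long syscalls_per_op(unsigned long n_ops)
{
    return n_ops;
}

/* Submit calls needed when up to `batch` requests are queued per call:
 * ceil(n_ops / batch), done in integer arithmetic. */
unsigned long syscalls_batched(unsigned long n_ops, unsigned long batch)
{
    assert(batch > 0);
    return (n_ops + batch - 1) / batch;
}
```

So 1024 reads in batches of 32 is 32 submit calls rather than 1024 read() calls. This counts transitions only; it says nothing about how long each one takes.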

it does buffered I/O, which aio never did

The thing that immediately sold me over aio is that io_uring handles buffered I/O properly. That was always the killer limitation of the old interface: if your file wasn't opened with O_DIRECT and the I/O went through the page cache, the submit call would happily do the work synchronously and block, defeating the entire purpose. io_uring does async buffered reads and writes without that lie, which means it's actually usable for the kind of workloads most people have, not just the niche of O_DIRECT database storage engines.

a first measurement

I wrote a small program against the raw interface to read a pile of files, batching submissions in groups rather than one syscall at a time. Even on this rough early tree the batching showed up clearly: submitting many requests per enter call cut the syscall overhead down to almost nothing compared to a read()-per-file loop doing the same work. I'm not going to quote a precise number because the tree is moving and my benchmark is scrappy, but the shape was exactly what the design promises. Fewer transitions, more work per transition.

The raw interface is fiddly, mind. You're managing ring indices and memory barriers by hand, and it's the sort of code that's easy to get subtly wrong. Axboe has mentioned a liburing helper library to wrap the sharp edges, which is the right call, because almost nobody should be poking at the rings directly.
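To give a flavour of the index-and-barrier bookkeeping, here's a minimal single-producer, single-consumer ring using C11 atomics. It's a generic sketch of the technique, not code from the patchset or from liburing: spsc_ring, spsc_push and spsc_pop are names I've invented, and I'm assuming acquire/release ordering is a fair stand-in for the barriers the real rings need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SLOTS 4u /* hypothetical ring size */
static_assert((SLOTS & (SLOTS - 1u)) == 0, "SLOTS must be a power of two");

struct spsc_ring {
    uint64_t slot[SLOTS];
    _Atomic uint32_t head; /* advanced only by the consumer */
    _Atomic uint32_t tail; /* advanced only by the producer */
};

/* Producer side. Returns 1 if v was queued, 0 if the ring is full. */
int spsc_push(struct spsc_ring *r, uint64_t v)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* Acquire pairs with the consumer's release store to head, so a slot
     * is not overwritten before the consumer has finished reading it. */
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == SLOTS)
        return 0;
    r->slot[tail & (SLOTS - 1u)] = v;
    /* Release: the slot write must be visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return 1;
}

/* Consumer side. Returns 1 and fills *out if an entry was available, else 0. */
int spsc_pop(struct spsc_ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    /* Acquire pairs with the producer's release store to tail, so the
     * slot contents are visible once the new tail has been observed. */
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return 0;
    *out = r->slot[head & (SLOTS - 1u)];
    /* Release: the slot read is complete before the slot is handed back. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return 1;
}
```

The subtle bugs live in the ordering. Publish the tail before the slot contents are written, or with a plain store, and the other side can read garbage, usually only under load and only on some machines.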

why I'm paying attention

A few reasons this feels like it matters more than the usual kernel churn:

It's general. The design isn't limited to block I/O: it extends to other operation types, and you can see the ambition for it to become the async interface rather than another special case.
It's honest about buffered I/O, which is what actually killed aio for real workloads.
The batching model fits the hardware reality we're in, where syscalls are expensive and NVMe is fast enough that the software overhead is now the bottleneck.

It's early. I wouldn't build anything on it yet, and the interface I built today might not be the one that ships. But it's the first time in years I've looked at a new Linux I/O interface and thought "yes, that's the right shape". I'll be watching which kernel it lands in, and I suspect a good few storage and networking projects will be reaching for it the moment it's stable.