Previous discussion https://news.ycombinator.com/item?id=23132549
In practice io_uring can be used in many different ways, and it can be challenging to find the most efficient one.
also https://www.phoronix.com/news/Linux-6.0-IO-Block-IO_uring
edit: on the other hand, a good reason to disable uring in containers is that it's infested with vulnerabilities. It's new, complex, and does a whole lot of things - all of which make serious security bugs there quite common right now.
Now that I think about it, how does io_uring interact with landlock?
Current io_uring is not particularly prone to vulnerabilities. The original version of it had a design that often led to them (a kernel thread doing operations on behalf of the process and not always remembering to set the appropriate privileges), but it no longer uses that design, and the current design is much more resilient. Unfortunately, the original design led to a reputation that it's still trying to shake.
The tech industry: launch early! Develop in public! Many eyes make all bugs shallow!
Also the tech industry: we will never forgive you for that one segfault you had ten years ago.
How do the exploits for io_uring compare to the exploits for the rest of the kernel?
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=%22linux%20...
20 CVEs in 2024. Yes, some of them are probably not (exploitable) vulnerabilities, because the Linux CNA is being difficult. But many of them are; just Ctrl+F "privilege".
edit: it does seem it is disabled there now: https://github.com/containerd/containerd/pull/9320 (thanks to sibling comment for an adjacent link)
Unfortunately decided it's not worth it.
For other uses, uring has a "restriction" mechanism that does part of what you want. See `IORING_REGISTER_RESTRICTIONS` in the io_uring_register(2) documentation. Any process that's setting up its own seccomp restrictions can also set up a uring with restrictions, limiting the opcodes it can use.
That said, that mechanism would benefit from a way to apply such restrictions to a process that isn't doing the setup itself, such as when setting up seccomp restrictions on a container or daemon. For instance, a way to set restrictions on all rings created by child processes, or a way for seccomp to enforce that any uring created has restrictions applied to it.
SELinux or your favorite MAC is there to solve this exact problem.
Do you have a reference for this? What is the anticipated timeframe?
I don't know when it'll be merged, but it seems like it's getting close to ready.
I was mainly trying to use ublk to implement a sort of FUSE-like thing, with the kernel handling the fs and thus providing inotify support.
Now, the FUSE daemon could generate the event, but correctly generating events (especially handling edge cases) is difficult.
Good. That's a forcing function for making io_uring work in your environment.
> bypasses seccomp
Seccomp sucks.
We shouldn't be enforcing security by filtering system calls, the set of which will grow forever, but instead by describing access control rules on objects, e.g. with SELinux. If your security policy is that your sandbox should be able to read from some file but not write to it, you should do that with real MAC, which applies to all operations, io_uring included. You shouldn't just filter read(2) and write(2) in particular.
We shouldn't hold back evolution in systems interfaces because some people are stuck on bad ways of doing things and won't move.
Another one is that I could not find a benchmark with io_uring, which would confirm the benefit of moving away from epoll.
Certainly not; it's likely to make it run faster, since you can use the elevator algorithm more efficiently instead of seeking back and forth between the files. You can easily measure this yourself by comparing wcp, which uses io_uring, and GNU cp (remember to empty the cache between each run).
One of the advantages of io_uring, unrelated to performance, is that it supports non-blocking operations on blocking file descriptors.
Using io_uring is the only method I recall to bypass https://gitlab.freedesktop.org/wayland/wayland/-/issues/296. This issue deals with having to operate on untrusted file descriptors where the blocking/non-blocking state of the file descriptions might be manipulated by an adversary at any time.
For sockets, `MSG_DONTWAIT` works with both `recv` and `send`.
For pipes you should be able to do this with `SPLICE_F_NONBLOCK` and the `splice` family, but there are weird restrictions for those.
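To illustrate the socket case: a minimal sketch, assuming Linux and a Unix-domain `socketpair`, of a per-call non-blocking `recv` on a socket that stays in blocking mode. The helper name `try_recv` is made up for this example.

```python
import socket

def try_recv(sock, nbytes=4096):
    """Try one recv without blocking, regardless of the socket's own mode.

    Returns the received bytes, or None if nothing is available right now.
    """
    try:
        # MSG_DONTWAIT applies to this call only; the O_NONBLOCK state of
        # the underlying file description is neither read nor changed.
        return sock.recv(nbytes, socket.MSG_DONTWAIT)
    except BlockingIOError:
        return None

a, b = socket.socketpair()
still_blocking = a.getblocking()   # the socket itself is in blocking mode

empty = try_recv(a)                # nothing sent yet, so this returns None
b.send(b"ping")
data = try_recv(a)                 # now the data is available

a.close()
b.close()
```

Because the flag is passed per call, it does not matter if someone else flips the blocking state of the shared file description in between.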
You can use io_uring together with epoll and an eventfd: monitoring the eventfd lets you wake up a thread that is sleeping while it waits for io_uring completions.
I have implemented a barrier and thread-safe techniques that I am trying to turn into a command line tool.
My goal is that thread-safe, performant servers are easy to write.
I am using bloom filters for fast set intersection. I intend to use SIMD instructions with the bloom hashes.
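The comment does not show its implementation, so the following is only a sketch of one common way to use a bloom filter for set intersection: build a filter over one set, use it to cheaply discard elements of the other, then confirm the survivors exactly. The `Bloom` class, its sizes, and `intersect` are all hypothetical names and parameters, and no SIMD is involved.

```python
import hashlib

class Bloom:
    """Tiny bloom filter over strings; a Python int serves as the bit array."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0

    def _positions(self, item):
        # Derive nhashes bit positions by salting BLAKE2b differently each time.
        for i in range(self.nhashes):
            digest = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(digest[:8], "little") % self.nbits

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> p) & 1 for p in self._positions(item))

def intersect(small, large):
    """Intersect two collections, using a bloom filter as a prefilter."""
    bloom = Bloom()
    for x in large:
        bloom.add(x)
    # Cheap pass: drop everything the filter rules out.
    candidates = [x for x in small if bloom.might_contain(x)]
    # Exact pass: remove any false positives among the survivors.
    large_set = set(large)
    return {x for x in candidates if x in large_set}

common = intersect(["a", "b", "c"], ["b", "c", "d"])
```

The prefilter only pays off when the exact check is expensive (for example, a lookup on another machine or on disk); here it is a plain `set` purely to keep the sketch self-contained.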
Issues with io_uring security mostly stemmed from an old architecture and just the fact that there's a ton of surface area.
There's nothing wrong with the general concept.
io_uring is such a tremendous improvement over epoll, in both speed and user experience. With sqpoll, vectored ops and proper batching you can get some crazy speed. I am definitely looking forward to seeing some of these seccomp and privilege issues getting fixed and getting container support in the future.