(Trick question! A computer has for a long time been a distributed system, we just don't like to worry about that usually.)
I still have some of the Sun manuals with that sentence.
It is kind of ironic how many decades we have been doing distributed systems, and now everyone talks about microservices as if they had rediscovered gunpowder.
Also worth mentioning Zig has a really great io_uring library[2] right in the stdlib, which without looking up the source code I am guessing is what bun uses under the hood:
[1] https://nodejs.org/api/fs.html#promises-api
[2] https://ziglang.org/documentation/0.13.0/std/#std.os.linux.I... - (give it a minute to load, the zig people are great system programmers but webdevs they are not lol)
But that structure/API is very similar to many similar patterns in computing. Look at Smalltalk or OOP to some degree. Maybe extend it with operational transforms.
I think there are a lot of interesting alternative ways to look at operating systems and many developments over the years. Such as Plan 9, MirageOS, and several other projects.
Also just to clarify I love cats and need a better metaphor.
So now we can study Actor problems and apply the lessons to io_uring to avoid pitfalls earlier.
Also, "the mail system" (queues, message dispatch) is not part of the definition of the Actor Model. It's an implementation detail.
But it's entirely up to the actor if it wants to or not. I could imagine an actor supervision-hierarchy with directory-actors that have child file-actors and other child directory-actors. The file-actors could just be leaves in the hierarchy.
Each write operation on a file would be guaranteed to be single threaded via the file-actor. But a file-actor could also launch ephemeral child-actors that do read-only processing on a snapshot of the file. So, parallel processing would be possible for read-only operations.
A file-system does fit quite elegantly into the actor model imho. Whether it would be efficient or not, who knows, but at least on the surface it fits.
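To make the file-actor idea above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `FileActor` class, the `append`/`snapshot`/`stop` message names, and the use of one thread plus a queue as the mailbox (which, per the point above, is an implementation detail and not part of the model). It only shows that writes are serialized through one owner while readers work on immutable snapshots.

```python
import queue
import threading


class FileActor:
    """Hypothetical file-actor: a single thread owns the contents and applies writes in order."""

    def __init__(self, initial: bytes = b""):
        self._data = bytearray(initial)   # only ever touched by the actor thread
        self._mailbox = queue.Queue()     # implementation detail, not part of the model
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Handle one message at a time, so writes never race each other.
        while True:
            op, arg, reply = self._mailbox.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "append":
                self._data.extend(arg)
                reply.put(len(self._data))
            elif op == "snapshot":
                # Immutable copy: readers can process it in parallel
                # without ever seeing later writes.
                reply.put(bytes(self._data))

    def send(self, op, arg=None):
        """Enqueue a message and return a one-slot queue that will hold the reply."""
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((op, arg, reply))
        return reply


actor = FileActor(b"hello")
actor.send("append", b" world").get()
snapshot = actor.send("snapshot").get()   # safe to hand to any number of reader threads
actor.send("stop").get()
```

The ephemeral read-only child-actors would simply be workers that each receive a `snapshot` value; since it is a `bytes` object, nothing they do can affect the file-actor's state.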
That is, asynchronous I/O is 20 years older than the Unix system call interface this article speculates it should replace.
Of course, context switching between different tasks is not free, and event loops have frequently been able to provide higher efficiency. The equilibrium has rocked back and forth as I/O has gotten faster and slower relative to task context switching. CICS, select(), poll(), Oberon, the Macintosh system, Win16, the JavaScript event loop, Tcl/Tk, Win32 IOCP, Symbian active objects, kqueue, epoll, and io_uring are some of the results.
But don't try to sell asynchronous I/O as a "game-changing" paradigm shift. It's a different programming model that's harder to program but can provide higher performance, just like it has been for 70 years.
If you're shopping for a paradigm shift that can improve this tradeoff, there are several candidates. Erlang-style lightweight processes, software transactional memory, and JS-style promises (originally from E, which was inspired by KeyKOS, which had event loops but no promises or asynchronous I/O) come to mind.
The hardware development that might be actually new has arguably already failed in the market: Intel/Micron's "Optane" memory and Flash-based NVDIMMs. Flash has big disk energy, but is fast enough that copying it to RAM one word at a time like disk will probably bottleneck your performance by an order of magnitude. io_uring doesn't fix this. We need interfaces designed for zero-copy access to bulk persistent data. Maybe something like Multics or mmap(). LMDB and FlatBuffers suggest that the potential for improved performance is significant. Could such high-bandwidth, low-cost memory keep up with LLM inference, allowing you to do inference on a 256-gigabyte model with 256 gigabytes of Flash but a much smaller amount of RAM?
2. Message passing isn't new, it is at least 50 years old even if you don't go further back than early Smalltalk and the Actor model.
3. The article wasn't "selling" anything, and certainly nothing new; see (2). It was noting convergence from several directions on an old and somewhat misunderstood paradigm.
Instead, he jumps into salesman hypester mode with "The game has changed," which, like, gag. A lot of games are having their rules rewritten right now (tank/drone warfare, the energy market, freedom of expression, international finance, and artificial intelligence come to mind) but asynchronous I/O is not one of them. (Except, maybe, in the non-io_uring-related way I suggested—the advent of much-higher-bandwidth access to large-capacity storage devices than I/O buses can handle—to which message-passing is even less applicable.)
It's perfectly fair to describe actors as "somewhat misunderstood" because the ways Hewitt himself understood it over the 50 years he developed the idea frequently contradict one another. At the end of his life he spent several years writing https://arxiv.org/abs/0904.3036v12, which describes his conceptualization of it at that time, which was very different from the early versions, though I think he would deny that. The versions on the arXiv only go back to 02009, but I am pretty sure the draft paper he showed me when I met him a few years before that was an earlier draft of the same thing.
The existing "asynchronous" I/O mechanisms that I am aware of all use a procedural interface...which doesn't really work.
The first kind of procedural interface is what I would call "the simulation of synchronous I/O". So basic OS behavior that goes outside the procedural model supported in the programming language by suspending your process and going off to do something else while the I/O completes.
This has various problems, mainly the one of suspending your process, but it is nice and simple from the perspective of a program in the call/return architectural style because it never has to see anything outside its understanding of the world.
The attractive convenience and the intrinsic problems have led us to reproduce this mechanism at the process level, the kernel-thread level, the user-thread level and most recently the async/await level. The fundamental flaw remains: we are simulating a synchronous procedural interface on top of something that is very different[1]
Callback hell is another way of mapping asynchrony to synchronous procedural interfaces, but well...yikes. NT completion ports and the like are as well, and let's agree not to talk about aio(4).
Asynchronous messaging such as that in io_uring is a different way of interfacing with asynchronous I/O, just like Erlang messages are different from synchronous procedure calls and synchronous RPCs. Instead of all communication being encoded in individual procedures, it is encoded in reified messages (the io_uring_sqe struct). These have an "opcode", the message name, and parameters.
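A toy model of that shape, in Python, may help. This is not the kernel interface and not liburing: `Request`, `Completion`, `ToyRing`, and the in-memory `_store` are all hypothetical stand-ins. The only things borrowed from the real structs are the ideas of an opcode, parameters, and a caller-chosen `user_data` tag that comes back with the completion.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    READ = "read"
    WRITE = "write"


@dataclass
class Request:
    """Plays the role of a submission-queue entry: a message with a name and parameters."""
    opcode: Op
    key: str              # stand-in for a file descriptor
    data: bytes = b""
    user_data: int = 0    # caller-chosen tag, echoed back in the completion


@dataclass
class Completion:
    """Plays the role of a completion-queue entry."""
    user_data: int
    result: object


class ToyRing:
    def __init__(self):
        self.sq = deque()   # submission queue
        self.cq = deque()   # completion queue
        self._store = {}    # hypothetical in-memory "files"

    def submit(self, request: Request) -> None:
        # Submitting is just enqueueing a message; nothing runs yet.
        self.sq.append(request)

    def process(self) -> None:
        """Stand-in for the kernel draining the submission queue."""
        while self.sq:
            req = self.sq.popleft()
            if req.opcode is Op.WRITE:
                self._store[req.key] = req.data
                result = len(req.data)
            else:
                result = self._store.get(req.key, b"")
            self.cq.append(Completion(req.user_data, result))
```

The caller never invokes a "write procedure" or a "read procedure"; it builds messages, enqueues them, and later matches completions to requests by `user_data`. This toy completes requests in submission order, which the real interface does not promise.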
Now you can ignore the completely different interface that is the point and instead focus on the underlying asynchronous I/O operations, but that is, well, missing the point.
I have built more asynchronous/message-oriented I/O APIs in userspace with and for Objective-S[2], and am personally very interested in how these could map to the io_uring kernel interface. I certainly agree with the poster's point that this is fundamentally different from what has come before. And again: the messaging interface to (inherently asynchronous) I/O, not the fact that there is some (procedural) mechanism for asynchronous I/O.
[1] https://2020.programming-conference.org/details/salon-2020-p...
NT IOCP are true async as all I/O in the NT kernel is asynchronous. It was a design principle.
NT also has I/O Rings, based on io_uring.
If you don't find my perspective of interest, you of course have no obligation to read it, but if I didn't think something like my comments were worth reading, I wouldn't have written them. It's true that they're longer than your post—but that's because your post is unfinished!
Sometimes my writing falls short of the mark, but in this case, other people seem to have found my comment worthwhile, and I think that it turned out to be much higher quality than I had hoped. This time I think you're missing out by skimming.
I cut my teeth on OS/2 in the early 90s, where using threads and processes to handle concurrent tasks was the recommended programming model. It was well-supported by the OS, with a comprehensive API for process/thread creation, deletion and inter-task communication. It was a very clear mental model: put each sequential sequence of operations in its own process/thread, and let the operating system deal with scheduling - including pausing tasks that were blocked on I/O.
My next encounter was Windows 3, with its event loop and cooperative multi-tasking. Whilst the new model was interesting, I was perplexed by needing to interleave my domain code with manual decisions on scheduling. It felt haphazard and unsatisfactory that the OS didn't handle scheduling for me. It made me appreciate more the benefits of OS-provided pre-emptive multi-tasking.
The contrast in models was stark. It seemed obvious that pre-emptive multi-tasking was better. And so it proved: NT bestowed it on Windows, and NeXT did the same for Mac.
Which brings us to today. I feel like I'm going through Groundhog Day with the renaissance of cooperative multi-tasking: promises, async/await and such. There's another topic today [0] that illustrates the challenges of attempting to perform actions concurrently in JavaScript. It brought back all the perplexity and haphazard scheduling decisions from my Windows 3 days.
As you note:
> Of course, context switching between different tasks is not free, and event loops have frequently been able to provide higher efficiency.
This is indeed true: having an OS or language runtime manage scheduling does incur an overhead. And, indeed, there are benchmarks [1] that can be interpreted as illustrating the performance benefits of cooperative over pre-emptive multitasking.
That may be true in isolation, but it inevitably places the scheduling burden back on the application developer. Concurrent sequences of application domain operations - with the OS/runtime scheduling them - seem like a better division of responsibility.
[0]: https://news.ycombinator.com/item?id=42592224
[1]: https://hez2010.github.io/async-runtimes-benchmarks-2024/tak...
Somehow we collectively took all the incredible experience with cooperative multitasking gathered over literally decades prior to Node and just chucked it in the trash can and had to start over at Day Zero re-learning how to use it.
This is particularly pernicious because the major issue with async is that it scales more poorly than threads, due to the increasing design complexity and the ever-increasing chances that the various implicit requirements that each async task has for the behavior of other tasks in the system will conflict with each other. You have to build systems of a certain size before it reveals its true colors. By then it's too late to change those systems.
The mistake most people are making these days is mixing paradigms within the same thread of execution, sprinkling async throughout explicitly or implicitly synchronous architectures. There are deep architectural conflicts between synchronous and asynchronous designs, and trying to use both at the same time in the same thread is a recipe for complicated code that never quite works right.
If you are going to use async, you have to commit to it with everything that entails if you want it to work well, but most developers don't want to do that.
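A small Python asyncio sketch of the conflict: one task does a stretch of synchronous work, and whether any other task gets to run during that stretch depends entirely on whether the worker volunteers to yield. The function names and the `sum(range(1000))` stand-in for work are made up for illustration; the point is only the implicit requirement each task places on the others.

```python
import asyncio


async def ticker(log, stop):
    """A well-behaved task: yields to the event loop after every tick."""
    while not stop.is_set():
        log.append("tick")
        await asyncio.sleep(0)


async def worker(log, steps, cooperative):
    log.append("work-start")
    for _ in range(steps):
        sum(range(1000))               # stand-in for synchronous work
        if cooperative:
            await asyncio.sleep(0)     # hand control back to the event loop
    log.append("work-end")


async def ticks_during_work(cooperative):
    """Return how many times the ticker ran while the worker was busy."""
    log = []
    stop = asyncio.Event()
    ticker_task = asyncio.create_task(ticker(log, stop))
    await worker(log, steps=5, cooperative=cooperative)
    stop.set()
    await ticker_task
    start, end = log.index("work-start"), log.index("work-end")
    return log[start + 1:end].count("tick")


blocked = asyncio.run(ticks_during_work(cooperative=False))
interleaved = asyncio.run(ticks_during_work(cooperative=True))
```

With the synchronous-style worker, `blocked` is zero: the ticker is starved for the whole stretch, and nothing in the worker's signature warns you. With a preemptive scheduler the ticker would have run regardless.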
The systems that can potentially benefit from async/await are a tiny subset of what we build. The rest just don't even have the problem that async/await purports to solve, never mind if it actually manages to solve it.
To this day, it still seems to have had a much better approach to component development and related tooling than even WinRT, the COM reboot, offers.
The elephant in the room is that using high-bandwidth storage well with minimal RAM requires different and much better scheduler design than currently exists in the vast majority of systems, including every OS I am familiar with. The higher the bandwidth and the larger the storage, the better your scheduling needs to be at latency hiding using techniques that don't involve a large cache. A lot of hardware tech, like Optane or HPC fabrics, was invented to avoid having to address schedulers being poor at latency hiding, which is essentially a (non-trivial) software problem.
Most users of modern fast asynchronous I/O still tend to delegate scheduling, treating the I/O schedule and execution schedule as separate concerns, even when done entirely in user space. It is a missed opportunity.
You do raise a valid issue: this requires a much more sophisticated design and implementation than most developers are comfortable with, which creates a lot of inertia behind doing things the classic way. This could all be abstracted away from the average developer in principle, something similar to a database kernel, but no one has built one yet.
My perhaps naïve thoughts on "extremely high bandwidth" are that the bus from the NAND Flash matrix to the Flash chip's internal RAM buffer is typically something like 4 kibibytes wide, and maybe you can read a page of the Flash into that buffer in hundreds of nanoseconds, though the chips I've looked into take tens of microseconds. (And if you can cut the latency down that much, the scheduler problems get much easier, too.) If those buffers are then accessible over the CPU's memory bus, maybe you can usefully transfer pages from the Flash into the buffers at many times the bandwidth of the CPU's memory bus, as long as you're only reading a small part of each page. As I understand it, current SSDs mostly only approach the bandwidth of SDRAM if you're reading entire pages.
Back of the envelope: if you have 32 Flash chips in your system, and each one of them can read a 4096-byte page from its NAND matrix into a chip-internal RAM buffer every 100 ns, you'd have 1.3 terabytes per second of such bandwidth. However, this illustrates how demanding this kind of access would be to the hardware; basically it demands Flash as low-latency as SDRAM.
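Spelling out that arithmetic in Python, using only the numbers from the paragraph above (32 chips, 4096-byte pages, one page read per chip every 100 ns); the variable names are mine:

```python
chips = 32
page_bytes = 4096
reads_per_second_per_chip = 10_000_000   # one page read every 100 ns

bytes_per_second = chips * page_bytes * reads_per_second_per_chip
terabytes_per_second = bytes_per_second / 1e12   # decimal terabytes, about 1.31
```

That is 131,072 bytes landing in chip-internal buffers every 100 ns across the system.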
You can definitely mmap() parts of devices or files that are larger than your CPU's virtual memory. mmap() does not require you to map the entire device or file.
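For example, a sketch in Python that maps only a window of a file instead of the whole thing. `read_window` is a hypothetical helper; the one real constraint it works around is that the mapping offset has to be a multiple of `mmap.ALLOCATIONGRANULARITY`, so it rounds the offset down and skips the extra leading bytes.

```python
import mmap


def read_window(path, offset, length):
    """Map only the part of the file covering [offset, offset + length) and return those bytes."""
    granularity = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // granularity) * granularity   # mapping offsets must be aligned
    delta = offset - aligned                          # bytes to skip inside the mapping
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=aligned) as window:
            return window[delta:delta + length]
```

The same approach works for a device or file far larger than the address space: you slide the window, remapping as you go, at the cost of doing that bookkeeping yourself. Note that the requested range has to lie within the file, or the mapping call raises an error.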
I agree that you need some abstraction layer that provides a simple, reliable interface with adequate performance.
Latency is always going to have a speed-of-light issue, and storage is being moved physically further from the CPU with time; latency reductions in silicon are re-added by distance. Flash gets around that with extreme parallelism, which implies very deep pipelining from the execution scheduling side to fill all of the I/O slots necessary to saturate the bandwidth.
This in turn creates a raft of second-order design problems on systems with extremely large and extremely parallel storage. Total memory requirements for just the execution state being scheduled, the data structures that provide data selectivity (so you can choose the optimal I/O to schedule), and landing buffers for parallel inflight I/O can easily overflow available RAM on large servers in plausible environments. And databases normally keep much more than just that in RAM. It is an interesting open architecture problem that has never been considered. You can't trivially patch existing architectures to make it work, it would need to look very different.
We've taken it as axiomatic that certain data structures will always fit in RAM when dealing with large data, but extremely parallel, fast, and large storage is exposing that assumption in interesting ways. An architecture that effectively decouples memory requirements from large, fast storage would radically change how we design data intensive software, since there are a lot of design idioms and limitations today that primarily reflect RAM scaling issues.
That's just a scheduler though, and not necessarily an actor-oriented one. Multitasking doesn't imply communication between tasks, certainly not actor-oriented bidirectional message passing.
Not helpful.
I found your other comment above, the one mentioning Smalltalk, very interesting, and will reply to it later after thinking more about it.
1. You:
> I don't think it's accurate to describe this article as being about actors
So the article says it is about actors. It says it is about messaging, it certainly is about asynchronous messaging, and you even agree that io_uring is an asynchronous message queue.
In what way is the article not about a connection between actors and io_uring?
This is what I mean when I write that you are changing what the article is about. It is about this: asynchronous messaging/actors.
It may not go into a lot of depth about that connection, but it clearly is about it. And it may be wrong to focus on this. It may be wrong in how it describes it. But you cannot claim that it is about something else.
2. You:
> it's not about schedulers
Yes. And? Why does an article showing the connection between a general concept of asynchronous messaging (actors) and a specific instance of asynchronous messaging (io_uring) have to be "about" schedulers?
Please don't answer, it is a rhetorical question.
3. You:
> but it's about asynchronous I/O
This is where you actually do the change. No: it is not about asynchronous I/O (in general). It is about an asynchronous messaging interface to I/O. Not the same thing. At all.
Once again, maybe you think it should be about this topic instead. And maybe you are even right that it should be about this (I don't think that's the case). But even if you were right that it should be about this other topic, you are not free to claim that it is about this other topic, when it clearly is not.
4. You:
> Asynchronous I/O completion notification was a huge innovation, ...
> But don't try to sell asynchronous I/O as a "game-changing" paradigm shift...
That's where you criticize the article for the thing you made up that it should be about, but is not. The article is not even about asynchronous I/O in general at all, never mind trying to sell asynchronous I/O as anything. It is talking about messaging, the fact that you can regard io_uring_sqe as a message and the submission and completion queues as message queues. Yielding something that's roughly equivalent to (some version of) the Actor model.
The telephone and electrical power networks were vast in scope (and still are), enabling interstate communication and power utilities. Echoes of the transportation utilities enabled through railroads. Multics was architected partially with the commercial goal of scaling up with users, a computing utility. But in a time with especially expensive memory, a large always resident kernel was a lot of overhead. The hardware needed a lot of memory and would be contending with some communication network whose latency could not be specified at the OS design time. Ergo, asynchronous I/O was key.
Put differently, Multics bet that computing hardware would continue to be expensive enough to be centralized, thereby requiring a CPU to contend with time-sharing across various communication channels. The CPU would be used for compute and scheduling.
Unix relaxed the hardware requirements significantly at the cost of programmer complexity. This coincided roughly with lower hardware costs, favoring compute (in broad strokes) over scheduling duties. The OS should get out of the way as much as possible.
After a bunch of failed grand hardware experiments in the 1980s, the ascendant Intel rose with a dominant but relatively straightforward CPU design. Designs like the Connection Machine were distilled into Out of Order Execution, a runtime system that could extract parallelism while contending with variable latency induced by the memory subsystem and variable instruction ordering. Limited asynchronous execution mostly hidden away from the programmer until more recently with HeartBleed.
Modern SoCs encompass many small cores, each running a process or maybe an RTOS, along with multiple CPU cores, many GPU cores, SIMD engines, signal processing engines, NPU cores, storage engines, etc. A special compute engine for all seasons, ready to be configured and scheduled by the CPU OS, but whose asynchronous nature (a scheduling construct!) is no longer hidden from the programmer.
I think the article reflects how even on a single computer, the duty of the CPU (and therefore OS) has tilted in some cases towards scheduling over compute for the CPU. And of course, this is without considering yet cloud providers, the spiritual realization of a centralized computing utility.
(One quibble: I think when you said "HeartBleed" you meant Meltdown and Spectre.)
I think there have always been significant workloads that mostly came down to routing data between peripherals, lightly processed if at all. Linux's first great success domains in the 90s were basically routing packets into PPP over banks of modems and running Apache to copy data between a disk and a network card. I don't think that's either novel or an especially actors-related thing.
To the extent that a computational workload doesn't have a major "scheduling" aspect, it might be a good candidate for taking it off the CPU and putting it into some kind of dedicated logic—either an ASIC or an FPGA. This was harder when the PDP-11 and 6502 were new, but now we're in the age of dark silicon, FPGAs, and trillion-transistor chips.
Much more of a slowdown than that. Mmap a big file on an SSD, access its bytes randomly, and you can get your 500MB/s read speed down to kilobytes/s.
> We need interfaces designed for zero-copy access to bulk persistent data.
This is in tension with caching. For it to work, you'd need to get system builders to stop bundling smaller&faster storage with their slower&larger storage.
I think the article fell off here. Linux has a profoundly synchronous I/O system. The I/O U-Ring works there by dispatching requests to a kernel-maintained pool of worker threads. There the submitted requests are run synchronously.
So the I/O U-Ring is in fact the abstraction, "a feeble attempt to impose your [new] mental model onto an [old] reality", as the author might put it. The actual Linux I/O system is most trivially exposed to userland by the same old "1970s Unix" means of the traditional system calls.
The real change with the I/O U-Ring is that it offers a way to submit work in bulk and finally introduces to Linux a form of I/O that appears to the program to be truly asynchronous.
(And it's not without some fair cause that file I/O is implemented as synchronous code. It's hard to do it asynchronously when it's a complex operation with many steps, where you are interacting with a page cache, not only for file contents but even caching the very metadata that describes how to get from an offset into a file to the block numbers where the data is stored. Windows has a famously profoundly asynchronous I/O system and even there actual file I/O is done just the same with synchronous logic run in a worker thread, or copied directly from the cache if the requested data is already there.)
You're right of course that there are lots of cases (missing metadata, synchronous operation like extending files, ...) where it's all offloaded to the wq.
Upon receipt of this message in the event E, the target consults its script (the actor analogue of program text), and using its current local state and the message as parameters, sends new messages to other actors and computes a new local state for itself.
It doesn't say anything about whether you do:
- nginx-style state machines in C
- callbacks in C++, or C++ 20 coroutines
- async/await in Rust
- Goroutines in Go
- async/await in Python or JS, with garbage collection
etc. I don't think the "actor model" really means that much these days.
What's a "canonical" and successful actor model program? What can we learn from such programs?
I think if you ask 5 people you'll get 5 different answers.
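For concreteness, the quoted definition reduces to a step from (local state, message) to (new local state, outgoing messages). Here is one minimal Python reading of it; the names `counter_script`, `log_script`, and `deliver_all` are invented for this sketch, and it deliberately takes no position on threads, callbacks, or coroutines, which is exactly the part the definition leaves open.

```python
from collections import deque


def counter_script(state, message):
    """Hypothetical script: returns (new local state, list of (target, message) to send)."""
    kind, payload = message
    if kind == "add":
        return state + payload, []
    if kind == "report":
        # payload names the actor that should receive the current total
        return state, [(payload, ("total", state))]
    return state, []


def log_script(state, message):
    """Hypothetical script: remembers every message it receives, sends nothing."""
    return state + [message], []


def deliver_all(scripts, states, initial_messages):
    """Deliver messages one at a time until none are left."""
    pending = deque(initial_messages)
    while pending:
        target, message = pending.popleft()
        states[target], outgoing = scripts[target](states[target], message)
        pending.extend(outgoing)
    return states


final = deliver_all(
    {"counter": counter_script, "log": log_script},
    {"counter": 0, "log": []},
    [("counter", ("add", 2)), ("counter", ("add", 3)), ("counter", ("report", "log"))],
)
```

Any of the styles in the list above could sit underneath `deliver_all` without changing the scripts.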
---
Also, with
__u8 opcode; /* type of operation for this sqe */
__s32 fd; /* file descriptor to do IO on */
then you have lost all static typing. It is too low level, so the analogy doesn't really hold up IMO.
Also, I don't understand why it's "do files want to be actors?", not "do Unix PROCESSES want to be actors?"
(copy of lobste.rs comment)
Do you need it at this level? At some point everything is a bit-field. We impose typing to aid our mental models and build useful abstractions.
When interacting with the kernel we can let go, then reclaim, our types
"an operating system is a collection of things that don't fit inside a language; there shouldn't be one"
-- Dan Ingalls
So what happens is that those runtimes build on top of whatever low level primitives are available, and that is about it.
Even considering UNIX alone, many ways to do asynchronous IO aren't even part of POSIX; they have remained specific to each UNIX flavour.
To some extent, UNIX/POSIX API surface has been the C and C++ standard library that WG14 and WG21 didn't want to take over into ISO, but almost every C and C++ developer expects to exist anyway.
Seems to me that it's a matter of perspective, specifically about who you're sending messages to.
You can consider the message as being sent to the file (descriptor).
You can also consider it a message sent to the kernel, in which case the kernel is an actor and the file is a passive data abstraction.
Both are accurate and useful; which one makes more sense will depend on context and your own background.
Naturally considering Erlang, and any other language with rich runtime and ecosystem as well.
How do they do syscalls? Is every one basically an IPC?
QNX and seL4 are two of the fastest ones, as general-purpose OSes. Then you have embedded ones for high integrity computing like INTEGRITY RTOS.
Here is some info:
https://swd.de/Support/Documents/Manuals/Neutrino-Microkerne...
https://docs.sel4.systems/Tutorials/#sel4-mechanisms-tutoria...
https://www.researchgate.net/publication/386549964_An_Overvi...
If the caller uses asynchronous native APIs to perform ipc instead of posix APIs then everything can be non blocking.
https://en.wikipedia.org/wiki/Grand_Central_Dispatch
(which uses pthreads)
Also it isn't regular pthreads on OS X, rather Apple's own flavour.
It is described in the Internals section.