I must be missing something - isn’t Go supposed to be memory efficient? Perhaps promises and goroutines aren’t comparable?
This probably doesn't matter in the real world, as you will actually use the tasks to do some real work, which should dwarf the static overhead in almost all cases.
Am I wrong?
UPDATE: From the replies below, it looks like I was right about "no concurrency takes place", but I was wrong about how long it takes, because `tokio::time::sleep()` keeps track of when the future was created (i.e., when `sleep()` was called) instead of when the future is first `.await`ed (which was my unstated assumption).
[1]: https://docs.rs/tokio/1.41.1/src/tokio/time/sleep.rs.html#12...
I’m not a Rust programmer, but I strongly suspect this updated explanation is erroneous. It’s probably more like this: the start time is recorded when the task execution is started. However, the task immediately yields control back to the async loop. Then the async loop starts another task, and so on. It’s just that the async loop returns control to a sleeping task no earlier than the moment 1s has passed since that task's execution was initially started. I’d be surprised if it had anything to do with when sleep() was called.
On the other hand, I maintain that this is an incidental rather than essential reason for the program finishing quickly. In that benchmark code, we can replace "sleep" with our custom sleep function which does not record start time before execution:
async fn wrapped_sleep(d: Duration) {
sleep(d).await
}
The following program will still finish in ~10 seconds:
#[tokio::main]
async fn main() {
let num_tasks = 100;
let mut tasks = Vec::new();
for _ in 0..num_tasks {
tasks.push(wrapped_sleep(Duration::from_secs(10)));
}
futures::future::join_all(tasks).await;
}
Regardless, the program you provided _does_ actually run the futures concurrently, because of the `join_all()`. My point above was that in the original blog post, the appendix has a version without `join_all()`, which has no concurrency.
Note the absolute numbers here: in the worst case, 1M tasks consumed 2.7 GB of RAM, with ~2700 bytes of overhead per task. That would still fit in the cheapest server with room to spare.
My conclusion would be the opposite: as long as per-task data is more than a few KB, the memory overhead of the task scheduler is negligible.
That's kind of the Achilles' heel of the benchmark. Any business needing to spawn 1 million tasks certainly wants to do something with them. It's the "do something with them" part that usually leads to difficulties for these things, not really the "spawn a million tasks" part.
Like say you are making a server, and each client has 16KB of state. Then memory usage would be 17KB in Node vs 19 KB in Go. Smaller? Yes. Smaller enough that you want to rewrite the whole app? Probably not.
Let's launch N concurrent tasks, where each task waits for 10 seconds, and the program exits after all tasks finish. The number of tasks is controlled by a command-line argument.
Leaving aside semantics like "since the tasks aren't specified as doing anything with side effects, the compiler can remove them as dead code", all you really need here is a timer and a continuation for each "task" -- i.e. 24 bytes on most platforms. Allowing for allocation overhead and a data structure to manage all the timers efficiently, you might use as much as double that; with some tricks (e.g. function pointer compression) you could get it down to half that.
Eyeballing the graph, it looks like the winner is around 200MB for 1M concurrent tasks, so about 4x worse than a reasonably efficient but not heavily optimized implementation would be.
I have no idea what Go is doing to get 2500 bytes per task.
TFA creates a goroutine (green thread) for each task (using a waitgroup to synchronise them). IIRC goroutines default to 2k stacks, so that’s about right.
One could argue it’s not fair and it should be timers which would be much lighter. There’s no “efficient wait” for them but that’s essentially the same as the appendix rust program.
If that memory isn't being used and other things need the memory then the OS will very quickly dump it into swap, and as it's never being touched the OS will never need to bring it back in to physical memory. So while it's allocated it doesn't tie up the physical RAM.
It’s not memory that’s consumed by the runtime, it’s memory the runtime expects the program to use - it’s just that this program does no useful work.
In the Go number reported, the majority of the memory is the stack Go allocated for the application code anticipating processing to happen. In the Node example, the processing instead will need heap allocation.
Point being that the two numbers are different - one measures just the overhead of the runtime, the other adds the memory reserved for the app to do work.
The result then looks wasteful for Go because the benchmark.. doesn’t do anything. In a real app though, preallocating stack can often be faster than doing just-in-time heap allocation.
Not always of course! Just noting that the numbers are different things; one is runtime cost, one is runtime cost plus an optimization that assumes memory will be needed for processing after the sleep.
I agree that it would also be interesting to benchmark some actual stack or heap usage and how the runtimes handle that, but if you are running a massively parallel app you do sometimes end up scheduling jobs to sleep (or perhaps, more likely, to prepare to act but they never do and get cancelled). So I think this is a valid concern, even though it's not the only thing that matters.
If a process allocates many pages of virtual memory, but never actually reads or writes to that memory, then it's unlikely that any physical memory backs those pages. In this sense, allocating memory is really just bookkeeping in the operating system. It's when you try to read or write that memory that the operating system will actually allocate physical memory for you.
When you first try to access the virtual memory you've allocated, there will be a page fault, causing the OS to determine if you're actually allowed to read or write to it. If you've previously allocated it, then all is good, and the OS will allocate some physical memory for you, and make your virtual memory pointers point to that physical memory. If not, well, then that's a segfault. It's not until you first try to use the memory that you actually consume RAM.
We, as devs, have "4" such resources available to us, memory, network, I/O and compute. And it behooves us to not prematurely optimize on just one.
[0] I can see more arguments/discussions now, "2K is too low, it should be 2MB" etc...!
I guess that’s true.
And to be clear, I do agree with the top comment (which seems to be by you): TFA uses timers in the other runtimes, and Go does have timers, so using goroutines is unwarranted and unfair. And I said as much a few comments up (although I’d forgotten about AfterFunc, so I’d have looped and waited on timer.After, which would still have been a pessimisation).
And after thinking more about it, the article is also outright lying: technically it’s only measuring tasks in Go. Timers are futures / awaitables, but they’re not tasks: they’re not independently scheduled units of work, and are pretty much always special-cased by runtimes.
2k stacks are an interesting design choice though... presumably they're packed, in which case stack overflow is a serious concern. Most threading systems will do something like allocating a single page for the stack but reserving 31 guard pages in case it needs to grow.
In reality it does use a guard area (technically I think it's more of a redzone? It doesn't cause access errors and functions with known small static frames can use it without checking).
I guess if you're already doing garbage collection moving the stack doesn't make things all that much worse though... still, yuck.
And it’s probably not the worst issue because deep stacks and stack pointers will mostly be relevant for long running routines which will stabilise their stack use after a while (even if some are likely subject to threshold effects if they’re at the edge, I would not be surprised if some codebases ballasted stacks ahead of time). Also because stack pointers will get promoted to the heap if they escape so the number of stack pointers is not unlimited, and the pointer has to live downwards on the stack.
NodeJS has one thread with a very tight loop.
Go actually spawned 1M green threads.
Honestly this benchmark is just noise, if not outright useless for most real-world scenarios, especially because each operation is doing nothing. It would be somewhat useful if they were doing some operation like a DB or HTTP call.
For example, for node, the author puts a million promises into the runtime event loop and uses `Promise.all` to wait for them all.
This is very different from, say, the Go version where the author creates a million goroutines and puts `waitgroup.Done` as a defer call.
While this might be the idiomatic way of concurrency in the respective languages, it does not account for how goroutines are fundamentally different from promises, and how the runtime does things differently. For JS, there's a single event loop. Counting the JS execution threads, the event loop thread and whatever else the runtime uses for async I/O, the execution model is fundamentally different from Go. Go (if not using `GOMAXPROCS`) spawns an OS thread for every physical thread that your machine has, and then uses a userspace scheduler to distribute goroutines to those threads. It may spawn more OS threads to account for OS threads sleeping on syscalls. Although I don't think the runtime will spawn extra threads in this case.
It also depends on what the "concurrent tasks" (I know, concurrency != parallelism) are. Tasks such as reading a file or doing a network call are better done with something like promises, but CPU-bound tasks are better done with goroutines or Node worker_threads. It would be interesting to see how the memory usage changes when doing async I/O vs CPU-bound tasks concurrently in different languages.
But I do think that spawning a goroutine just to do a non-blocking task and get its return is kinda wasteful.
func test2(count int) {
    timers := make([]*time.Timer, count)
    for idx := range timers {
        timers[idx] = time.NewTimer(10 * time.Second)
    }
    for idx := range timers {
        <-timers[idx].C
    }
}
This yields 263552 maximum resident set size (kbytes) according to `/usr/bin/time -v`.

I'm not sure if I missed it, but I don't see the benchmark specify how the memory was measured, so I assumed `time -v`.
Of course each language will have a different way of achieving this task each of which will have their unique pros/cons. That's why we have these different languages to begin with.
The way the results are presented, a reader may think the numbers all measure the same thing - boilerplate, ticket-to-play - and then the Go usage sounds super high.
But they are not the same; that memory is in anticipation of a real world program using it
Point being: someone reading this to choose which runtime fits their use case needs to be careful not to assume the numbers measure the same thing. For some real-world use cases, the pre-allocated stack will perform better than the runtimes that do heap allocations instead.
Apart from I/O, allocating memory is usually the slowest thing you can do on a computer, in my experience.
That's not a real requirement though. No business actually needs to run 1 million concurrent tasks with no concern for what's in them.
Go heavily encourages a certain kind of programming; JavaScript heavily encourages a different kind; and the article does a great job at showing what the consequences are.
True, but it really doesn't encourage you to run 1m goroutines with the standard memory setting. Though it's probably fair to run Go wastefully when you're comparing it to Promise.All.
Well... I'm actually not sure what idiomatic means (English isn't my first language), but it's the standard way of doing it. You'll even find it as steps 2 and 3 here: https://go.dev/tour/concurrency/1
> or the best way you can imagine
I would do a lot more to tune it if I were in a position where I knew it would run that many "tasks". I think what many non-Go programmers might run into here is that Go doesn't come with any sort of "magic". Instead it comes with a highly opinionated way of doing things. Compare that to C#, which comes with a highly optimized CLR and a bunch of really excellent libraries that are continuously optimized by Microsoft, and you're going to end up with an article like this. The async libraries are keeping track of which tasks are running (though Promise.all is obviously also binding a huge amount of memory you don't have to), while the Go example is running 1 million at once.
You'll also notice that there is no benchmark for execution time. With Go you might actually want to pay with memory, though I'd argue that you'd almost never want to run 1 million Goroutines at once.
Though to be fair to this specific author, it looks like they copied the previous benchmarks and then ran it as-is.
It's not part of the actual documentation either, at least not exactly: https://go.dev/doc/effective_go#concurrency You will achieve much the same if you follow it, but my answer should have been yes and no as far as being the "standard" Go way.
As far as practicality goes I actually agree with you: if I knew I were trying to do something to the order of 1,000,000 tasks in Go I would probably use a worker pool for this exact reason. I have done this pattern in Go. It is certainly not unidiomatic.
However, it also isn't the obvious way to do 1,000,000 things concurrently in Go. The obvious way to do 1,000,000 things concurrently in Go is to do a for loop and launch a Goroutine for each thing. It is the native unit of task. It is very tightly tied to how I/O works in Go.
If you are trying to do something like a web server, then the calculus changes a lot. In Go, due to the way I/O works, you really can't do much but have a goroutine or two per connection. However, on the other hand, the overhead that goroutines imply starts to look a lot smaller once you put real workloads on each of the millions of tasks.
This benchmark really does tell you something about the performance and overhead of the Go programming language, but it won't necessarily translate to production workloads the way that it seems like it will. In real workloads where the tasks themselves are usually a lot heavier than the constant cost per task, I actually suspect other issues with Go are likely to crop up first (especially in performance critical contexts, latency.) So realistically, it would probably be a bad idea to extrapolate from a benchmark this synthetic to try to determine anything about real world workloads.
Ultimately though, for whatever purpose a synthetic benchmark like this does serve, I think they did the correct thing. I guess I just wonder exactly what the point of it is. Like, the optimized Rust example uses around 0.12 KiB per task. That's extremely cool, but where in the real world are you going to find tasks where the actual state doesn't completely eclipse that metric? Meanwhile, Go is using around 2.64 KiB per task. 22x larger than Rust as it may be, it's still not very much. I think for most real world cases, you would struggle to find too many tasks where the working set per task is actually that small. Of course, if you do, then I'd reckon optimized async Rust will be a true barn-burner at the task, and a lot of those cases where every byte and millisecond counts, Go does often lose. There are many examples.[1]
In many cases Go is far from optimal: Channels, goroutines, the regex engine, various codec implementations in the standard library, etc. are all far from the most optimal implementation you could imagine. However, I feel like they usually do a good job making the performance very sufficient for a wide range of real world tasks. They have made some tradeoffs that a lot of us find very practical and sensible and it makes Go feel like a language you can usually depend on. I think this is especially true in a world where it was already fine when you can run huge websites on Python + Django and other stacks that are relatively much less efficient in memory and CPU usage than Go.
I'll tell you what this benchmark tells me really though: C# is seriously impressive.
[1]: https://discord.com/blog/why-discord-is-switching-from-go-to...
> I'll tell you what this benchmark tells me really though: C# is seriously impressive.
The C# team has done some really great work in recent years. I personally hate working with it and its "magic", but it's certainly in a very good place as far as trusting the CLR to "just work".
Hilariously I also found the Python benchmark to be rather impressive. I was expecting much worse. Not knowing Python well enough, however, makes it hard to really "trust" the benchmark. A talented Python team might be capable of reducing memory usage as much as following every step of the Go concurrency tour would for Go.
If you do not like the aesthetics of C# and find Elixir or OCaml family tolerable - perhaps try F#? If you use task CEs there you end up with roughly the same performance profile and get to access huge ecosystem making it one of the few FP languages that can be used in production with minimal risk.
I don't think C# does it at no cost. I think its "attachment" to Clean Code makes most C# code bases horrible messes after a while. I know this is a preference thing and that many people will disagree, but I've seen C# code bases that were so complicated to work with that they were actively hindering the development team's ability to meet the business needs. You don't have to write C# that way, but that's what happens in almost every company where I live.
> If you do not like the aesthetics of C# and find Elixir or OCaml family tolerable - perhaps try F#? If you use task CEs there you end up with roughly the same performance profile and get to access huge ecosystem making it one of the few FP languages that can be used in production with minimal risk.
I mean, I don't think I'll ever have to work within the dotnet ecosystem. The way things are going in the green energy and finance sector, which is where my career has taken me, I'll mostly get to work with Python (with C/Zig) or Go and possibly Java. C# and dotnet are almost exclusively used at stagnant small-to-medium-sized companies and in the consultancy business servicing those companies. This is not because of C# or dotnet but more because of the developer landscape. Java is big in "older" organisations because it's what was taught in universities and because it was always good; Go is replacing C#/Java in a lot of newer companies because there are a lot of success stories around it and a lot of the Java developers are retiring. Python is growing really big because a lot of non-SWE engineers and accountant types are using it, as well as because of how it's used in ML/AI/data warehousing. PHP is big in the web-shop industry, and so on. C# mainly made its way into businesses at places which ran a lot of Windows servers. Since organisations rarely change tech stacks in the more "boring" parts of the world, it's not likely to change much.
I don't think dotnet or C# are bad. I write some PowerShell for Azure automation to help IT operations from time to time, but I really don't like working with C# (or Java). I would personally like to work with Rust or more Zig at some point, but it's not like anyone is adopting Rust around here, and while Zig can be used for some things in place of C, it's not really "production ready" for most things.
It was probably the intent of the parent to mean 'making use of the particular features of the language that are not necessarily common to other languages'.
I'm not a programmer, but you appear to give good examples.
I hope I'm not teaching you to suck eggs... {That's an idiom, meaning teaching someone something they're already expert in. Like teaching your Grandma to suck eggs - which weirdly means blowing out the insides of a raw egg. That's done when using the egg to paint; which is a traditional Easter craft.}
I think a better comparison would be wasting CPU for 10 seconds instead of sleeping.
Instead, there's usually going to be some queue outside the VM that will leave you with _some_ sort of chunking and otherwise working in smaller, more manageable bits (that might, incidentally, be shaped in ways that the VM can handle in interesting ways).
It's definitely true to say that the "idiomatic" way of handling things is worth going into, but if part of your synthetic benchmark involves doing something quite out of the ordinary, it feels suspicious.
I generally agree that a "real" benchmark here would be nice. It would be interesting if someone could come up with the "minimum viable non-trivial business logic" that people could use for these benchmarks (perhaps coupled with automation tooling to run the benchmarks)
But neither would you wait on a waitgroup of size 1 million in Go... right?
But if I have 1 million tasks which spend 10% of their time on CPU-bound code, intermixed with other IO-bound code, and I just want throughput and I'm too lazy to use a proper task queue, then why not?
But I can go further than that: no professional programmer should run 1M concurrent tasks on an ordinary CPU in any language, because it makes no sense when the CPU has several orders of magnitude fewer cores. The tasks are not going to run in parallel anyway.
[1] https://www.freecodecamp.org/news/million-websockets-and-go-...
And we're talking about a naive microbenchmark. If you were actually building a service like that in Go (millions of active connections) and you were very concerned about memory usage, you wouldn't be naive enough to use a goroutine for every connection. Instead, you would use something like gnet or another solution based directly on epoll events, combined with a worker pool.
In the titular post there's a link to a previous comparison between approaches, and plain OS threads used from Rust fare quite well, even if the author doesn't up the OS limits to keep that in the running for the higher thread cases: https://pkolaczk.github.io/memory-consumption-of-async/
$ cat /proc/sys/kernel/pid_max
4194304
My computer can handle that many processes; after that, no new processes can be spawned (see: forkbomb).

What’s missing here is that all these async/await, lightweight-thread, etc. features presumably exist because OS processes and threads consume too many resources to feasibly serve a similar role.
However nobody seems to have any hard numbers about the topic. How bad is a Linux context switch? How much memory does a process use? Suddenly everyone is a micro optimizer seeking to gain a small constant multiple.
Since this is not ready at hand I suspect the rewards are much less clear. It’s more likely that the languages benefit from greater cross platform control and want to avoid cross platform inconsistencies.
* gunicorn doesn't have worker scaling, they're all always running, so async workers were to not waste those resources idling while still allowing lots of simultaneous clients
* The limit was 32768 not that long ago, which was at least possible to hit with a ton of clients
Note 1: The gist is in Ukrainian, and the blog post by Steve does a much better job, but hopefully you will find this useful. Feel free to replicate the results and post them.
Note 2: The absolute numbers do not necessarily imply good/bad. Both Go and BEAM focus on userspace scheduling and its fairness. Stackful coroutines have their own advantages. I think where the blog post's data is most relevant is understanding the advantages of stackless coroutines when it comes to "highly granular" concurrency - dispatching concurrent requests, fanning out to process many small items at once, etc. In any case, I did not expect sibling comments to go onto praising Node.js, is it really that surprising for event loop based concurrency? :)
Also, if you are an Elixir aficionado and were impressed by C#'s numbers - know that they translate ~1:1 to F# now that it has task CE, just sayin'.
Here's how the program looks in F#:
open System
open System.Threading.Tasks
let argv = Environment.GetCommandLineArgs()
[1..int argv[1]]
|> Seq.map (fun _ -> Task.Delay(TimeSpan.FromSeconds 10.0))
|> Seq.toArray
|> Task.WaitAll
Source: https://github.com/microsoft/referencesource/blob/master/msc...
Please note that the link above leads to old code from .NET Framework.
The up-to-date .NET implementation lives here: https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
https://github.com/dotnet/runtime/blob/1f01bee2a41e0df97089f...
var cnt = int.Parse(args[0]);
var evt = new CountdownEvent(cnt);
for (var i = 0; i < cnt; i++) {
async Task Execute() {
await Task.Delay(TimeSpan.FromSeconds(10));
evt.Signal();
}
_ = Execute();
}
evt.Wait();
It ends up consuming roughly 264.5 MB on ARM64 macOS 15.1.1 (compiled with NativeAOT).

But then they can measure memory by simply using a thread pool of size 1 and submitting tasks to it, right? That would be the equivalent comparison for other languages.
They should launch a million NodeJS processes.
Well, if it isn't the classic unwavering confidence that an artificial "hello world"-like benchmark is in any way representative of real world programs.
> Some folks pointed out that in Rust (tokio) it can use a loop iterating over the Vec instead of join_all to avoid the resize to the list
Right, but some folks also pointed out you should've used an array in Java in the previous blog post, 2 years ago, and you didn't do that.
And folks also pointed out Elixir shouldn't have used Task in the previous benchmark (folk here being the creator of Elixir himself): https://github.com/pkolaczk/async-runtimes-benchmarks/pull/7
However, I don't think that underlying array is resized every time `add` is called. I'd expect that resize will happen less than 30 times for 1M adds (capacity grows geometrically with a=10 and r=1.5)
Given the amortized linear time complexity, it seems obvious that adding a thread pointer to the list won't contribute substantially to the thread creation time.
The way these kinds of operations can be implemented for amortized linear time (also in Python, C realloc, etc) is explained in https://en.wikipedia.org/wiki/Dynamic_array#Geometric_expans...
The cost wouldn't be just Memory because the network card and CPU also enter the game.
Rust's `join_all` uses `FuturesUnordered` behind the scenes, which is pretty intelligent in terms of keeping track of which tasks are ready to make progress, but it does not use tokio/async_std for scheduling. AFAICT the only thing being measured about tokio/async_std is the heap size of their `sleep` implementations.
I'd be very interested in seeing how Tokio's actual scheduler performs. The two ways to do that are:
- using https://docs.rs/tokio/latest/tokio/task/join_set/struct.Join... to spawn all the futures and then await them
- spawn each future in the global scheduler, and then await the JoinHandles using the for loop from the appendix
As other commenters have noted, calling `sleep` only constructs a state machine. So the Appendix isn't actually concurrent. Again, you need to either put those state machines into the tokio/async_std schedulers with `spawn`, or combine the state machines with `FuturesUnordered`.
I guess some FF bug then.
Thanks.
setTimeout[promisify.custom] === require('node:timers/promises').setTimeout
You could of course manually wrap `setTimeout` yourself as well:
const sleep = n => new Promise(resolve => setTimeout(resolve, n))
[1] https://nodejs.org/docs/latest-v22.x/api/util.html#utilpromi...

This is a sample of one use case (so of questionable real-worldness), but the difference is really eye-opening. Congrats to the C# team!
The JIT compiler that Microsoft created has been nothing short of amazing.
Consider the following code:
package main

import (
    "os"
    "strconv"
    "time"
)

func main() {
    numTimers, _ := strconv.Atoi(os.Args[1])
    timerChan := make(chan struct{})
    // Goroutine 1: Schedule timers
    go func() {
        for i := 0; i < numTimers; i++ {
            timer := time.NewTimer(10 * time.Second)
            go func(t *time.Timer) {
                <-t.C
                timerChan <- struct{}{}
            }(timer)
        }
    }()
    // Goroutine 2: Receive and process timer signals
    for i := 0; i < numTimers; i++ {
        <-timerChan
    }
}

Also for Node it's weird not to have Bun and Deno included. I suppose you can have other runtimes for other languages too.
In the end I think this benchmark is comparing different things and not really useful for anything...
I urge anyone making decisions from looking at these graphs to run this benchmark themselves and add two things:
- Add at least the most minimal real world task inside of these function bodies to get a better feel for how the languages use memory
- Measure the duration in addition to the memory to get a feel for the difference in scheduling between the languages
Processes in Erlang, Goroutines in Go and Virtual Threads in Java do not fully replace lightweight asynchronous state machines - many small highly granular concurrent operations is their strength.
[0]: https://gist.github.com/neon-sunset/8fcc31d6853ebcde3b45dc7a... (disclaimer: as pointed out in a sibling comment it uses Elixir's Task abstraction which adds overhead on top of the processes)
import kotlin.time.Duration.Companion.milliseconds
import kotlin.time.measureTime
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
suspend fun main() {
measureTime {
coroutineScope {
(0..1000000).map {
async {
delay(1.milliseconds)
}
}.awaitAll()
}
}.let { t ->
println("Took $t")
val runtime = Runtime.getRuntime()
val maxHeapSize = runtime.maxMemory()
val allocatedHeapSize = runtime.totalMemory()
val freeHeapSize = runtime.freeMemory()
println("Max Heap: ${maxHeapSize / 1024 / 1024} MB")
println("Allocated Heap: ${allocatedHeapSize / 1024 / 1024} MB")
println("Free Heap: ${freeHeapSize / 1024 / 1024} MB")
}
}
This produces the following output:
Took 1.597011084s
Max Heap: 4096 MB
Allocated Heap: 2238 MB
Free Heap: 1548 MB
So, whatever is needed to load classes and a million coroutines with some heap state. Of course the whole thing isn't doing any work, so this isn't much of a benchmark. And if I run it with kotlin-js it actually ends up using promises, so it's not going to be any better there than on the JVM.

async function main() {
const numTasks = parseInt(process.argv[2], 10);
const taskDuration = 10000; // 10 seconds
const tasks = Array.from({ length: numTasks }, () =>
new Promise(resolve => setTimeout(resolve, taskDuration))
);
await Promise.all(tasks); // Wait for all tasks to resolve
console.log("All tasks completed.");
}
main().catch(err => {
console.error("Error occurred:", err);
});
Go won because it served a need felt by many programmers: a garbage-collected language which compiled to native code, with robust libraries supported by a large corp.
With Native AOT, C# is walking into the same space. With arguably better library selection, equivalent performance, and native code compilation. And a much more powerful, well-thought-out language - at a slight complexity cost. If you're starting a project today (with the luxury of choosing a language), you should give C# + NativeAOT a consideration.
It's nice to have that stuff when you know the language, but it does make the learning curve steeper and it can be a bit annoying when working in a team.
Even after 4 years of using it professionally, I still sometimes see code that uses obscure syntax I had no idea existed. I would describe C# as a language for experts. If you know what you're doing it's an amazing language, maybe actually the best current programming language. But learning and understanding everything is a monumental task; simpler languages like Go or Java can be learned much faster.
Several of the low-memory-usage languages do the second, while the high-memory-usage results do the first. For example, on my machine the article's Go code uses 2.5 GB of memory, but the following code uses only 124 MB. That difference is in line with the Rust results.
package main

import (
	"os"
	"strconv"
	"sync"
	"time"
)

func main() {
	numRoutines, _ := strconv.Atoi(os.Args[1])
	var wg sync.WaitGroup
	for i := 0; i < numRoutines; i++ {
		wg.Add(1)
		time.AfterFunc(10*time.Second, wg.Done)
	}
	wg.Wait()
}
Even at 100k tasks the bottleneck is going to be the network stack (sending outgoing 400k RPS takes a lot of CPU and syscall overhead, even with SocketAsyncEngine!).
Doing so in Go would require either spawning goroutines, or performing scheduling by hand or through some form of aggregation over channel readers: something that Tasks make immediately available.
The concurrency primitive overhead becomes more important if you want to quickly interleave multiple operations at once. In .NET you simply do not await them at the call site until you need their result later - this post showcases how low the overhead of doing so is.
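That "start now, await the result later" shape can be approximated in Go with a channel per operation. This is a sketch with a hypothetical helper name, not a real API from either runtime:

```go
package main

import (
	"fmt"
	"time"
)

// startWork kicks off a computation immediately and returns a channel
// that yields the result later -- loosely analogous to holding an
// unawaited Task in .NET. (Hypothetical helper, for illustration only.)
func startWork(id int) <-chan string {
	out := make(chan string, 1)
	go func() {
		time.Sleep(10 * time.Millisecond) // stand-in for real work
		out <- fmt.Sprintf("result %d", id)
	}()
	return out
}

func main() {
	// Start several operations without blocking at the call site...
	pending := make([]<-chan string, 0, 3)
	for i := 0; i < 3; i++ {
		pending = append(pending, startWork(i))
	}
	// ...and only block when the results are actually needed.
	for _, ch := range pending {
		fmt.Println(<-ch)
	}
}
```

The extra channel allocation and goroutine per operation is exactly the overhead the parent comment is pointing at; a .NET Task carries the same idea as a single heap object.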
for (n=0;n<10;n++) { sleep(1 second); }
Changes the results quite a bit: for some reason Java uses a _lot_ more memory and takes longer (~20 seconds), C# uses more than 1 GB of memory, while Python struggles just to schedule all those tasks and takes more than a minute (besides using more memory). node.js seems unfazed by this change.
I think this would be a more reasonable benchmark
This (AOT-compiled) F# implementation peaks at 566 MB with WKS GC and 509 MB with SRV GC:
open System
open System.Threading
open System.Threading.Tasks
let argv = Environment.GetCommandLineArgs()

[1..int argv[1]]
|> Seq.map (fun _ ->
    task {
        let timer = PeriodicTimer(TimeSpan.FromSeconds 1.0)
        let mutable count = 10
        while! timer.WaitForNextTickAsync() do
            count <- count - 1
            if count = 0 then timer.Dispose()
    } :> Task)
|> Task.WaitAll
To Go's credit, it remains at a consistent 2.53 GB and consumes quite a bit less CPU. We're really spoiled for choice these days in compiled languages. It takes 1M coroutines to push the runtime, and even at 100k the impact is easily tolerable, which is far more than regular applications would see. At 100k, .NET consumes ~57 MB and Go consumes ~264 MB (and wins at CPU by up to 2x).
And the errors are a feature — I learn the most from the errata!
SCO: "thread creation is about a thousand times faster than on native Linux"
A Linux kernel mailing list thread where Linus Torvalds replies: "Talk is cheap. Show me the code."
https://lkml.org/lkml/2000/8/25/132
Also, some other languages seem to be misrepresented.
Seems like someone had good intentions but no real knowledge of the languages they tried to compare, and the result is this article.
For me Python worked surprisingly well, while Go was surprisingly high on memory consumption.
I believe each task in Julia has its own stack, so this makes sense. Still, it does mean you've got to take account of ~16 KB of memory per running task which is not great.
The Rust code is really checking how big Tokio's structures for tracking timers are. Solving the problem in a fully degenerate manner, the following code runs correctly and uses only 35 MB peak. 35 bytes per future seems pretty small; 1 billion futures was ~14 GB and ran fine.
use std::future::Future;
use std::iter;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() {
    let sleep = SleepUntil {
        end: Instant::now() + Duration::from_secs(10),
    };
    let timers: Vec<_> = iter::repeat_n(sleep, 1_000_000).collect();
    for sleep in timers {
        sleep.await;
    }
}
#[derive(Clone)]
struct SleepUntil {
    end: Instant,
}

impl Future for SleepUntil {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
        if Instant::now() >= self.end {
            Poll::Ready(())
        } else {
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}
Note: I do understand why this isn't good code, and why it solves a subtly different problem than posed (the sleep is cloned, including the deadline, so every timer is the same). The point I'm making here is that synthetic benchmarks often measure something which doesn't help much. While the above is really degenerate, it shares the same problems as the article's code (it just leans into those problems much harder).
Note that Go and Java code are not doing the same! See xargon7 comment.
I feel this is so misleading. For example, by default after spawning, Erlang preallocates some memory for each process, so they don't need to ask the operating system for new allocations (and if you want to shrink it, you call hibernate).
Do something more real, like message passing with one million processes or websockets. Or 1M tcp connections. Because, the moment you send messages, here is when the magic happens (and memory would grow, the delay when each message is processed would be different in different languages).
Oh, and btw, if you want to do THAT in Erlang, use timer:apply_after(Time, Module, Function, Arguments), which does not spawn an Erlang process; it just puts the task into the timer scheduling table.
And Elixir was in the old article, and they implemented it all wrong. Sad.
List<Task> tasks = new List<Task>(numTasks);
https://learn.microsoft.com/en-us/dotnet/core/deploying/nati...
This is the code:
# /// script
# requires-python = ">=3.12"
# dependencies = ["uvloop"]
# ///
import asyncio
import sys
import uvloop
async def main(num_tasks):
    tasks = []
    for task_id in range(num_tasks):
        tasks.append(asyncio.sleep(10))
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    num_tasks = int(sys.argv[1])
    # uvloop.run(main(num_tasks))
    asyncio.run(main(num_tasks))
I ran it with 100k tasks:
/usr/bin/time -l -p -h uv run async-memory.py 100000
On my M1 MacBook Pro, using asyncio reports (~170MB): 170835968 maximum resident set size
Using uvloop (~204MB): 204259328 maximum resident set size
I kept the `import uvloop` statement when just using asyncio so that both cases start under the same conditions.