We use a lot of async Rust internally, and created this library out of a need for an async-aware concurrent hashmap since there weren’t many available in the Rust ecosystem.
Whirlwind is a sharded HashMap with a fully asynchronous API. Just as dashmap is a replacement for std::sync::RwLock<HashMap>, whirlwind aims to be a replacement for tokio::sync::RwLock<HashMap>. It has similar design and performance characteristics to dashmap, but seems to perform better in read-heavy workloads under tokio's green threading.
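For context, the baseline it replaces looks roughly like this (a minimal sketch using only tokio and std; the key/value types are placeholders):

    use std::collections::HashMap;
    use std::sync::Arc;
    use tokio::sync::RwLock;

    // The pattern whirlwind aims to replace: one async RwLock guarding
    // the whole map, so every writer serializes behind a single lock.
    type SharedMap = Arc<RwLock<HashMap<String, u64>>>;

    async fn bump(map: &SharedMap, key: &str) {
        let mut guard = map.write().await; // one lock for the entire map
        *guard.entry(key.to_string()).or_insert(0) += 1;
    }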
Benchmarks are in the readme! We used an asyncified version of dashmap's benchmark suite. The project is in a pretty early stage and I'm sure there are flaws, but I'm pretty happy with the performance.
There is some unsafe involved, but we run Miri in CI to (hopefully) catch undefined behavior well before it lands in an actual release.
We'd appreciate any feedback! Thanks in advance :)
I'm curious about the practical benefits of handling a HashMap with an async interface. My long-standing understanding is that `tokio::sync::RwLock<HashMap>` is only useful when you'd want to hold the lock guard across another `await` operation; but when the lock guard is not held across an await, it is always preferable to use the synchronous version.
This would lead me to assume the same applies to dashmap: it should be sufficient for async use cases and doesn't need an async API unless we expect to be blocked. But the benchmarks indicate that whirlwind outperforms dashmap in various situations. Do you have a sense of where this blocking occurs in dashmap?
In all honesty, I was quite surprised by the benchmarks as well. I wouldn't have expected that much of a performance gain, but in high-contention scenarios it definitely makes sense.
By immediately waking itself, the task is rescheduled and will be polled again shortly. This creates a loop in which it repeatedly checks whether the lock is available, which is effectively no different from hint::spin_loop(), just in async form.
A more traditional lock uses a wait queue: the task is not polled again until the lock is likely available, rather than blindly retrying on repeat.
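A minimal sketch of the pattern being described (hypothetical; not the library's exact code):

    use std::future::Future;
    use std::pin::Pin;
    use std::sync::{RwLock, RwLockReadGuard};
    use std::task::{Context, Poll};

    // On a failed try_read, the future wakes itself immediately and
    // returns Pending, so the runtime re-polls it on its next pass.
    // This is the async analogue of a spin loop.
    struct ReadLockFuture<'a, T> {
        lock: &'a RwLock<T>,
    }

    impl<'a, T> Future for ReadLockFuture<'a, T> {
        type Output = RwLockReadGuard<'a, T>;

        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
            match self.lock.try_read() {
                Ok(guard) => Poll::Ready(guard),
                Err(_) => {
                    cx.waker().wake_by_ref(); // schedule an immediate re-poll
                    Poll::Pending
                }
            }
        }
    }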
- attempt to acquire an RwLock without blocking
- if it succeeds, wake the task waiting for the lock
- if it doesn't, return "not ready yet"
The RwLock is straight from the standard library; there's no loop and no spinning involved. Every single Future you look at will look like this: by specification, it cannot block; it can only return "ready" (and wake the task) or return "not ready".
It really is not much different from a thread blocking in a loop, where the OS thread scheduler preempts the thread and reschedules it for later. In this case it's an async task instead of a thread, and cooperative instead of preemptive, but it's still unfair and not a very effective use of scheduler resources.
That's not true. A Future is supposed to schedule itself to be woken up again when it's ready. This Future schedules itself to be woken immediately. Most runtimes, like Tokio, will put a Future that acts like this at the end of the run queue, so in practice it's not as egregious. However, it's unquestionably a spin lock, equivalent to backing off with thread::yield.
It would be good to see a benchmark where it only touches a _single_ key. If Whirlwind is still faster than the others, I would be far more convinced to use it unconditionally.
EDIT: I also see that you're benchmarking on an M3 Max. Note that this chip has a mix of performance and efficiency cores, which means that at low core counts the benchmark may well start out on the efficiency cores. From my experience it's quite important to do warmup phases in order to get stable results at low thread counts. And even then I find it hard to reason about the results, since you're running on a mixed set of cores…
Regarding your edit: damn I hadn't thought of that. I'll rerun the benchmarks on my Linux desktop with a Ryzen chip and update the readme.
There might be some fairness concerns here? Since we're polling the lock (instead of adding ourselves to a queue), it could be the case that some requests are constantly "too late" to acquire it. It could be interesting to see the min/max/median/P99 of the requests themselves. It seems that Bustle only reports the average latency[1], which honestly doesn't tell us much more than the throughput numbers.
[1]: https://docs.rs/bustle/latest/bustle/struct.Measurement.html
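For what it's worth, those percentiles are cheap to compute from raw per-request samples; a hypothetical helper (my names, not part of bustle):

    use std::time::Duration;

    // Report the tail percentiles that a plain average hides.
    fn percentile(sorted: &[Duration], p: f64) -> Duration {
        let idx = ((sorted.len() - 1) as f64 * p).round() as usize;
        sorted[idx]
    }

    fn report(mut samples: Vec<Duration>) {
        assert!(!samples.is_empty());
        samples.sort();
        println!(
            "min={:?} median={:?} p99={:?} max={:?}",
            samples[0],
            percentile(&samples, 0.50),
            percentile(&samples, 0.99),
            samples[samples.len() - 1],
        );
    }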
Fair point though; there would definitely be some benefit to having some of these things in the stdlib.
Although I guess they do implement Future on their own, so it shouldn't need a specific runtime then.
https://github.com/fortress-build/whirlwind/blob/main/src/sh...
https://github.com/fortress-build/whirlwind/blob/main/Cargo....
Rather than having 3-5 ways to do, say, a GUI in the stdlib, you have 0, and instead you pick a dependency that suits your needs.
That extends even to data structures: there are a bunch of trade-offs where one is better than another, and if Rust had to maintain 30 different types of maps for every use case, that seems like a waste of the team's time. Push that work off to the people who need those maps.
After some pondering though, there is a strength in the Rust team not having to maintain such a large stdlib, as you said. I find myself using external libraries for many things in Go already, so why not extend that to slightly more fundamental things with lots of variance too.
I think the largest downside with this approach, however, is that it's hard for me to know what the de facto, collectively-agreed-upon standard is for a given task.
Naming things is... well, you know...
When will this happen? I imagine a lot of people who want to use it might just forget about it if you say "do not use this until some unspecified point in the future".
Is this actually the case? I can't say I've seen the same.
A second reason is sharding: sharding based on a hash is quite simple to do, but sharding an ordered collection would be quite difficult, since some reads would need to search across multiple shards and thus take multiple locks.
If you mean internally (like for each shard), we're using hashbrown's raw HashTable API because it allows us to manage hashing entirely ourselves, and avoid recomputing the hash when determining the shard and looking up a key within a shard.
[0]: https://github.com/fortress-build/whirlwind/blob/0e4ae5a2aba...
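To make the hash-once point concrete, here's a hypothetical sketch (the names and the modulo shard selection are mine, not whirlwind's actual code):

    use std::collections::hash_map::RandomState;
    use std::hash::{BuildHasher, Hash};

    use hashbrown::HashTable;
    use tokio::sync::RwLock;

    // Hash the key once, pick a shard from that hash, then reuse the
    // same hash for the in-shard lookup via hashbrown's raw HashTable
    // API, so the key is never hashed a second time.
    struct ShardedMap<K, V> {
        shards: Vec<RwLock<HashTable<(K, V)>>>,
        hasher: RandomState,
    }

    impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
        async fn get(&self, key: &K) -> Option<V> {
            let hash = self.hasher.hash_one(key);
            let shard = &self.shards[(hash as usize) % self.shards.len()];
            let table = shard.read().await;
            table
                .find(hash, |(k, _)| k == key) // no re-hashing here
                .map(|(_, v)| v.clone())
        }
    }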
[0]: https://github.com/tokio-rs/tokio/blob/8897885425bf3d89053f8... [1]: https://github.com/tokio-rs/tokio/blob/8897885425bf3d89053f8...
In the common case, the waker side will run some logic like:

    set_flag();
    atomic_waker.wake();

and the future's poll will look like:

    atomic_waker.register(cx.waker());
    if flag_is_set() {
        Poll::Ready(())
    } else {
        Poll::Pending
    }
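Fleshed out into something compilable, that pattern looks like this (my names; assumes the futures crate's AtomicWaker):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::task::Poll;

    use futures::task::AtomicWaker;

    struct Signal {
        flag: AtomicBool,
        waker: AtomicWaker,
    }

    impl Signal {
        fn new() -> Self {
            Signal { flag: AtomicBool::new(false), waker: AtomicWaker::new() }
        }

        // The waker side: set the flag, then wake.
        fn notify(&self) {
            self.flag.store(true, Ordering::Release);
            self.waker.wake();
        }

        // The future side: register first, then check the flag, so a
        // wake racing with registration is never lost.
        async fn wait(&self) {
            std::future::poll_fn(|cx| {
                self.waker.register(cx.waker());
                if self.flag.load(Ordering::Acquire) {
                    Poll::Ready(())
                } else {
                    Poll::Pending
                }
            })
            .await;
        }
    }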
Thus, even if `register` decides to "spin", the flag will already be set, and so the future will not spin in reality (it may be polled unnecessarily, but this will only happen once). I can't immediately think of examples where support for spurious waking is necessary.
And, if we want to keep it as a spinlock, I'm curious how the immediate wakeup compares to using `tokio::task::yield_now`: https://docs.rs/tokio/latest/tokio/task/fn.yield_now.html
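A rough sketch of that yield-based variant (hypothetical, tokio-specific):

    use std::sync::{RwLock, RwLockReadGuard};

    // Instead of waking ourselves and returning Pending, explicitly
    // yield back to the scheduler between attempts, so other tasks on
    // this worker get a chance to run (and to release the lock).
    async fn read_with_yield<T>(lock: &RwLock<T>) -> RwLockReadGuard<'_, T> {
        loop {
            if let Ok(guard) = lock.try_read() {
                return guard;
            }
            tokio::task::yield_now().await;
        }
    }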