The Fastest Mutexes
874 points by jart 2 days ago | 340 comments
  • kccqzy 2 days ago |
    > The reason why Cosmopolitan Mutexes are so good is because I used a library called nsync. It only has 371 stars on GitHub, but it was written by a distinguished engineer at Google called Mike Burrows.

    Indeed this is the first time I've heard of nsync, but Mike Burrows also wrote Google's production mutex implementation at https://github.com/abseil/abseil-cpp/blob/master/absl/synchr... I'm curious why this mutex implementation is absent from the author's benchmarks.

    By the way, if the author farms out to __ulock on macOS, this could be achieved more simply by just using the wait() and notify_one() member functions in libc++'s atomic library.
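
    A sketch of that approach (C++20 atomic wait/notify; the lock-word protocol and names here are illustrative, not from any particular implementation):

        #include <atomic>

        std::atomic<int> state{0};  // 0 = unlocked, 1 = locked (illustrative)

        void block_until_unlocked() {
            int cur = state.load(std::memory_order_acquire);
            while (cur != 0) {
                state.wait(cur, std::memory_order_acquire);  // sleeps in the kernel
                cur = state.load(std::memory_order_acquire);
            }
        }

        void unlock_and_wake() {
            state.store(0, std::memory_order_release);
            state.notify_one();  // wake a single waiter
        }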

    A while ago there was also a giant thread related to improving Rust's mutex implementation at https://github.com/rust-lang/rust/issues/93740#issuecomment-.... What's interesting is that there's a detailed discussion of the inner workings of almost every popular mutex implementation.

    • tialaramex 2 days ago |
      Although it does get there eventually, that Rust thread is about Mara's work, which is why it eventually mentions her January 2023 book.

      The current Rust mutex implementation (which that thread does talk about later) landed earlier this year, and although on Linux it's not (much?) different, on Windows and Mac I believe it's new work.

      That said, Mara's descriptions of the guts of other people's implementations are still interesting; just make sure to check whether they're outdated for your situation.

      • SkiFire13 2 days ago |
        > although if you're on Linux it's not (much?) different

        AFAIK one reason to switch was that mutexes on Linux and macOS were not guaranteed to be movable, so every Rust Mutex had to box the underlying OS mutex and was not const-constructible. So this was a considerable change.

        • tialaramex 2 days ago |
          That's the reason for Mara's Mutex. I know, it doesn't seem like five minutes, but Mara's is now the previous version of the Rust Mutex implementation.

          std::sync::Mutex::new became const in 1.63, which was over two years ago.

          • lilyball 2 days ago |
            It looks like Mara also did the work to make it const-constructible, so it's still her implementation, no? https://github.com/rust-lang/rust/pull/97647
            • tialaramex 2 days ago |
              Sorry, maybe I was unclear, when I pointed out that this work was in 2022 that's not because I was trying to say it's recent, that's two years ago. The current work was in 2024.

              Here's the 2024 rework of Windows Mutex: https://github.com/rust-lang/rust/pull/121956/

              Edited to add: Since we're going back and forth, I read the blame log; the Linux implementation, with which I'm most familiar, is still 85% Mara's, although with some meaningful changes from 2024.

    • the-rc 2 days ago |
      Burrows is also responsible for the Burrows Wheeler Transform, Bigtable, Dapper and Chubby, among others.
    • loeg 2 days ago |
      > I'm curious why [Abseil's] mutex implementation is absent from the author's benchmarks.

      Possibly because it's C++ (as opposed to C)? I am speculating.

    • BryantD 2 days ago |
      Mike was a legend by the time I got to AV. The myth was that any time the search engine needed to be faster, he came in and rewrote a few core functions and went back to whatever else he was doing. Might be true, I just can't verify it personally. Extremely smart engineer who cares about efficiency.

      We did not, however, run on one server for any length of time.

    • diedyesterday 20 hours ago |
      He seems to have won an ACM award too, and there's a (possible) picture of him there. https://awards.acm.org/award-recipients/burrows_9434147
  • rnrn 2 days ago |
    Is Cosmopolitan's mutex also less fair than the other implementations in the comparison?
    • cogman10 2 days ago |
      It's not fair, but (from the description) it does make some effort to be fair-ish. It'll queue up waiters in a linked list that is fairly fair, but newcomers can jump ahead of the line and win the CAS before the list is processed.

      However, it has added logic to start chunking through the queue after a waiter has been woken 30 times. With that, waiting isn't indefinite, but it also isn't fair.
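
      A toy sketch of that shape (C++20; `BargingLock` is made up and this is not nsync's actual code): new arrivals can CAS the lock word and barge past sleepers, and the wakeup counter marks where an anti-starvation handoff would kick in.

          #include <atomic>

          class BargingLock {
              std::atomic<int> word_{0};  // 0 unlocked, 1 locked, 2 locked + waiters
          public:
              void lock() {
                  int expected = 0;
                  if (word_.compare_exchange_strong(expected, 1,
                                                    std::memory_order_acquire))
                      return;  // barged straight in, ahead of any queued waiters
                  int wakeups = 0;
                  while (word_.exchange(2, std::memory_order_acquire) != 0) {
                      word_.wait(2);  // sleep until an unlocker wakes us
                      ++wakeups;      // after ~30 of these, the scheme described
                                      // above switches to direct handoff so this
                                      // waiter can't be barged past forever
                  }
              }
              void unlock() {
                  if (word_.exchange(0, std::memory_order_release) == 2)
                      word_.notify_one();  // someone may be asleep: wake one
              }
          };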

      I have no idea how that compares to other implementations' fairness. I know the JDK recently abandoned fairness because of the overhead and complexity it added around mutex handling.

      • tialaramex 2 days ago |
        Fairness is known to cause serious problems such as convoys, so while "unfair" isn't itself a desirable property, you might choose to be unfair because the alternatives are worse for some applications.

        Probably a specifically Fair version of any concurrent primitive which can be fair is worth giving a distinct name, the way you'd offer both a stable and an unstable sort, knowing that the unstable sort will often (though not always) be faster but some people cannot tolerate an unstable sort so it's useless for them.

        On the other hand, maybe it's an attractive nuisance and if you offered this you'd find most users took your FairMutex, and then were angry because of the inevitable problems from fairness, even though the unfair one was right there...

        • greiskul 2 days ago |
          In a lot of cases, programmers don't even care about fairness, but do care about starvation. Is there a word for structures like the one discussed here, that are unfair but appear to prevent unlimited starvation?
          • DSMan195276 2 days ago |
            I don't think this lock is _guaranteed_ to prevent starvation, it just makes an effort at it. There are only two priority levels and a hard-coded 30 wake-ups required to enter high priority - if waiters were continually joining, then there could always be more than one entry in high priority and an entry could get stuck there forever. Typically it won't matter, but if you have high contention then this might not be good enough.
    • pizlonator 2 days ago |
      Most fast locks these days are unfair. It turns out the perf advantage of unfairness is so massive that it’s just Better (TM).
    • worstspotgain 2 days ago |
      All primitive performance evaluations should be multidimensional, IMO, with fairness being one of the highest-priority variables.

      The importance of fairness varies with the algorithm. For instance, one that alternates between being CPU-bound and disk-bound may use N_workers >>> N_cores. These workers might not care about individual starvation at all.

      At the other end, consider a strictly CPU-bound algorithm with N_workers==N_cores, and a rogue-ish worker that ends up hogging an important mutex by spinning around it (e.g. the one written by the author of the benchmark.) If the other well-behaved workers briefly lock the mutex once prior to each job, but end up getting woken up only once every 30 locks (as this implementation apparently does,) you might end up with core utilization <<< 100%.

      This in turn means that APIs that let you customize the behavior can win out even if they're not #1 on any benchmark.

  • tombert 2 days ago |
    I have to admit that I have an extremely visceral, negative feeling whenever I see a mutex, simply because I've had to debug enough code written by engineers who don't really know how to use them; a large part of my previous jobs has been to remove locks from code and replace them with some kind of queue or messaging abstraction [1].

    It's only recently that I've been actively looking into different locking algorithms, just because I've been diving in head-first to a lot of pure concurrency and distributed computing theory, a lot of which is about figuring out clever ways of doing mutexes with different tradeoffs.

    I've gotten a lot better with them now, and while I still personally will gravitate towards messaging-based concurrency over locks, I do feel the need to start playing with some of the more efficient locking tools in C, like nsync (mentioned in this article).

    [1] Before you give me shit over this, generally the replacement code runs at roughly the same speed, and I at least personally think that it's easier to reason about.

    • aidenn0 2 days ago |
      I feel similarly about C's "volatile" (when used in multithreaded code rather than device drivers). I've seen people scatter volatile around randomly until the problem goes away. Given that volatile significantly disturbs the timing of a program, any timing-sensitive bugs can be masked by adding it around randomly.
      • cogman10 2 days ago |
        There seems to be a lot of voodoo beliefs around concurrent programming that lead to really bad things.

        One of the best books I've read on it is Java Concurrency in Practice [1]. It does an excellent job of dispelling these occult beliefs and letting the reader know exactly when and how concurrency should be implemented. It is applicable to more languages than just Java, especially since many have adopted large parts of the Java memory model.

        The worst things I usually find when reviewing concurrent code are people not using locks when they should, using locks when they shouldn't, and having inconsistent data guards. I've seen people throw in random locks to guard local non-shared state, which is just crazy town: "Multiple threads are running this code, so I'm adding a lock".

        I certainly prefer message passing over shared state. However, it's a little baffling to me why it's so hard for devs to grasp how to properly maintain shared state. Instead of just learning the basic rules, it gets couched in "It's just too hard to understand so keep adding things until it works".

        [1] https://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz...

        • LegionMammal978 2 days ago |
          > However, it's a little baffling to me why it's so hard for devs to grasp how to properly maintain shared state. Instead of just learning the basic rules, it gets couched in "It's just too hard to understand so keep adding things until it works".

          Probably because most people aren't aware that there are basic rules to be learned. I'd imagine the typical experience is, you're very familiar with single-threaded code, and now you're trying to let other threads work with your data. You have heard that there are many pitfalls, and that there are special-purpose tools like mutexes to avoid those, but you look at the examples and find them mostly baffling. "Why do they perform these incantations for this data but not that data, or in this place but not that place?" So you come up with some weird mental model and move on with your life, never aware that there are underlying principles for maintaining shared state.

          Personally, I didn't understand mutexes very well at all, until I started looking into what the atomic memory orderings from C++ et al. were supposed to mean.

          • Maxatar 2 days ago |
            Not too sure what the basic rules are and I'm not able to find any list of such rules.

            For me the biggest challenge when sharing state is that the only benefit I can see for parallelism is performance, so if I'm not gaining performance there is no reason to use parallelism. If I use coarse-grained mutexes then I end up with straightforward-to-reason-about code, but I lose the performance benefit and in fact can end up with slower-than-single-threaded code.

            If I use very fine grained mutexes then I end up with faster code that has very hard to find bugs that happen on very rare occasion.

            And then on top of that even if you do write correct fine grained locking, you can still end up with slow code due to cache behavior such as false sharing and cache coherence.

            So ultimately I disagree that writing parallel code is simple unless you're willing to give up performance in which case you may as well just stick to single threaded code or use parallelism among independent data. Writing correct parallel software that shares state and actually delivers substantial performance benefits is incredibly difficult, and I am skeptical that there is a set of simple rules that one can simply read about.

            • tialaramex 2 days ago |
              > Not too sure what the basic rules are and I'm not able to find any list of such rules.

              The actual rules are completely terrifying because they involve the physics of microprocessors. If you've watched Grace Hopper's lectures where she gives out physical nanoseconds (pieces of wire that are the same length as the distance light travels in a nanosecond, thus, the maximum possible distance data could travel in that time) you can start to appreciate the problem. It is literally impossible for the intuitive Sequentially Consistent model of how computers work to apply for today's fast yet concurrent processors. Light is too slow.

              However generally people mean either Java's memory model or the C++ 11 (and subsequently 14, 17, 20) memory models used in languages such as C++, C and Rust. Those rules are less terrifying but still pretty complicated and the programming language promises to somehow provide an environment where these rules (not the terrifying ones) are all you need to know to write software. So that's nice.

              It can be simple to write parallel code for a language designed to make that easy. Yes, even if there's shared data. It only starts to get trickier if the shared data is modified; so long as it isn't, we can make copies of it safely, and modern CPUs will do that without actual work by the programmer.

              • cogman10 2 days ago |
                Are there popular languages that don't have memory models which make reasoning about concurrent code easier?

                A language with a notion of threading and shared state is going to have something akin to read/write barriers built into the language memory model to tame the beast.

                • jcranmer 2 days ago |
                  I think tialaramex is overselling the complexity of concurrent memory models in practice, at least for end users. In reality, all modern memory models are based on the data-race-free theorem, which states that in the absence of data races--if your program is correctly synchronized--you can't tell that the hardware isn't sequentially consistent (i.e., what you naïvely expected it to do).

                  Correct synchronization is based on the happens-before relation; a data race is defined as a write and a conflicting read or write such that neither happens-before the other. Within a thread, happens-before is just regular program order. Across threads, the main happens-before edge that is relevant is that a release-store on a memory location happens-before an acquire-load on that memory location (this can be generalized to any memory location if they're both sequentially-consistent, but that's usually not necessary).
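
                  A minimal illustration of that rule (a hedged sketch, not any particular library's code):

                      #include <atomic>
                      #include <cassert>
                      #include <thread>

                      int data = 0;                   // plain, non-atomic shared data
                      std::atomic<bool> ready{false};

                      int main() {
                          std::thread producer([] {
                              data = 42;  // happens-before the release-store below
                              ready.store(true, std::memory_order_release);
                          });
                          std::thread consumer([] {
                              while (!ready.load(std::memory_order_acquire))
                                  ;  // spin until the acquire-load sees the release-store
                              assert(data == 42);  // then the write to data is visible: no race
                          });
                          producer.join();
                          consumer.join();
                      }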

                  The real cardinal rule of concurrent programming is to express your semantics at the highest possible level of what you're trying to do, and find some library that does all the nitty-gritty of the implementation. Can you express it with fork-join parallelism? Cool, use your standard library's implementation of fork-join and just don't care about it otherwise.

                • xxpor 2 days ago |
                  C?
                  • tialaramex 2 days ago |
                    C has the same model as C++ from the same era, so C11 is the C++ 11 model, C23 is C++ 20 and so on.

                    It's C so you don't get a comprehensive set of bells, whistles and horns like the C++ standard library, but the actual model is the same. At a high level it's all the same as C++ 11, the details are not important to most people.

            • zozbot234 2 days ago |
              > Not too sure what the basic rules are and I'm not able to find any list of such rules.

              You may want to consider https://marabos.nl/atomics/ for an approachable overview that's still quite rigorous.

            • cogman10 2 days ago |
              > Not too sure what the basic rules are and I'm not able to find any list of such rules.

              I'd suggest the book in my original comment, Java concurrency in practice.

              > If I use very fine grained mutexes then I end up with faster code that has very hard to find bugs that happen on very rare occasion.

              I agree this is a real risk if you are doing fine-grained mutexes. But the rules are the same whether or not you want to follow them. If you have shared state (A, B, C) and you want to do a calculation based on the values of (A, B, C), then you need a mutex which locks (A, B, C). Certainly, that becomes a problem if you have calculations that just require (A, C) and you might want to avoid locking for B. In that case, you need a more complicated locking mechanism than simple mutexes, which is certainly easy to get wrong. When the (A, B, C) actions happen, you have to ensure that the (A, C) actions can't happen at the same time.

              This isn't a complicated rule, but it is one that can be hard to follow if you are trying to do super fine grained locking. It's even trickier if you are going to abuse the platform to get correct results.
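
              A small sketch of that rule in C++ (the struct and names are invented for illustration):

                  #include <mutex>

                  struct Shared {
                      std::mutex m;  // guards a, b, and c as one unit
                      int a = 0, b = 0, c = 0;
                  };

                  int combined(Shared& s) {
                      std::lock_guard<std::mutex> g(s.m);  // one lock for the whole read
                      return s.a + s.b + s.c;  // consistent snapshot of all three
                  }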

              But fine vs coarse isn't the problem I'm referring to when I say people get the simple rules wrong. Rather than worrying about fine- vs coarse-grained locking, I very frequently see code where mutexes and concurrency primitives are just peppered everywhere, haphazardly. We might call that super coarse-grained.

            • SJC_Hacker 2 days ago |
              > For me the biggest challenge when sharing state is that the only benefit I can see for parallelism is performance, so if I'm not gaining performance there is no reason to use parallelism.

              Aside from performance, another very common reason is to not lock the UI from the user. Even in UI-less programs, there's the ability to abort some operation which is taking too long. Another is averaging out the performance of compute tasks, even in the case where it would be faster to handle them sequentially. Without some degree of parallelism these things are not possible.

              Consider a web server. Without parallelism every single request is going to completely lock the program until it's complete. With parallelism, you can spawn off each request, and handle new ones as they come in. Perceived performance for the majority of users in this case is significantly improved even in the case of a single-processor system - e.g. you have 99 requests which each take a single second, and then one which takes 101 seconds. Total request time is 200 seconds / 100 requests = 2 seconds average per request, but if that 101-second request comes in first, the other 99 are blocked behind it for 101 seconds, so the average is now > 100 seconds per request ...

              • Maxatar 2 days ago |
                >Aside from performance, another very common reason is to not lock the UI from the user.

                This is not a good fit for parallelism, this is pretty much always accomplished using concurrency ie. async/await.

                • int_19h 2 days ago |
                  Assuming that the APIs & libraries that you need are async. Which is, unfortunately, not always the case for historical reasons.
        • f1shy 2 days ago |
          Maybe because I had a full semester of multiprogramming at university, I find it almost trivial to work in such environments, and I can't comprehend why there's so much mysticism and voodoo around it. It's actually pretty simple.
          • tombert 2 days ago |
            I feel like it's not terribly hard to write something that more or less works using mutexes and the like, but I find it exceedingly hard to debug. You're at the mercy of timing and the scheduler, meaning that often just throwing a breakpoint and stepping through isn't as easy as it would be with a sequential program.

            I feel like with a queue or messaging abstraction, it can be easier to debug. Generally your actual work is being done on a single thread, meaning that traditional debugging tools work fine, and as I've said in sibling comments, I also just think it's easier to reason about what's going on.

        • tombert 2 days ago |
          +1 for the Java Concurrency in Practice book. It's the book I recommend to nearly everyone who wants to get into concurrent programming. Goetz makes it a lot more approachable than most other books.
          • hinkley 2 days ago |
            Goetz has come a long way. I knew one of the people who contributed to that book, and he was a little frustrated about having to explain things to Goetz that he felt he shouldn't have had to. The implication was he'd already had this conversation with some of the other contributors.

            Sometimes though, the newbie is going to write the clearest documentation.

        • hinkley 2 days ago |
          I loved concurrent code when I was starting out. I’d taken a pretty good distributed computing class which started the ball rolling. They just fit into how my brain worked very well.

          Then I had to explain my code to other devs, either before or after they broke it, and over and over I got the message that I was being too clever. I’ve been writing Grug-brained concurrent code for so long I’m not sure I can still do the fancy shit anymore, but I’m okay with that. In fact I know I implemented multiple reader single writer at least a few times and that came back to me during this thread but I still can’t remember how I implemented it.

          • tombert 2 days ago |
            That's something I'm afraid of for my latest project. I did some concurrent stuff that wasn't 100% clear would actually work, and I had to write a PlusCal spec to exhaustively prove to myself that what I was doing is actually OK.

            It works pretty well, and I'm getting decent speeds, but I'm really scared someone is going to come and "fix" all my code by doing it the "normal" way, and thus slow everything down. I've been trying to comment the hell out of everything, and I've shared the PlusCal spec, but no one else on my team knows PlusCal and I feel like most engineers don't actually read comments, so I think it's an inevitability that my baby is killed.

      • tialaramex 2 days ago |
        In most cases (in a C or C++ compiler, not Java) it's just straight up incorrect to use volatile for something other than memory mapped I/O. Yes, POSIX guarantees that in a specific case (signal handling IIRC) it'll do what you meant if you use volatile int. That's nice, but more generally this is not the right choice.

        Unfortunately Microsoft enshrined the situation (on Windows, on their compiler, on x86 and x86-64 only) that volatile primitive types are effectively atomics with Acquire-Release ordering. This is of course awkward when Microsoft tries to bring people to a non-x86 architecture and it can't just give them this because it would suck really hard, so finally they have to grow up and teach their developers about actual atomics.

        !! Edited to fix: Previously this said Relaxed ordering, the ordering guaranteed by Microsoft is in fact Acquire-Release, hence it's expensive to provide for architectures where that's not the default.

        • hinkley 2 days ago |
          When Java implemented volatile it didn't do anything. Later, when they fixed the memory model to deal with out-of-order execution, they made it part of the fence semantics, and then it actually made some sense.
      • zozbot234 2 days ago |
        The "volatile" keyword should never be used for C/C++ multithreaded code. It's specifically intended for access to device-mapped addresses and does not account for any specific memory model, so using it for multithreading will lead to breakage. Please use the C/C++ memory model facilities instead.

        (As a contrast, note that in Java the "volatile" keyword can be used for multithreading, but again this does not apply to C/C++.)

        • hinkley 2 days ago |
          I’m surprised that’s true. C borrowed very heavily from Java when fixing the NUMA situations that were creeping into modern processors.
          • jcranmer 2 days ago |
            The C/C++ memory model is directly derived from the Java 5 memory model. However, the decision was made that volatile in C/C++ specifically referred to memory-mapped I/O stuff, and the extra machinery needed to effect the sequential consistency guarantees was undesirable. As a result, what is volatile in Java is _Atomic in C and std::atomic in C++.

            C/C++ also went further and adopted a few different notions of atomic variables, so you can choose between a sequentially-consistent atomic variable, a release/acquire atomic variable, a release/consume atomic variable (which ended up going unimplemented for reasons), and a fully relaxed atomic variable (whose specification turned out to be unexpectedly tortuous).

            • tialaramex 2 days ago |
              Importantly, these aren't types; they're operations.

              So it's not that you have a "release/acquire atomic variable"; rather, you have an atomic variable, and it so happens you choose to do a Release store to that variable. In other code maybe you do a Relaxed fetch from the same variable; elsewhere you have a compare-exchange with different ordering rules.

              Since we're talking about Mutex here, here's the entirety of Rust's "try_lock" for Mutex on a Linux-like platform:

                      self.futex.compare_exchange(UNLOCKED, LOCKED, Acquire, Relaxed).is_ok()
              
              That's a single atomic operation: we hope the futex is UNLOCKED; if it is, we store LOCKED to it with Acquire ordering, but if it wasn't, we use a Relaxed load to find out what it was instead of UNLOCKED.

              We actually don't do anything with that load, but the Ordering for both operations is specified here, not when the variable was typed.

        • aidenn0 2 days ago |
          > Please use the C/C++ memory model facilities instead

          I should point out that for more than half of my professional career, those facilities did not exist, so volatile was the most portable way of implementing e.g. a spinlock without the compiler optimizing away the check. There was a period, after compilers started aggressively inlining and before C11 came out, in which it could be quite hard to otherwise convince a compiler that a value might change.

          • int_19h 2 days ago |
            The problem is that volatile alone never portably guaranteed atomicity nor barriers, so such a spinlock would simply not work correctly on many architectures: other writes around it might be reordered in a way that make the lock useless.

            It does kinda sorta work on x86 due its much-stronger-than-usual guarantees wrt move instructions even in the absence of explicit barriers. And because x86 was so dominant, people could get away with that for a while in "portable" code (which wasn't really portable).
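
            A contrast sketch in C++ (the volatile version is the pre-C11 idiom being described, shown here deliberately broken):

                #include <atomic>

                volatile int v_lock = 0;  // compiler won't cache reads of this...

                void broken_lock() {
                    while (v_lock != 0) { }  // ...but two threads can both see 0 here
                    v_lock = 1;  // and both "acquire" the lock: the RMW isn't atomic
                }

                std::atomic_flag a_lock = ATOMIC_FLAG_INIT;

                void working_lock() {
                    while (a_lock.test_and_set(std::memory_order_acquire))
                        ;  // atomic RMW plus acquire ordering: a real (naive) spinlock
                }

                void working_unlock() {
                    a_lock.clear(std::memory_order_release);
                }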

            • aidenn0 a day ago |
              There's a lot to unpack here.

              TL;DR: The compiler can reorder memory accesses and the CPU can reorder memory accesses. With a few notable exceptions, you usually don't have to worry about the latter on non-SMP systems, and volatile does address the former.

              The volatile qualifier makes any reads or writes to that object a side-effect. This means that the compiler is not free to reorder or eliminate the accesses with respect to other side-effects.

              If you have all 3 of:

              A) A type that compiles down to a single memory access

              B) within the same MMU mapping (e.g. a process)

              C) With a single CPU accessing the memory (e.g. a non-SMP system)

              Then volatile accomplishes the goal of read/writes to a shared value across multiple threads being visible. This is because modern CPUs don't have any hardware concept of threads; it's just an interrupt that happens to change the PC and stack pointer.

              If you don't have (A) then even with atomics and barriers you are in trouble and you need a mutex for proper modifications.

              If you don't have (B) then you may need to manage the caches (e.g. ARMv5 has virtually tagged caches so the same physical address can be in two different cache lines)

              If you don't have (C) (e.g. an SMP system) then you need to do something architecture specific[1]. Prior to C language support for barriers that usually means a CPU intrinsic, inline assembly, or just writing your shared accesses in assembly and calling them as functions.

              Something else I think you are referring to is that if you have two shared values and only one is volatile, then the access to the other can be freely reordered by the compiler. This is true. It is also often masked by the fact that shared values are usually globals, and non-inlined functions are assumed by most compilers to be capable of writing to any global, so a function call will accidentally become a barrier.

              1: As you mention, on the x86 that "something" is often "nothing." But most other architectures don't work that way.

      • senderista a day ago |
        If you only use volatile in C without any atomic operations or fences, then your multithreaded code is certainly incorrect.
    • jart 2 days ago |
      What are some examples of people using mutexes wrong? I know one gotcha is you need to maintain a consistent hierarchy. Usually the easiest way to not get snagged by that is to have critical sections be small and pure. Java's whole MO of letting people add a synchronized keyword to an entire method was probably not the greatest idea.
      • cogman10 2 days ago |
        When, how, and why.

        The biggest part of mutexes and how to properly use them is thinking of the consistency of the data that you are working with.

        Here's a really common bug (pseudocode)

            if (lock {data.size()} > 0) {
              value = lock { data.pop() }
              lock { foo.add(value) }
            }
        
        The issue here is size can change, pop can change, and foo can change in unexpected ways between each of the acquired locks.

        The right way to write this code is

            lock {
              if (data.size() > 0) {
                value = data.pop()
                foo.add(value)
              }
            }
        
        That ensures the data is all in a consistent state while you are mutating it.

        Now, what does make this tricky is someone well-meaning might have decided to push the lock down a method.

        Imagine, for example, you have a `Foo` where all methods operate within a mutex.

        This code is also (likely) incorrect.

            value = foo.bar()
            if (value.bat()) {
              foo.baz(value)
            }
        
        The problem here is exactly the same problem as above. Between `foo.bar()` and `foo.baz()` the state of foo may have changed such that running `foo.baz(value)` is now a mistake. That's why the right thing to do is likely to have a `foo.barBaz()` method that encapsulates the `if` logic to avoid inconsistency (or to add another mutex).

        In Java, the most common manifestation (that I see) of this looks like this

            var map = new ConcurrentHashMap();
            if (map.get(foo) == null)
              map.put(foo, new Value());
        
        Because now you have a situation where two threads can each see no entry for `foo` and insert their own `Value`, so different threads end up holding different instances depending on who gets there first. So, if someone is mutating `Value` concurrently, you have a weird, hard-to-track-down data race.

        The solution to this problem in java is

            map.computeIfAbsent(foo, (unused)->new Value());
        • hinkley 2 days ago |
          Composing locks is where Java usually blows up.

          And computeIfAbsent can end up holding the lock for too long if the function is slow.

          • foobazgt 2 days ago |
            Composing locks isn't a Java problem - it's a fundamental abstraction problem with locks. This is one of the reasons why you usually reach for higher level abstractions than mutexes.

            > And computeIfAbsent can end up holding the lock for too long if the function is slow.

            How is this different from any other lock-holding code written anywhere?

            • hinkley 2 days ago |
              I’m saying Java is exceptionally bad at this because every object is its own mutex.

              And you end up having to trade single-core performance for multi-core by deciding to speculatively calculate the object. If there's no object to make, the critical section is very small. But as the object sprouts features, you start smashing face-first into Amdahl.

              • foobazgt 2 days ago |
                > because every object is its own mutex.

                Not true in any practical sense.

                > And you end up having to trade single core performance for multi core by deciding to speculatively calculate the object.

                What is the alternative you suggest? If you care about having the predicate actually hold, and you also don't want to have to hold the lock while constructing the object, then you're going to end up in an optimistic-concurrency scenario where you check the predicate under lock, compute the object, and check again before swapping the value in. You may end up having to throw your work away when you discover the predicate changed. Java is no better nor worse at doing this than anything else.

                • hinkley 2 days ago |
                  > Not true in any practical sense.

                  This is going to put a damper on any further conversation.

                  Even with coarsening and elision, every synchronized function takes a lock on the enclosing object.

                  • int_19h 2 days ago |
                    Do people actually use `synchronized` methods in Java these days? It's been commonly described as an anti-pattern (for all the reasons discussed upthread here) two decades ago already.
                    • hinkley 2 days ago |
                      The more useful question is whether it has been expunged from the JDK and common libraries. I think it's been more like 10-12 years since it really started being talked about outside of certain subcommunities, and that's almost 20 years' worth of existing libraries.

                      OpenTelemetry is a fairly recent library. Even if you ignore some test fakes (where, let's face it, who cares), it still uses it in a few places, and uses lock objects in others. I don't see much evidence of recursion going on with the former. But that's how things always start and later there's running and screaming.

                      • int_19h 2 days ago |
                        Some amount of legacy cruft is not unexpected, but it's sad that it can be seen in new code. In .NET, which has similarly problematic semantics with lock(), linters have been flagging lock(this) for ages.

                        I wonder where this patently bad idea of every object carrying its own publicly accessible mutex originated in the first place. Did Java introduce it, or did it also copy that from somewhere else? And what was the motivation?

                        • neonsunset 2 days ago |
                          Can't attest to the history of the `lock` statement off the top of my head, but the API shape of lock and the Monitor.Enter/Exit methods it is desugared to looks like Win32's EnterCriticalSection and LeaveCriticalSection. Other Monitor methods like Wait and Pulse look like pthread's condvar and mutex functions.

                          .NET also has MethodImplOptions.Synchronized like Java does. However, the only place I have ever seen this attribute was on TextWriter.Synchronized implementation in CoreLib and nowhere else.

                          Java itself has `Lock` and `Condition`. In the end, most synchronization primitives do the same high-level actions and bound to end up having similar API.

                          As for `lock(this)`, much like with many other historically abused techniques that have become frowned upon - it's not bad per se if you own the type and know that it is internal and will not be observed outside of the assembly it is defined in, provided it is small enough. It's footgun-prone, but generally very few code paths will lock an arbitrary object instance at all, so most of the time it's something you see so rarely it has become "just write a comment why and move on" when using it. Of course this requires more deliberation and it's easier to default to blanket policies that ignore context. It can be difficult to get people to "use the appropriate tool" mentality.

                          .NET is also getting its own separate `Lock` type, on top of all the existing synchronization primitives, to move a little further away from other legacy aspects of `lock`ing on object instances.

                          • int_19h 2 days ago |
                            It's not Monitor itself that's problematic. It's that every object is implicitly associated with one, and anyone who holds a reference to an object can lock it. It doesn't matter if the type is internal - it can still be upcast to System.Object and leaked that way.

                            In practice this means that unless you can guarantee that you never, ever leak a reference anywhere, you don't know who else might be locking it. Which makes it impossible to reason about possible deadlocks. So the only sane way to manage it is to have a separate object used just for locking, which is never ever passed outside of the object that owns the lock.

                            And yes, this is absolutely bad design. There's no reason why every object needs a lock, for starters - for the vast majority of them, it's just unnecessary overhead (and yes, I know the monitors are lazily created, but every object header still needs space to store the reference to it). Then of course the fact that it's there means that people take the easy path and just lock objects directly instead of creating separate locks, just because it's slightly less code - and then things break. It's almost always the wrong granularity, too.

                            Thing is, I haven't seen this design anywhere outside of Java and .NET (which copied it from Java along with so many other bad ideas). Everybody else uses the sane and obvious approach of creating locks explicitly if and when they are needed.

                        • senderista a day ago |
                          Monitors came from Tony Hoare in the 70s and Java put an OO spin on them.
                  • foobazgt 2 days ago |
                    "every synchronized function"

                    Right. Synchronized is the key word here. The vast majority of code doesn't involve synchronized, and therefore the vast majority of objects don't have locks associated with them. That's quite important.

                    Those classes which do use synchronized were just going to create a ReentrantLock held for the duration of the call anyway, in which case it's all monitorEnter and monitorExit, regardless.

                    > This is going to put a damper on any further conversation.

                    Sadge.

                    • foobazgt a day ago |
                      > in which case it's all monitorEnter and monitorExit, regardless.

                      Oops, I need to correct myself!

                      ReentrantLock doesn't depend upon monitorEnter/Exit, but rather AbstractQueuedSynchronizer and LockSupport, which ultimately delegate to Unsafe methods like park/unpark and CAS (*compareAndSet*). Don't know why I had that confused in my head.

                      In any case, the point holds that "synchronized" as a language feature has mostly a zero cost for code that doesn't use it. It's a red herring when discussing modern Java concurrency.

        • another-acct 2 days ago |
          Might want to move foo.add() out of the lock scope (assuming foo is a thread-private resource):

              value = nil
              lock {
                if (data.size() > 0) {
                  value = data.pop()
                }
              }
              if (value) {
                  foo.add(value)
              }
        • samatman 2 days ago |
          This is one of the areas where Zig's combination of anonymous blocks and block-based defer really pays off. To create a locked region of code is just this:

              {
                  mutex.lock();
                  defer mutex.unlock();
                  // Do mutex things
              }
          
          It's possible to get this wrong still, of course, but both the anonymous scope and the use of `defer` make it easier to get things right.

          Nothing can prevent poor engineering around mutex use though. I'd want a critical path for a concurrent hashmap to look like this:

              {
                  shared_map.lock();
                  defer shared_map.unlock();
                  if (shared_map.getOrNull(foo) == null) {
                      shared_map.put(foo, new_val);
                  }
              }
          
          Where the SharedMap type has an internal mutex, and a way to check it, and all operations panic if no lock has been acquired. It could have `shared_map.lockAndGet(OrNull?)(...)`, so that the kind of problem pattern you're describing would stand out on the page, but it's still a one-liner to do an atomic get when that's all you need the critical path to perform.

          I don't think these invariants are overly onerous to uphold, but one does have to understand that they're a hard requirement.

          • int_19h 2 days ago |
            This doesn't seem to add anything over and above what std::mutex in C++ or a synchronized block in Java offer?
            • senderista a day ago |
              Less than C++. defer() is strictly inferior to RAII.
        • jhatemyjob 2 days ago |
          I digress, but my autistic brain couldn't help itself. Provided that it's a recursive lock, you could do this instead of adding a new method `foo.barBaz`:

              foo.lock {
                  value = foo.bar() // foo.lock within this method is ignored
                  if(value.bat()) {
                      foo.baz(value) // foo.lock within this method is ignored
                  }
              }
          
          Also, to catch this bug early, you could assert foo is locked in `value.bat` or something. But that may or may not be feasible depending on how the codebase is structured
      • tombert 2 days ago |
        Personally I've had issues with performance because of people using `synchronized` too liberally, where they end up locking a lot more code than necessary. I've also had issues with fairly typical circular dependencies causing deadlock, or at least pauses that aren't strictly necessary. Deadlock doesn't happen nearly as often as textbooks have led me to believe, but it can happen with sloppily written code.

        In regards to Java, at this point I almost never use the `synchronized` keyword anymore and instead (if I can't easily map to some kind of queuing abstraction) use the ReentrantLock object simply because of the ability to have lock acquisition time out, and also letting you opt-in to fairness if you'd like. It's not as pretty but it's more flexible and as far as I'm aware it doesn't affect performance much.

        For the most part, though, in Java, you can get away without (explicit) locks by simply abusing the built-in data structures. I know they're using their own synchronization techniques behind the scenes, but I trust those to be correct more than some ad-hoc stuff I'd write as an engineer.

      • foobazgt 2 days ago |
        Java's take on monitors was definitely not great, and people were emulating mutexes with them even in the language's earliest days.

        Still there are a lot of things that can go wrong with mutexes: forgetting to unlock in the case of exceptions, priority inversion, recursive locking, deadlock, needlessly high contention, etc.

        Java has had an excellent concurrency runtime with abstractions that are typically a better fit than a bare mutex for over 20 years now (c.f. Doug Lea). Synchronized still exists, because of Java's excellent backwards compatibility.

      • foobiekr 2 days ago |
      I've always disliked that cyclic lock dependencies are discussed as a hierarchy when what it really comes down to is a linear order of locks.

        The problem with lock _hierarchies_ as a concept is that a lock really should represent serialization of access to a particular pool of data, and should make no assumptions that it being held implies some other lock's domain is also held. The code that results when people do not maintain this kind of rigor is quite terrible, but hierarchies tend to steer people into thinking that way because they imply recursively taking locks.

        Stated differently: locks should be taken and released in a fixed order - so locks are ranked - but there should not be a model where all lower-ranked locks must be held for a given lock to be taken. The lock protects its domain and the ordering of take and release is to prevent deadlock, but there's no requirement for completeness.
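
      A minimal sketch of that ranking idea in C++ (names invented; the point is only the fixed acquisition order):

          #include <mutex>

          std::mutex rank1_accounts;   // rank 1: always acquired first
          std::mutex rank2_audit_log;  // rank 2: acquired alone, or after rank 1

          void transfer_and_log() {
              std::lock_guard<std::mutex> g1(rank1_accounts);   // rank 1 first
              std::lock_guard<std::mutex> g2(rank2_audit_log);  // then rank 2
              // ... mutate both domains; no thread ever takes these in the
              // opposite order, so no deadlock cycle can form ...
          }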

    • bob1029 2 days ago |
      > some kind of queue or messaging abstraction

      Agreed. I find things like LMAX Disruptor much easier to reason about.

      • tombert 2 days ago |
        Even within Java, something like BlockingQueue will get you pretty far, and that's built into the runtime.

        If I am allowed to use libraries, I end up using Vert.x for nearly everything. I think that their eventbus abstraction is easy enough to reason about, and even without using it simply using the non-blocking stuff it provides ends up being pretty handy.

    • jeffbee 2 days ago |
      Message passing is just outsourcing the lock, right? For example a Go channel is internally synchronized, nothing magic about it.

      Most of the mutex tragedies I have seen in my career have been in C, a useless language without effective scopes. In C++ it's pretty easy to use a scoped lock. In fact I'd say I have had more trouble with people who are trying to avoid locks than with people who use them. The avoiders either think their program order is obviously correct (totally wrong on modern CPUs) or that their atomics are faster (wrong again on many CPUs).

      • loeg 2 days ago |
        > Message passing is just outsourcing the lock, right?

        More or less, yeah. You can write an MPSC queue that doesn't explicitly use a lock (or even anything that looks like a lock).

      • tombert 2 days ago |
        It's definitely doing synchronization behind the scenes, no argument here. BlockingQueues in Java seem to use ReentrantLocks everywhere. It's outsourcing the lock to people who understand locks better.

        It just abstracts this detail away for me, and I personally trust the libraries implementing these abstractions to be more correct than some ad hoc thing I write. It's an abstraction that I personally find a lot easier to reason about, and so my thinking is this: if my reasoning is more likely to be correct because of the easier abstraction, and the internal synchronization is more likely to be correct, then it's more likely that my code will be correct.

        I don't do super low-level stuff at all, most of my stuff ends up touching a network, so the small differences between the built-in synchronized structures vs the regular ones really don't matter since any small gains I'd get on that will be eaten the first time I hit the network, so a considerably higher ROI for me is almost always figuring out how to reduce latency.

        If I did C or C++, I'd probably have different opinions on this stuff.

      • foobazgt 2 days ago |
        Every abstraction is about outsourcing the thing it's abstracting away. If using a queue solves your problem, you no longer have to deal with all the headaches that you can run into using a bare mutex.
      • another-acct 2 days ago |
        > C, a useless language without effective scopes

        Mutexes can be handled safely in C. It's "just another flavor" of resource management, which does take quite a bit of discipline. Cascading error paths / exit paths help.

      • sgarland 2 days ago |
        > C, a useless language

        You misspelled “fast as fuck” and “lingua franca of all architectures.”

      • adgjlsfhk1 2 days ago |
        > Message passing is just outsourcing the lock, right?

        Kind of. If you can architect such that each channel has exactly 1 reader and 1 writer, you can send messages in a single direction with no locks. The basic idea is that you have a circular buffer with a start index and an end index. The writer can write an element and increment the end index (as long as the incremented end index wouldn't catch up to the start index, i.e. the buffer isn't full), while the reader can just read an element and increment the start index (as long as the start index hasn't caught up to the end index, i.e. the buffer isn't empty). This strategy needs to use atomic operations (which are basically free when uncontended, which they will be as long as the queue has a few elements in it).
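
        A sketch of that scheme in C++ (names and sizes are illustrative; assumes exactly one producer thread and one consumer thread):

            #include <array>
            #include <atomic>
            #include <cstddef>
            #include <optional>

            template <typename T, size_t N>
            class SpscQueue {
                std::array<T, N> buf_;
                std::atomic<size_t> head_{0};  // next slot to read (consumer-owned)
                std::atomic<size_t> tail_{0};  // next slot to write (producer-owned)
            public:
                bool push(const T& v) {  // producer thread only
                    size_t t = tail_.load(std::memory_order_relaxed);
                    size_t next = (t + 1) % N;
                    if (next == head_.load(std::memory_order_acquire))
                        return false;  // full: writer would catch up to reader
                    buf_[t] = v;
                    tail_.store(next, std::memory_order_release);  // publish element
                    return true;
                }
                std::optional<T> pop() {  // consumer thread only
                    size_t h = head_.load(std::memory_order_relaxed);
                    if (h == tail_.load(std::memory_order_acquire))
                        return std::nullopt;  // empty: reader caught up to writer
                    T v = buf_[h];
                    head_.store((h + 1) % N, std::memory_order_release);
                    return v;
                }
            };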

    • another-acct 2 days ago |
      > remove locks from code and replace with some kind of queue or messaging abstraction

      Shared-nothing message passing reflects the underlying (modern) computer architecture more closely, so I'd call the above a good move. Shared memory / symmetric multiprocessing is an abstraction that leaks like a sieve; it no longer reflects how modern computers are built (multiple levels of CPU caches, cores, sockets, NUMA, etc).

      • tombert 2 days ago |
        That's good to hear! I am pretty removed from underlying hardware now, so it makes me happy to hear that better way of doing things is catching on even in low-level land.
      • gpderetta 2 days ago |
        If you are doing pure shared nothing message passing, you do not need coherent caches; in fact cache coherency gets in the way of pure message passing.

        Vice versa, if you do pure message passing, you are not benefiting from hardware-accelerated cache coherency and are leaving performance (and usability) on the floor.

    • intelVISA 2 days ago |
      Shared-nothing is typically The Right Choice in my experience as well. Maybe the odd atomic...
  • alberth 2 days ago |
    Are there any Linux distros built with/using Cosmo?

    (like Alpine use of musl)

    • gavindean90 2 days ago |
      Not yet
  • forrestthewoods 2 days ago |
    Hrm. Big fan of Justine and their work. However this is probably the least interesting benchmark test case for a Mutex. You should never have a bunch of threads constantly spamming the same mutex. So which mutex implementation best handles this case isn’t particularly interesting, imho.
    • jart 2 days ago |
      What do you consider a good benchmark test case for mutexes?
      • pizlonator 2 days ago |
        Very large multithreaded programs with lots of diverse uses of locks, including:

        - uncontended, mildly contended, and horrifically contended

        - short and long critical sections

        - contention among small numbers of threads and large numbers of threads

        - contention that happens when other locks are held recursively and those are also contended on (like, thread A wants a lock held by thread B, but thread B is trying to get a lock held by thread C)

        Different lock algos work well or badly depending on how the locks are used, so it’s important to pick real programs as your benchmark rather than trying to cook a synthetic benchmark.

    • convolvatron 2 days ago |
      What else would you measure? Certainly the uncontended case is important and a baseline, but otherwise this is kind of the weak point for mutexes - if you don't handle contention well, then you have idle hardware, or lots of additional scheduler work, or kernel crossings.

      [edit - I forgot to even mention one of the most important things: locks that perform poorly under contention can have really negative systemic effects, like hot-spotting the memory network, and that would show up here too]

      • tialaramex 2 days ago |
        Uncontended is crucial. If you want to benchmark other things that's excellent, but if MutexA has crap uncontended performance then I'm on a loser if we pick MutexA unless I am absolutely sure we will have a lot of contention. Since contention is never desirable, that's rare.

        Think of this like the random input case for a sort benchmark. Do I want stuff like all-ascending, all-descending, zig-zag and so on? Sure, those are nice. But without the random input case the benchmark is not very helpful. I might sort a zig-zag, I might sort data that's already in ascending order, but I will sort random data, that's going to happen or else I would not need a sort function.

        • jart 2 days ago |
          Uncontended is uninteresting, because all mutex implementations perform roughly the same here, give or take a nanosecond or two. If you're truly uncontended then a naïve spin lock will actually seem fastest, because xchg is faster than cmpxchg which is needed for good locks.
          • loeg 2 days ago |
            Uh, why do you say a naive spin lock would use xchg instead of cmpxchg? I don't think you could make a valid spinlock using xchg.
            • jart 2 days ago |
              On x86 you can. When xchg is used with a memory operand it locks the bus. This is true even in the absence of a lock prefix. I included a spinlock implementation in the blog post. If you see any errors with it, then please let me know!
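
              For reference, a naive spinlock of the kind being described might look like this in C++ (a sketch, not necessarily the post's code): on x86, `exchange` on a bool compiles down to a plain xchg, which is implicitly locked.

                  #include <atomic>

                  struct Spinlock {
                      std::atomic<bool> locked{false};
                      void lock() {
                          while (locked.exchange(true, std::memory_order_acquire))
                              ;  // spin until we swap in `true` and see `false` back
                      }
                      void unlock() {
                          locked.store(false, std::memory_order_release);
                      }
                  };
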
              • loeg 2 days ago |
                Oh, sure, your 1-bit spinlock with no other state works.
    • cogman10 2 days ago |
      > You should never have a bunch of threads constantly spamming the same mutex.

      I'm not sure I agree with this assessment. I can think of a few cases where you might end up with a bunch of threads contending for the same mutex.

      A simple example would be something like concurrently populating some data structure (list/dict/etc). Yes, you could accomplish this with message passing, but that uses more memory and would be slower than just having everything wait to write to a shared location.

      • kccqzy 2 days ago |
        > would be slower than just having everything wait to write to a shared location

        Nope.

        • cogman10 2 days ago |
          Yup.

          Message passing has allocation pressure and cache-coherency pressure not present when using a shared memory location. Especially as the amount of memory in question goes up, the benefit of a shared location increases in terms of the performance impact.

          Sure, for something silly like writing to an int, there is negative benefit in a shared location, but when you start talking about a dictionary with 1 million entries, the shared location becomes much more of a benefit vs all the copying and allocating you'd have to do if you tried to do the same thing with message passing.

          For some data structures, it's the optimal way to move data around. For example, the LMAX Disruptor is about the fastest way to pass messages because of its shared memory and well-tuned locks.

          • kccqzy 2 days ago |
            You are talking about writes to a data structure such as a list or a dictionary from multiple threads. Nobody uses advanced message passing techniques for that. A list is basically the poster child of why you avoid mutexes: each thread writes to its own sublist, with no use of mutexes, and then at the end of the processing every thread's lists are merged together. Merging a list takes O(1) time by manipulating a few pointers. You avoid mutexes and there's zero drawback. I don't know why you are talking about copying or allocating: a list requires no copying to be merged, and in this case all the allocations still happen over multiple threads so there's no change in allocation pressure. If your allocator is bad, there could be internal mutexes inside the allocator, but that's beside the point.

            With dictionaries, you may have to do a bit of resizing and rehashing at the end. But in my benchmarks, it is still worthwhile.
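            Sketching the pattern with intrusive singly-linked lists (hypothetical types, purely to illustrate the O(1) splice):

                #include <stddef.h>

                struct node { struct node *next; /* payload elided */ };
                struct list { struct node *head, *tail; };

                /* Each thread appends to its own list with no synchronization;
                   the final merge is a pointer splice, O(1) per thread. */
                static void list_splice(struct list *dst, struct list *src) {
                    if (!src->head) return;
                    if (dst->tail) dst->tail->next = src->head;
                    else dst->head = src->head;
                    dst->tail = src->tail;
                    src->head = src->tail = NULL;
                }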

            • cogman10 2 days ago |
              > Merging a list takes O(1) time

              Depends on the list implementation. I assume we are talking about C++ lists? In that case, yeah, I can see how you'd instead do a `splice` of sublists.

              I was thinking more in terms of something like a `vector` in which case adding the sublists in requires copying values from one vector to the source vector. That (especially if done wrong) can involve allocating potentially more than once to drop the results in.

              But, for the record, there are lock-free algorithms that don't require a mutex to add values to a list. You basically CAS the tail pointer with your new value.

              That being said, I concede that for a list a mutex is probably the wrong choice. It's a better choice in Java, which has constraints that prevent (easily) splicing together 2 lists.

              For the dictionary, implementation is key. A good segmented dictionary is going to be hard to beat.
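              For a list used as a stack, the classic Treiber-style push is a short sketch of that idea (CASing the head rather than the tail, which avoids tracking a tail pointer):

                  #include <stdatomic.h>

                  struct node { struct node *next; int value; };

                  static void push(_Atomic(struct node *) *head, struct node *n) {
                      struct node *old =
                          atomic_load_explicit(head, memory_order_relaxed);
                      do {
                          n->next = old; /* link to the current head */
                      } while (!atomic_compare_exchange_weak_explicit(
                                   head, &old, n,
                                   memory_order_release, memory_order_relaxed));
                      /* on failure, `old` is refreshed with the current head */
                  }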

      • forrestthewoods 2 days ago |
        If you’re trying to mutate a dictionary many times from many threads you’re going to have a bad time. The fix isn’t a faster mutex, it’s don’t do that.
        • cogman10 2 days ago |
          Depends on the dictionary implementation. There's a number of thread safe dictionaries in the wild with varying degrees of parallelism performance. Pretty much all of them benefit from faster mutexes.

          For example, some thread safe dictionaries will segment their underlying key/value pairs which allows them to have concurrent reads and writes for a given segment which significantly improves performance.

          • forrestthewoods 2 days ago |
            > Pretty much all of them benefit from faster mutexes.

            Faster doesn't mean fast. You really, really want to not use a thread-safe dictionary.

    • Salgat 2 days ago |
      I'd say the vast majority of cases where I use a lock/semaphore is around very expensive resources, where the utilization of that resource vastly outweighs any performance overhead of the lock.
      • another-acct 2 days ago |
        This is how it should be. IIRC -- apologies, can't find a source -- Ulrich Drepper wrote somewhere about NPTL that its mutexes were not particularly lightweight, but that you should design your program for low contention anyways.

        For highly contended data structures, spinlocks (and nowadays explicit atomics) are likely better.

  • joelthelion 2 days ago |
    Could established libcs adopt this? If not, why not?
    • loeg 2 days ago |
      Maybe. nsync is Apache licensed.
  • pizlonator 2 days ago |
    Always cool to see new mutex implementations and shootouts between them, but I don’t like how this one is benchmarked. Looks like a microbenchmark.

    Most of us who ship fast locks use very large multithreaded programs as our primary way of testing performance. The things that make a mutex fast or slow seem to be different for complex workloads with varied critical section length, varied numbers of threads contending, and varying levels of contention.

    (Source: I wrote the fast locks that WebKit uses, I’m the person who invented the ParkingLot abstraction for lock impls (now also used in Rust and Unreal Engine), and I previously did research on fast locks for Java and have a paper about that.)

    • PaulDavisThe1st 2 days ago |
      To add to this, as the original/lead author of a desktop app that frequently runs with many tens of threads, I'd like to see numbers on performance in non-heavily contended cases. As a real-time (audio) programmer, I am more concerned with (for example) the cost to take the mutex even when it is not already locked (which is the overwhelming situation in our app). Likewise, I want to know the cost of a try-lock operation that will fail, not what happens when N threads are contending.

      Of course, with Cosmopolitan being open source and all, I could do these measurements myself, but still ...

      • ataylor284_ 2 days ago |
        The writeup suggests this implementation is optimized for the not-locked case:

        > uses an optimistic CAS (compare and swap) immediately, so that locking happens quickly when there's no contention

      • pizlonator 2 days ago |
        Totally!

        Pro tip: if you really do know that contention is unlikely, and uncontended acquisition is super important, then it's theoretically impossible to do better than a spinlock.

        Reason: locks that have the ability to put the thread to sleep on a queue must do compare-and-swap (or at least an atomic RMW) on `unlock`. But spinlocks can get away with just doing a store-release (or just a store with a compiler fence on X86) to `unlock`.

        Spinlocks also have excellent throughput under most contention scenarios, though at the cost of power and being unkind to other apps on the system. If you want your spinlock to be hella fast on contention just make sure you `sched_yield` before each retry (or `SwitchToThread` on Windows, and on Darwin you can do a bit better with `thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1)`).

        • PaulDavisThe1st 2 days ago |
          We use spinlocks where appropriate. In the 90s I recall that the general rule of thumb was if the lock is held for <10x the context switch time, spinlocks are generally a better choice. Not sure if that's still true of contemporary architectures.

            The more common pattern in rt/audio code is "try to take the lock, but have an alternate code path if that fails". It's not that there is never going to be contention, but it will be extremely rare, and when it occurs, it probably matters. RWLocks are also a common pattern, with the RT thread(s) being read-only (and still being required to fall back on an alternate code path if the read lock cannot be taken).

          • pizlonator 2 days ago |
            These days, fast lock implementations use the following rough idiom, or some idiom that is demonstrably not any slower even for short critical sections.

                if (LIKELY(CAS(&lock, UNLOCKED, LOCKED))) return;
                for (unsigned i = 0; i < 40; ++i) {
                    /* spin for a bounded amount of time */
                    if (LIKELY(CAS(&lock, UNLOCKED, LOCKED))) return;
                    sched_yield();
                }
                ... /* actual full lock algo goes here */
            
            So, the reason to use spinlocks isn't that they are faster for short critical sections, but that they don't have to CAS on unlock - and so they are faster especially in the uncontended case (and in all other cases too).

              In other words, the modern rule of thumb is something like: if you're going to grab the lock so frequently that the uncontended lock/unlock time shows up as a significant percentage of your execution time, then use a spinlock. That probably implies you are holding the lock for <100x or even <1000x context switch time, or something around there.

            • adastra22 2 days ago |
              Huh, interesting. TIL. Thanks for that!
              • xxs 2 days ago |
                Lots of code is 'lock free' and uses the same pattern (more or less). Attempt the change optimistically; if it fails, spin; if it keeps failing (massive contention), back off.
                • adastra22 2 days ago |
                  A properly lock-free architecture wouldn’t have spin-locks, no?
                  • pjdesno a day ago |
                    A properly lock-free architecture is a spin-lock :-)
                    • xxs 18 hours ago |
                      Spin locks prevent forward-progress guarantees: e.g., when the holding thread is scheduled out (or services an interrupt), no other thread can make progress. Lock-free allows progress - there is no exclusivity. Of course, if you're working on the kernel and controlling the scheduler, etc., the restriction does not apply immediately.

                      Realistically though, lots of lock-free code employs copy-on-write or RCU.

                  • xxs 18 hours ago |
                    of course, no lock.

                    I have written quite a fair share of lock-free code (back when it was new), and generally enjoy writing it (to this day). Yet for most folks it's just too hard to reason about, understand, and maintain.

                    Just a note: lock-free is a weaker guarantee than wait-free.

            • loeg a day ago |
              > if you're going to grab the lock so frequently that the uncontended lock/unlock time shows up as a significant percentage of your execution time, then use a spinlock.

              Yeah, and maybe also consider changing your design because usually this isn't needed.

              • pizlonator a day ago |
                This is (in my experience) a byproduct of good design, so changing the design wouldn't be a great idea.

                Every time I've seen this happen it's in code that scales really well to lots of CPUs while having a lot of shared state, and the way it gets there is that it's using very fine-grained locks combined with smart load balancing. The idea is that the load balancer makes it improbable-but-not-impossible that two processors would ever want to touch the same state. And to achieve that, locks are scattered throughout, usually protecting tiny critical sections and tiny amounts of state.

                Whenever I've gotten into that situation or found code that was in that situation, the code performed better than if it had coarser locks and better than if it used lock-free algorithms (because those usually carry their own baggage and "tax"). They performed better than the serial version of the code that had no locks.

                So, the situation is: you've got code that performs great, but does in fact spend maybe ~2% of its time in the CAS to lock and unlock locks. So... you can get a tiny bit faster if you use a spinlock, because then unlocking isn't a CAS, and you run 1% faster.

          • spacechild1 2 days ago |
            Keep in mind that while try_lock() is realtime-safe, the following unlock() may not be, as it may need to wake threads that have blocked in the meantime!

            So I would only use this pattern for situations where the very fact that a NRT thread tries to acquire the lock already means that RT safety is not a concern anymore (e.g. a device has been disconnected)

            • PaulDavisThe1st a day ago |
              Excellent reminder. Yes, this is both a design defect of try/unlock that can't really be solved, and an implementation defect on several platforms that makes it worse than it actually needs to be (Windows, I'm looking at you)
              • spacechild1 14 hours ago |
                One thing I've been using is a spinlock where the RT thread only uses try_lock() + fallback path and the NRT thread(s) call try_lock() in a loop with pause instructions and occasional thread yielding (or even sleeping). This might waste lots of CPU cycles when a NRT thread tries to acquire a lock that is already held by the RT thread, but assuming this only happens occasionally/rarely, it's a reasonable trade-off.
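                Roughly like this, as a sketch (hypothetical names; x86 pause via _mm_pause; the pause/yield cadence is a tunable):

                    #include <sched.h>
                    #include <stdatomic.h>
                    #include <stdbool.h>
                    #include <immintrin.h>

                    /* RT thread: never blocks; false -> take the fallback path */
                    static bool rt_try_lock(atomic_int *l) {
                        int expected = 0;
                        return atomic_compare_exchange_strong_explicit(
                            l, &expected, 1,
                            memory_order_acquire, memory_order_relaxed);
                    }

                    /* NRT thread: spin on a relaxed read, pause, sometimes yield */
                    static void nrt_lock(atomic_int *l) {
                        for (;;) {
                            for (unsigned i = 0;
                                 atomic_load_explicit(l, memory_order_relaxed) != 0;
                                 i++) {
                                if (i % 64 == 63) sched_yield();
                                else _mm_pause();
                            }
                            int expected = 0;
                            if (atomic_compare_exchange_weak_explicit(
                                    l, &expected, 1,
                                    memory_order_acquire, memory_order_relaxed))
                                return;
                        }
                    }

                    static void rt_unlock(atomic_int *l) {
                        atomic_store_explicit(l, 0, memory_order_release);
                    }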
          • xmodem 2 days ago |
            What type of scenarios come up in RT/audio code where an alternative path is available if you can't get a lock? In most of the concurrent code I've written the options available if I can't get a lock is some subset of try again, try again with backoff, or fail.
            • Archit3ch 2 days ago |
              Let's say you are doing audio processing in the real-time thread. You try to lock to read new parameters, and if you fail to get the lock you do the processing with the old parameters.
        • lilyball 2 days ago |
          On Darwin, it's possible for a pure spinlock to produce a priority inversion deadlock, because Darwin has a quality of service implementation in the kernel that differs from how everyone else handles thread priority. In other kernels, a low-priority thread will still eventually be guaranteed a cpu slice, so if it's holding a spinlock, it will eventually make progress and unlock. On Darwin with Quality of Service, it's possible for higher-QoS threads to preempt lower-QoS threads indefinitely.

          For this reason, on Darwin you want to avoid spinlocks unless you know that all threads touching the spinlock are always running in the same QoS. Instead of spinlocks, your go-to for low-overhead locks there is os_unfair_lock, which is a spinlock variant that donates priority of the blocked threads to the running thread.
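          Basic usage, for reference (from <os/lock.h>):

              #include <os/lock.h>

              static os_unfair_lock lock = OS_UNFAIR_LOCK_INIT;

              void critical(void) {
                  /* blocks rather than spins; the owner inherits the
                     priority of the highest-priority waiter */
                  os_unfair_lock_lock(&lock);
                  /* ... critical section ... */
                  os_unfair_lock_unlock(&lock);
              }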

          • pizlonator 2 days ago |
            I’ve shipped code on Darwin that spinlocks and gets away with it without any noticeable cases of this happening.

            I know it can happen in theory. But theory and practice ain’t the same.

            I worked for Apple when I shipped this too lmao

            • ninkendo 2 days ago |
              Also worked for Apple for 13 years: there is a lot (and I mean a LOT) of priority inversions and abuse of locks in the codebase. Breaking out the internal profiler and actually having a look at whether you were running at the priority you thought can be a very illuminating thing.

              One particular app had a particularly nasty habit of losing priority because most of its work was done in a set of daemons and not in the UI process itself. The foreground app is allowed to use Performance cores and receive USER_INTERACTIVE priority for operations if it wants, but daemons cannot unless they have a mach IPC voucher stating that the UI is blocked on their work. But if you do the wrong thing with a semaphore or a spinlock, you get booted out of the thread group and now you’re running on Efficiency cores at USER_INITIATED priority and suddenly your latency numbers spike 10x on systems with asymmetric cores. I remember solving a particularly bad one which was basically doubling our request latency on phones with E cores.

              The company had a pretty widely spread internal article trying to whack people over the head to stop abusing lock primitives in this way, but we just kept shipping code with this issue.

              (Most of the inversion issues came from the use of semaphores though. The problem got 1000x worse in swift concurrency because using semaphores blocks an entire SC OS thread and you only get as many of those as you have cores. Most Apple Watches and HomePod only have 2 of them so you can easily deadlock. Which most people fixed by just… adding a timeout to their locks. It was horrifying.)

              • scottlamb a day ago |
                > The company had a pretty widely spread internal article trying to whack people over the head to stop abusing lock primitives in this way, but we just kept shipping code with this issue.

                It sounds like this is relevant when developing user app code, not just the kernel or core libraries. Is there an external version of this article? I would be very interested to read more.

                • ninkendo a day ago |
                  TL;DR: Never, ever, ever, ever do this:

                      var valueIWant: String? = nil
                      let sem = DispatchSemaphore(value: 0)
                      someAsyncFunction() { result in
                          // this is called in another thread
                          valueIWant = result
                          sem.signal()
                      }
                      sem.wait()
                      assert(valueIWant != nil)
                  
                  Note, this is mostly about semaphores, so it’s a bit offtopic from the original discussion here which is about spinlocks. But basically, never lock something in expectation of another thread unlocking it. Since semaphores are ownerless, the scheduler has no idea what thread will eventually signal the semaphore, so it can’t boost the priority of any threads you’re waiting on.

                  Not to mention the thread explosion issue: If the task running the completion (ie. signal()) is submitted to a libdispatch queue, the thread pool implementation will spawn more threads if all of them are starved, to eliminate deadlocks, which can lead to thread explosion in addition to the priority inversion issue. This advice dates back to when libdispatch was invented (and the internal article was called The Semaphore Antipattern…)

                  But with Swift Concurrency it’s far, far worse, because the SC thread pool does not spawn new threads if all of them are blocked. Meaning the dispatch block that’s supposed to call the callback may never get run, and you deadlock. Most importantly, this is true even of plain ObjC libdispatch code written 15 years ago: If some ancient framework you’re calling into does the semaphore antipattern, libdispatch is allowed to assume that all Swift Concurrency threads will eventually make forward progress, and so it doesn’t spawn new threads to handle the work, even if all threads are blocked. So ancient code that uses this antipattern suddenly blows up if someone calls it (transitively!) from a swift concurrency context. I’ve squashed countless bugs on this issue alone. (I wrote my own article on this internally too, titled something like “Semaphore abuse: How to deadlock your process and require a device reboot”)

                  IMHO, the semaphore should be officially marked deprecated and trigger warnings… there are no appropriate uses for it. If you’re in a synchronous context and need to block on the result of async code, the best bet is to stop what you’re doing and file a bug to the people writing the async code and tell them to give you a synchronous API (or refactor your code to be async.) Never ever use semaphores to block on async code. Not even once.

          • namibj 2 days ago |
            On Linux you can still easily go extremely slow with plain sched_yield(), as there's no guarantee the thread that holds the lock will get to run, especially in NUMA situations, due to a strong bias against migrating threads across NUMA boundaries. You have to use e.g. the futex API to tell the kernel what you're waiting for.

            Bare spinlocks are very bad in this way, unless you have dedicated hardware threads (with SMT you can pin all but one of the threads with the remaining thread being general purpose, though you may want a way to inform the scheduler which cores to currently prefer due to their pinned tasks being more idle).

        • ot 2 days ago |
          > Reason: locks that have the ability to put the thread to sleep on a queue must do compare-and-swap (or at least an atomic RMW) on `unlock`. But spinlocks can get away with just doing a store-release (or just a store with a compiler fence on X86) to `unlock`.

          This is something I've been thinking about a lot over time: the CAS is only there to atomically determine if there are any sleeping waiters on unlock, so you know whether you have to do a futex_wake. I would really want some way to get away with non-fenced operations (at least on x86), but I don't know if it's just that nobody has figured out how, or there is a fundamental reason why that's not possible.

          • pizlonator 2 days ago |
            You do need a fence in the unlock path though (at least a release fence).

            I think the issue is that if you ask the CPU to just store something (like in a spin lock), whether or not there’s a fence, it’s an operation with limited data flow dependencies so it’s easy for the CPU to execute. Even the fence can be handled using wacky speculation tricks.

            But if you want to do something like, “store this value but only if the old value satisfies some predicate”, then there’s a load and the whole thing is dependent on the load. So you’re asking the CPU to load, then run a predicate, then store, and for that to be fenced, and atomic.

            Strictly more work. I don’t think there’s any trick to make it faster than just the store release.

            • ot 2 days ago |
              > You do need a fence in the unlock path though (at least a release fence).

              Well yes but on x86 that comes for free. The overhead of the full fence brought in by lock cmpxchg or lock xchg is in the order of ~10ns, which for an uncontended lock means that a mutex is almost 2x as slow as a spinlock.

              A load acquire + store release would be a couple of ns (assuming everything in L1 etc...)

          • gpderetta 2 days ago |
            As far as I know it is a fundamental limitation. You need to release the mutex, then check that there were no waiters in this order not to miss wakeups. As the mutex release must be globally visible before the load, release ordering on the mutex is not sufficient as the load could be reordered before the unlock; hence you need a StoreLoad fence which is always the most expensive barrier.

            Consider the implementation of Dekker's algorithm for example.
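            Sketched with hypothetical names (a lock word plus a waiters flag, and a futex_wake wrapper), the unlock path needs:

                /* (1) release the lock */
                atomic_store_explicit(&lock, UNLOCKED, memory_order_release);
                /* StoreLoad: (1) must be globally visible before (2) */
                atomic_thread_fence(memory_order_seq_cst);
                /* (2) check for waiters */
                if (atomic_load_explicit(&waiters, memory_order_relaxed))
                    futex_wake(&lock, 1);

            Without the fence, the load in (2) can be satisfied before the store in (1) becomes visible, and a thread that registers as a waiter in that window is never woken.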

            • gpderetta 2 days ago |
              Also this paper might be relevant: Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated [1]

              [1] https://www.cs.bgu.ac.il/~hendlerd/papers/p168-expensiveSync...

            • gpderetta 2 days ago |
              As a more practical argument, let's suppose x86 had an atomic CAS that guarantees the store is a release but the load is relaxed (unlike normal x86 loads, and like non-temporal loads, it would have no implied LoadLoad or LoadStore ordering).

              This relaxed-load-CAS, coupled with some form of control or data dependency would be sufficient to implement your mutex. But would such a CAS be significantly cheaper than the existing CAS? If it was, you would be able to approximate the strong CAS semantics by adding lfence+sfence after the relaxed CAS. These fences are cheap, so if strong CAS was possible to be implemented this way with significant improvements, intel would have already done it.

              Finally, it is practically possible to implement the sequence store(A);#StoreLoad;load (B) without an explicit fence by using the colocation trick: have A and B be adjacent in memory (on the same cacheline), store to A, then do a wider load on both A and B. Intel does not give multithread guarantees on this, but my understanding is that it works: the wider load fails to be store-forwarded and stalls the pipeline waiting for the previous store to be flushed. In practice this costs about as much as a fence, so in addition to being undefined, is not cheaper.

          • davidtgoldblatt 2 days ago |
            There's two ways I've tried to do this:

            - On the wait side, do the CAS to set the "waiter present" bit. On unlock, do a (relaxed) read of the lock word, and if "waiter present" isn't set, just do a release store to unlock (and go down some slow CAS-y wake path if a waiter is present). On the wait side, never do an un-timed futex wait; just do a series of timed waits, with increasing wait times (so that you still eventually fix things if you hit the unlucky race between the previous holder's check+store sequence). (You can also do some tricks with counters to let waiters do an unconditional sleep once they get their wait acknowledged.)

            - Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).

            Practically, neither was ever "better enough" for anything but microbenchmarks to be worth the complexity.

            • gpderetta 2 days ago |
              > Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).

              If you are doing this, given the cost of membarrier, you are optimizing for almost always uncontended locks. At this point you can make your lock default-owned by the first thread to lock it and have the owner lock and unlock be basically free until it is contended. This is basically the biased locking optimization that Java implements (or used to).

              • davidtgoldblatt a day ago |
                It kinda depends; you only do the membarrier when you're about to sleep anyways, and the non-expedited membarrier() call is just a synchronize_rcu(), so it's not that drastically more expensive than a futex wait.

                You don't necessarily want a biased lock for all this kind of stuff, because "sparsely contended" doesn't necessarily imply thread-associated. E.g. one place I was looking at this for was locks for pages of virtual memory in a heap; no thread "owns" any given heap page, but it was very uncommon to get unlucky and have two threads touching adjacent pages at the exact same time. These kind of "sloppy mutexes" get half the fast-path speedup of biased locks but without the heavily asymmetric performance costs. (At least, that was the theory; like I said it didn't really pan out to be that useful in practice).

        • Animats 2 days ago |
          > just make sure you `sched_yield` before each retry

          Assuming `sched_yield` does something.

          There's a futex congestion problem inside Wine's memory allocator. There are several levels of locks. If you're growing a buffer, in the sense of C's "realloc", and no buffer is available, memory allocation is locked during the allocation of a bigger buffer, copying of the contents, and release of the old buffer. "Push" type operations can force this. Two orders of magnitude performance drops ensue when multi-threaded programs are contending for that lock.[1]

          Inside one of the lock loops is a call to "YieldProcessor".

              static void spin_lock( LONG *lock )
              {
                   while (InterlockedCompareExchange( lock, -1, 0 ))
                       YieldProcessor();
              }
          
          But the actual code for YieldProcessor is a NOP on x86:[2]

              static FORCEINLINE void YieldProcessor(void)
              {
                  #ifdef __GNUC__
                  #if defined(__i386__) || defined(__x86_64__)
                       __asm__ __volatile__( "rep; nop" : : : "memory" );
                  #elif defined(__arm__) || defined(__aarch64__)
                      __asm__ __volatile__( "dmb ishst\n\tyield" : : : "memory" );
                  #else
                      __asm__ __volatile__( "" : : : "memory" );
                  #endif
                  #endif
              }

          Wine devs are aware of this, but the mess is bad enough that no one has tackled it. This is down in the core of what "malloc" calls, so changes there could have unexpected effects on many programs. Needs attention from someone really into mutexes.

          [1] https://forum.winehq.org/viewtopic.php?t=37688

          [2] https://gitlab.winehq.org/wine/wine/-/blob/HEAD/include/winn...

          • pizlonator 2 days ago |
            sched_yield isn’t a nop
          • secondcoming 2 days ago |
            `rep; nop;` is actually the `pause` instruction. On older CPUs it's treated as a standard nop, but on newer CPUs it's a spin-wait hint that saves power and avoids memory-order mis-speculation when the loop exits.

            Spinning on the CMPXCHG is also a bad idea. You should spin on the read and only then attempt the CMPXCHG.

            • bobmcnamara 2 days ago |
              Bingo. Spinning on CMPXCHG can cause livelock.
            • epcoa 2 days ago |
              Just to clarify: the spinning on CMPXCHG is not also a bad idea, the YieldProcessor is correct (a pause), but inside the CMPXCHG loop it should be spinning on a pure unlocked load. Is that correct?
              • secondcoming 2 days ago |
                  No, you should spin on a read. Once you see the value you want, you then try the CMPXCHG. If that succeeds, you exit. If it fails, you go back to spinning on the read.
                • epcoa 2 days ago |
                  What is the difference between a read and a “load” here?
                  • loeg a day ago |
                    Read and load mean the same thing. (I think GP just missed the end of your comment.)

                    You care about exchange vs read/load because of cache line ownership. Every time you try to do the exchange, the attempting CPU must take exclusive ownership of the cacheline (stealing it from the lock owner). To unlock, the lock owner must take it back.

                    If the attempting CPU instead only reads, the line ownership stays with the lock holder and unlock is cheaper. In general you want cache line ownership to change hands as few times as possible.
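                    A minimal test-and-test-and-set sketch of that advice (x86 pause via _mm_pause):

                        #include <stdatomic.h>
                        #include <immintrin.h>

                        static void lock_ttas(atomic_int *l) {
                            for (;;) {
                                /* read-only spin: the line stays shared */
                                while (atomic_load_explicit(
                                           l, memory_order_relaxed) != 0)
                                    _mm_pause();
                                /* only now pay for exclusive ownership */
                                int expected = 0;
                                if (atomic_compare_exchange_weak_explicit(
                                        l, &expected, 1,
                                        memory_order_acquire,
                                        memory_order_relaxed))
                                    return;
                            }
                        }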

              • pizlonator 2 days ago |
                Spinning with pause is slower than spinning with sched_yield according to every test I’ve ever done
                • epcoa 2 days ago |
                  For what it is worth it seems the library in question does both, uses an exponential retry loop, busy read looping 2^i times for the first 7 attempts before then yielding. It seems like there must be some threshold where latency is improved by retrying before yielding, but I don’t test these things for a living.

                  https://github.com/google/nsync/blob/c6205171f084c0d3ce3ff51...

                  • pizlonator a day ago |
                    My data says that always yielding is better than ever busy spinning, except on microbenchmarks, where depending on the microbenchmarks you can get any answer your heart desires.
                  • Animats a day ago |
                    I've tried to figure out where that goes wrong. Read the bug report linked. I've caught this situation in gdb with 20 threads in spinlocks inside realloc, performance in a game down to 1 frame every 2 seconds, and very little getting done. But I don't understand how the three levels of locking involved cause that. Nor do the Wine people.

                    Works fine on Microsoft Windows. Only fails on Wine.

                    • epcoa a day ago |
                      It seems like the loop around InterlockedCompareExchange is a bad idea, since this is a bus lock in a tight loop. Rather, the inner yielding spin loop should just read the value, with the cmpxchg wrapped around it. As for whether sched_yield should just be called in the inner loop or a short nop/pause loop should be attempted for micro-contention reasons, the expert opinion here is don't bother with the nop loop. However, while the nop loop might not be real-world optimal, I doubt it would be causing a catastrophic performance issue.
          • spacechild1 2 days ago |
            `YieldProcessor` is just named confusingly. It is not equivalent to `sched_yield` or `std::thread::yield()`; it's rather a macro that issues a pause instruction (which is indeed what you typically want in a spin loop). The actual Windows equivalent would be `SwitchToThread`.
          • xxs 2 days ago |
            if YieldProcessor() actually did a context switch, it'd be awfully expensive; as mentioned, "rep; nop" is not just a "nop".

            OTOH, the usual way to lock via a busy loop/spin does require some attempts, then backoff with a random duration.

          • jart 2 days ago |
            Has WINE considered using mremap() for its realloc() implementation? It allows you to make realloc() of a sufficient size basically cost nothing. See this screenshot https://x.com/JustineTunney/status/1837663502619889875 On other platforms like MacOS you can achieve the same thing as mremap() by mapping to random addresses, and then using MAP_FIXED to append.
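            For reference, a rough sketch of the mremap() trick on Linux (error handling omitted): the kernel extends the mapping in place when it can, and otherwise moves the page-table entries, so no bytes are copied either way.

                #define _GNU_SOURCE
                #include <sys/mman.h>

                /* grow a page-aligned mapping without copying its contents */
                static void *grow(void *p, size_t old_size, size_t new_size) {
                    return mremap(p, old_size, new_size, MREMAP_MAYMOVE);
                }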
          • cesarb a day ago |
            > But the actual code for YieldProcessor is a NOP on x86:[2]

            > __asm__ __volatile__( "rep; nop" : : : "memory" );

            It might look like a NOP, but "REP NOP" is not a real NOP, it's the PAUSE instruction (which very old processors from the 1990s treated as a normal NOP).

          • exikyut 4 hours ago |
            I know NQP here isn't Not Quite Perl, but I'm not sure what it *is*. Seeking enlightenment!
        • wbl 2 days ago |
          Atomic RMW is all you need. Imagine a 3-state futex lock with states taken, sleeping, and unlocked. The waking thread just needs to see the old value, but writes unlocked unconditionally.

          I don't think this is more expensive than a release store, and certainly not because of the cache line where the write goes, in most cache coherency protocols.

          Granted, this lock does wake up too many waiters, but you did say usually uncontended.

          • pizlonator 2 days ago |
            And what, unconditionally futex wakes everyone?

            That'll be hella slow

            • wbl a day ago |
              You did say uncontended!
              • pizlonator a day ago |
                So you want uncontended unlocks to make a syscall to wake the threads that aren’t there?

                That syscall will be worse than a CAS just because it’s a syscall. Not to mention it’ll have to lock some kernel data structure just to figure out that there aren’t any threads to wake.

                • wbl 9 hours ago |
                  I think I'm not doing a great job of explaining it.

                  There are three states of the mutex: locked=1, unlocked=0, and sleeping=2. A thread comes along to a locked mutex, atomically moves it to sleeping, and sleeps via futex. The freeing thread unconditionally writes 0 and atomically sees the old value: if 2, wake all; if 1, no need to wake anybody.

                  Maybe that is equally expensive to a CAS, but I don't think it ultimately is.
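                  A sketch of that three-state scheme with Linux futexes (essentially the design from Drepper's "Futexes Are Tricky"; error handling omitted, and this variant wakes one waiter rather than all):

                      #include <linux/futex.h>
                      #include <stdatomic.h>
                      #include <sys/syscall.h>
                      #include <unistd.h>

                      /* 0 = unlocked, 1 = locked, 2 = locked + sleepers */
                      static long futex(atomic_int *a, int op, int val) {
                          return syscall(SYS_futex, a, op, val, NULL, NULL, 0);
                      }

                      static void lock(atomic_int *l) {
                          int c = 0;
                          if (atomic_compare_exchange_strong(l, &c, 1))
                              return;                    /* uncontended fast path */
                          if (c != 2)
                              c = atomic_exchange(l, 2); /* announce sleepers */
                          while (c != 0) {
                              futex(l, FUTEX_WAIT_PRIVATE, 2);
                              c = atomic_exchange(l, 2);
                          }
                      }

                      static void unlock(atomic_int *l) {
                          if (atomic_exchange(l, 0) == 2) /* unconditional RMW */
                              futex(l, FUTEX_WAKE_PRIVATE, 1);
                      }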

                  • pizlonator 7 hours ago |
                    “Atomically sees the old value” means you’re doing an atomic RMW, which is more expensive than just a store with a release fence.

                    Atomic RMW is almost as expensive as CAS. Exactly as expensive as CAS on some cpus.

        • spacechild1 2 days ago |
          > If you want your spinlock to be hella fast on contention just make sure you `sched_yield` before each retry

          Shouldn't you rather use pause instructions (_mm_pause, __yield, ISB, etc.) instead of thread switching? That's what I have seen in most spin lock implementations. Maybe it depends on the use case?

          • xxs 2 days ago |
            start with a few attempts w/o pause, just busy CAS; then CAS + pause; then yield (and backoff/sleep with random duration)
            • senderista a day ago |
              No, you shouldn’t be calling CAS in a tight loop, but rather a relaxed load to check if a CAS might be successful, with PAUSE equivalent after each load. After spinning for ~context switch latency you should back off with sched_yield before retrying the spin loop (and eventually start sleeping on backoff with exponential jitter etc).
              • xxs a day ago |
                of course, there is very rarely ever a naked CAS
        • harrison_clarke 2 days ago |
          biased locking can be faster than spinlocking, in the uncontended case

          unbiasing/re-biasing a lock can be quite heavy, though. JVM did it by pausing threads (or waiting for GC to do so)

      • fulafel 2 days ago |
        I'm sure you know but this is the opposite of what real-time conventionally means (you have hard deadlines -> worst case is the chief concern).
    • uvdn7 2 days ago |
      I was thinking the same. There are many mutexes out there, some are better at certain workloads than the rest. DistributedMutex and SharedMutex come to mind (https://github.com/facebook/folly/blob/main/folly/synchroniz..., https://github.com/facebook/folly/blob/main/folly/SharedMute...) Just like hashmaps, it's rarely the case that a single hashmap is better under _all_ possible workloads.
      • pizlonator 2 days ago |
        Yeah.

        I should say, though, that if you're on Windows then I have yet to find a real workload where SRWLock isn't the fastest (provided you're fine with no recursion and with a lock that is word-sized). That lock has made some kind of deal with the devil AFAICT.
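        For reference, the basic API (word-sized, non-recursive, with a shared/reader mode too):

            #include <windows.h>

            static SRWLOCK lock = SRWLOCK_INIT;

            void critical(void) {
                AcquireSRWLockExclusive(&lock);
                /* ... exclusive section; readers would use
                   AcquireSRWLockShared/ReleaseSRWLockShared ... */
                ReleaseSRWLockExclusive(&lock);
            }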

    • ngoldbaum 2 days ago |
      This style of mutex will also power PyMutex in Python 3.13. I have real-world benchmarks showing how much faster PyMutex is than the old PyThread_type_lock that was available before 3.13.
      • miohtama 2 days ago |
        Any rough summary?
      • 0xDEADFED5 2 days ago |
        Can I use PyMutex from my own Python code?
        • ngoldbaum 2 days ago |
          No, it wouldn’t make sense to. Use threading.Lock for that. PyMutex is available in the CPython C API.
      • kagrt 2 days ago |
        I wonder how much it will help in real code. The no-gil build is still easily 50% slower and the regular build showed a slowdown of 50% for Sphinx, which is why the incremental garbage collector was removed just this week.

        Python development is in total chaos on all social and technical fronts due to incompetent and malicious leadership.

        • plesner a day ago |
          I'm very ready to believe your description of the state of Python is true, but I've been out of the loop on Python for a while and I'm interested in more details. Can you expand, or point to any articles?
    • intelVISA 2 days ago |
      Fast locks are an oxymoron vs optimized CAS
      • xxs 2 days ago |
        Indeed, b/c locks need to have their CAS.
    • ot 2 days ago |
      Yeah, that specific benchmark is actually likely to prefer undesirable behaviors, for example pathological unfairness: clearly the optimal scheduling of those threads runs first all the increments from the first thread, then all of the second thread, etc... because this will minimize inter-processor traffic.

      A mutex that sleeps for a fixed amount (for example 100us) on failed lock acquisition will probably get very close to that behavior (since it almost always bunches), and "win" the benchmark. Still, that would be a terrible mutex for any practical application where there is any contention.

      This is not to say that this mutex is not good (or that pthread mutexes are not bad), just that the microbenchmark in question does not measure anything that predicts performance in a real application.

      • pizlonator 2 days ago |
        Yeah! For all we know this performs great on real programs.

        And for all we know it’s absolute trash on real programs.

    • lovidico 2 days ago |
      Definitely a microbenchmark, and probably not generally representative of performance. This page gives pretty good standards for OS benchmarking practice, although admittedly geared more toward academia: https://gernot-heiser.org/benchmarking-crimes.html
  • gok 2 days ago |
    Consider adopting `os_sync_wait_on_address()` on Darwin for your futex needs
    • jart 2 days ago |
      I've used that. It's just as good as ulock although relatively new. The issue is that using this API makes cancelation points no longer atomic. SIGTHR needs to be able to know the exact instruction in memory where an asynchronous signal is delivered when interrupting a wait operation and that's not possible if it's inside an opaque library.
  • jonathrg 2 days ago |
    > It's still a new C library and it's a little rough around the edges. But it's getting so good, so fast, that I'm starting to view not using it in production as an abandonment of professional responsibility.

    What an odd statement. I appreciate the Cosmopolitan project, but these exaggerated claims of superiority are usually a pretty bad red flag.

    • tredre3 2 days ago |
      I'd like to point out that Justine's claims are usually correct. It's just her shtick (or personality?) to use hyperbole and ego-stroking wording. I can see why some might see it as abrasive (it has caused drama before, namely in llamacpp).
      • jancsika 2 days ago |
        This case isn't abrasive, but it's certainly incoherent.

        Name a single case where professional responsibility would require C code advertised as "rough around the edges" to be used anywhere near production. (The weasel words "starting to" do not help the logic of that sentence.)

        I can definitely understand how OP could view this as a red flag. The author should amend it.

      • another-acct 2 days ago |
        I also meant to comment about the grandstanding in her post.

        Technical achievement aside, when a person invents something new, the burden is on them to prove that the new thing is a suitable replacement of / improvement over the existing stuff. "I'm starting to view /not/ using [cosmo] in production as an abandonment of professional responsibility" is emotional manipulation -- it's guilt-tripping. Professional responsibility is the exact opposite of what she suggests: it's not jumping on the newest bandwagon. "a little rough around the edges" is precisely what production environments don't want; predictability/stability is frequently more important than peak performance / microbenchmarks.

        Furthermore,

        > The C library is so deeply embedded in the software supply chain, and so depended upon, that you really don't want it to be a planet killer.

        This is just underhanded. She implicitly called glibc and musl "planet killers".

        First, technically speaking, it's just not true; and even if the implied statement were remotely true (i.e., if those mutex implementations were in fact responsible for a significant amount of cycles in actual workloads), the emotional load / snide remark ("planet killer") is unjustified.

        Second, she must know very well that whenever efficiency of computation is improved, we don't use that for running the same workloads as before at lower cost / smaller environmental footprint. Instead, we keep all CPUs pegged all the time, and efficiency improvements only ever translate to larger profit. A faster mutex too translates to more $$$ pocketed, and not to less energy consumed.

        I find her tone of voice repulsive.

        • vitus 2 days ago |
          I agree overall with your sentiment but wanted to comment on one of your statements that I perceived to be hyperbole.

          > Second, she must know very well that whenever efficiency of computation is improved, we don't use that for running the same workloads as before at lower cost / smaller environmental footprint. Instead, we keep all CPUs pegged all the time, and efficiency improvements only ever translate to larger profit. A faster mutex too translates to more $$$ pocketed, and not to less energy consumed.

          It depends on the use case. If you can serve the same number of users / requests with fewer machines, then you buy and run fewer machines. (Yes, saving energy, but also saving on both capex and opex.)

          Also, when you're talking about anything resembling interactivity (as you might in the context of, say, a webserver), you really don't want to run anywhere close to 100% average utilization. With unbounded queues, you end up with arbitrarily high wait times; with bounded queues, you end up serving 503s and 429s and other server errors.

          That said, my experience with modern webservers is that you generally don't rely on mutexes for synchronizing most work across worker threads, and instead you try to keep your workloads as embarrassingly parallel as possible.

      • asveikau 2 days ago |
        She claimed POSIX changed for her use case, and it's not true; POSIX disallows what she said.

        Indeed, the first few bullet points of her writeup on how the lock works (compare-and-swap for uncontended, futex for contended) describe how everybody has implemented locks for about 20 years, ever since the futex was introduced for exactly this. Win32 critical sections from even longer ago work the same way.

      • __turbobrew__ 2 days ago |
        The drama in llamacpp was not due to language; it was due to jart making false claims and having absolutely no idea how memory maps work, in addition to needlessly changing file headers to contain her name out of vanity.

        I don't think anybody can deny that jart is a smart person, but I think there is more to it than just abrasive language that people take issue with.

    • almostgotcaught 2 days ago |
      It's especially a red-flag since an enormous number of projects (almost all of them?) will never tolerate shipping fat binaries (ie what cosmopolitan C is in reality).
      • another-acct 2 days ago |
        Agreed; this is what I've always (silently) thought of those fat binaries. Absolute stroke of genius, no doubt, and also a total abomination (IMO) from a sustainability perspective.
      • ataylor284_ 2 days ago |
        The core of this is a library called nsync. It appears most of the improvements by Justine are upstreamed into nsync itself, which doesn't have any of the baggage of cosmopolitan. Whatever your opinion of the project or author, they've made a good effort to not lock you in.
      • intelVISA 2 days ago |
        Ironic given the generous size of the average Go binary.
    • pmarreck 2 days ago |
      I feel that there’s a certain amount of hubris that comes along with spending long periods of time solo-coding on a computer, and perhaps unwittingly starved of social contact. Without any checks on you or your work’s importance (normally provided by your bog-standard “job”), your achievements take on a grandeur that they might not have broadly earned, as impressive as they might be.

      An example is APE (which I otherwise feel is a very impressive hack). One criticism might be “oh, so I not only get to be insecure on one platform, I can be insecure on many all at once?”

      The longer you spend in technology, the more you realize that there are extremely few win-wins and very many win-somes, lose-somes (tradeoffs)

    • fefe23 2 days ago |
      Have you considered that you may have a different kind of humor than Justine?

      Why would you even post this here? Who do you think this is helping?

      • jonathrg 2 days ago |
        It doesn't clearly come across as a joke.
        • jart 2 days ago |
          It's a splash of dry humor on a personal blog in an information dense article.
          • jonathrg 2 days ago |
            OK, Poe's law at work then :)
        • int_19h 2 days ago |
          I think the domain name for her website is justine.lol for a reason.
      • another-acct 2 days ago |
        I think it's fair to comment not only on the subject, but on the writing itself, too.

        And it might help Justine improve her writing (and reach a larger audience -- after all, blog posts intend to reach some audience, don't they?). Of course you can always say, "if you find yourself alienated, it's your loss".

    • gr4vityWall 2 days ago |
      It came off as humor to me, at least.
      • orochimaaru 2 days ago |
        I agree. It seemed like humor to me. I have been accused of using off color humor with hyperbole before. So maybe the appeal is limited.
    • jay-barronville 2 days ago |
      > I appreciate the Cosmopolitan project, but these exaggerated claims of superiority are usually a pretty bad red flag.

      In general, I agree with your sentiment, but Justine is simply a different beast. She does tend to use hyperbolic language a fair amount, but she delivers so much awesomeness that I make an exception for her.

    • kelnos 2 days ago |
      Justine seems like a fairly brilliant and creative person, but in production I don't think I'd want to use a libc that's "new" or "rough around the edges".

      Getting "so fast" is not my first priority when I run things in production. That's stability, predictability, and reliability. Certainly performance is important: better-performing code can mean smaller infrastructure, which is great from a cost and environmental perspective. But fast comes last.

  • stonethrowaway 2 days ago |
    > In 2012, Tunney started working for Google as a software engineer.[4] In March 2014, Tunney petitioned the US government on We the People to hold a referendum asking for support to retire all government employees with full pensions, transfer administrative authority to the technology industry, and appoint the executive chairman of Google Eric Schmidt as CEO of America.

    the absolute madman

    • Dansvidania 2 days ago |
      I wonder what they (Tunney) think of that now.
    • 01HNNWZ0MV43FF 2 days ago |
      Wild. Then again, in 2012 I was on a grippy sock vacation.
    • senderista a day ago |
      from a self-aggrandizing wikipedia article, I presume?
  • wallstprog 2 days ago |
    " I also managed to make contended nsync mutexes go 30% faster than nsync upstream on AARCH64, by porting it to use C11 atomics."

    Curious about this -- so what do C11 atomics use underneath? At least on Linux, C++11 atomics use pthreads (not the other way around).

    • rcxdude 2 days ago |
      It depends on which atomics. In principle most of them should map to an underlying CPU primitive, and only fall back to a mutex if one isn't supported on the platform.
    • loeg 2 days ago |
      Atomics mostly map to underlying compiler / CPU intrinsics, not pthreads.
    • jeffbee 2 days ago |
      I am also curious about this and the ambiguity of "AARCH64". There are 64-bit ARM ISA versions without atomic primitives, and on these what looks like an atomic op is actually a library retry loop with potentially unbounded runtime. The original AWS Graviton CPU had this behavior. The version of the ISA that you target can have a significant performance impact.
    • jart 2 days ago |
      nsync has wrapper macros for all the various atomics libraries, which prevented it from using two things.

      1. Weak CAS. nsync always uses strong CAS upstream to make the portability abstraction simpler. Being able to use weak CAS when appropriate helps avoid code being generated for an additional loop.

      2. Updating the &expected parameter. nsync upstream always manually does another relaxed load when a CAS fails. This isn't necessary with the C11 atomics API, because it gives you a relaxed load of the expected value for free when it fails.

      Being able to exploit those two features resulted in a considerable improvement in nsync's mu_test.c benchmark for the contended mutex case, which I measured on RPI5.
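      A sketch of what that buys, with a hypothetical lock word: the C11 weak CAS may fail spuriously, so it belongs in a loop anyway, but on failure it refreshes `expected` itself and can compile to a bare LL/SC pair on AArch64 instead of a nested retry loop.

          #include <stdatomic.h>

          static void set_flag(atomic_uint *w, unsigned flag) {
              unsigned expected =
                  atomic_load_explicit(w, memory_order_relaxed);
              while (!atomic_compare_exchange_weak_explicit(
                         w, &expected, expected | flag,
                         memory_order_acq_rel, memory_order_relaxed)) {
                  /* `expected` now holds the current value; no
                     manual relaxed reload needed */
              }
          }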

    • tialaramex 2 days ago |
      C++ insists on providing a generic std::atomic type wrapper. So despite my type Goose being almost four kilobytes, std::atomic<Goose> works in C++

      Of course your CPU doesn't actually have four-kilobyte atomics. So this feature actually just wraps the type with a mutex. As a result, you're correct in this sense: atomics "use pthreads" to get a mutex to wrap the type.

      C++ also provides specializations, and indeed it guarantees specializations for the built-in integer types with certain features. On real computers you can buy today, std::atomic<int> is a specialization which will not in fact be a mutex wrapper around an ordinary int, it will use the CPU's atomics to provide an actual atomic integer.

      In principle C++ only requires a single type to be an actual bona fide atomic rather than just a mutex wrapper, std::atomic_flag -- all the integers and so on might be locked by a mutex. In practice on real hardware many obvious atomic types are "lock free", just not std::atomic<Goose> and other ludicrous things.

    • senderista a day ago |
      > At least in Linux, C++11 atomics use pthreads (not the other way around).

      I have no idea what you can possibly mean here.

      Edit: Oh, you must have meant the stupid default for large atomic objects that just hashes them to an opaque mutex somewhere. An invisible performance cliff like this is not a useful feature, it's a useless footgun. I can't imagine anyone serious about performance using this thing (that's why I always static_assert() on is_always_lock_free for my atomic types).

  • bjourne 2 days ago |
    I made a benchmark on this last year when I didn't know how slow pthread mutexes were: https://stackoverflow.com/questions/76965112/why-are-pthread... For my use case, the mutex wait amounted to roughly 40% of the total runtime and spinlocks were way faster. Perhaps nsync or Cosmopolitan would have made my code much faster.

    I still believe the FUD around spinlocks is overstated. For "normal" HPC code the number of threads should be <= the number of cores. In that scenario spinlocking will minimize latency, which is likely what you care about the most. It beats having your performance ruined by DVFS.

    • gpderetta 2 days ago |
      If I'm understanding correctly, you measured pthread_lock as having about twice the overhead of a spin lock. As discussed elsethread, this is expected on x86 as a spinlock only needs an expensive CAS on lock and a cheap store on unlock, while a pthread_lock needs a CAS on both lock and unlock.
      • bjourne a day ago |
        Actually, in the micro benchmark the mutexes were five times slower than the spinlocks. In a real hpc application I reduced the runtime by 30-40% just by replacing them with spinlocks.
    • senderista a day ago |
      You can make spinlocks safe enough by backing off with sched_yield and timed sleeps, but at that point I'd probably rather use a ticket lock since I can roughly predict the wait time until my turn.
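      A minimal ticket lock sketch: FIFO by construction, and my_ticket minus serving tells you how many holders are ahead of you.

          #include <stdatomic.h>

          typedef struct {
              atomic_uint next;    /* next ticket to hand out */
              atomic_uint serving; /* ticket currently allowed in */
          } ticket_lock_t;

          static void ticket_lock(ticket_lock_t *l) {
              unsigned me = atomic_fetch_add_explicit(
                  &l->next, 1, memory_order_relaxed);
              while (atomic_load_explicit(
                         &l->serving, memory_order_acquire) != me)
                  ; /* spin; add pause/yield/sleep backoff here */
          }

          static void ticket_unlock(ticket_lock_t *l) {
              atomic_fetch_add_explicit(
                  &l->serving, 1, memory_order_release);
          }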
  • amiga386 2 days ago |
    If it's so good, why haven't all C libraries adopted the same tricks?

    My bet is that its tricks are only always-faster for certain architectures, or certain CPU models, or certain types of workload / access patterns... and proper benchmarking of varied workloads on all supported hardware would not show the same benefits.

    Alternatively, maybe the semantics of the pthread API (that cosmopolitan is meant to be implementing) are somehow subtly different and this implementation isn't strictly compliant to the spec?

    I can't imagine it's that the various libc authors aren't keeping up with state-of-the-art research on OS primitives...

    • jitl 2 days ago |
      > I can't imagine it's that the various libc authors aren't keeping up in state-of-the-art research on OS primitives...

      is this sarcasm?

      (I don't know any libc maintainers, but as a maintainer of a few thingies myself, I do not try to implement state-of-the-art research, I try to keep my thingies stable and ensure the performance is acceptable; implementing research is out of my budget for "maintenance")

      • amiga386 2 days ago |
        But if you maintain a few thingies, you'd probably know about rival thingies that do a similar thing, right?

        If the rival thingies got a speed boost recently, and they were open source, you'd want to have a look at how they did it and maybe get a similar speed boost for yourself.

        • tialaramex 2 days ago |
          This is nowhere near as common as you seem to think, and mostly only happens for the narrow cases where somebody is obsessed with a particular problem so that they'd actually want to read other people's solutions. Most of the implementers do not have that sort of relationship to a problem they solved in the past.

          If in December you make a general purpose stable sort that's 25% faster than his, Orson Peters is probably going to read your code and try to apply ideas from it. But sorting is the thing Peters really cares about, the people who wrote the stable sort in say Microsoft's STL (their C++ standard library implementation) even if they still work there won't care enough to go back and change that unless told to do so.

        • jitl 2 days ago |
          It depends on the calculus about (time) budget and stability. Maybe I consider performance already acceptable, and don't have time budget to investigate beyond that. Maybe I look at "nsync", see its mutex (may) change the fairness semantics, and then decide not to adopt it because this may break my callers; or it's enough that it may change the fairness semantics and I don't have the budget to test nsync, or a new implementation based on the nsync algorithm, to determine whether the semantics differ.
    • dist-epoch 2 days ago |
      Politics, not-invented-here syndrome, old maintainers.

      It takes forever to change something in glibc, or the C++ equivalent.

      There are many kinds of synchronization primitives. pthreads only supports a subset. If you are limiting yourself to them, you are most likely leaving performance on the table, but you gain portability.

    • aseipp 2 days ago |
      Those projects often have dozens of other priorities beyond just one specific API, and obsessing over individual APIs isn't a good way to spend the limited time they have. In any case, as a concrete example to disprove your claim, you can look at malloc and string routines in your average libc on Linux.

      glibc's malloc is tolerable but loses handily to more modern alternatives in overall speed and scalability (it fragments badly and degrades over time, not to mention it has a dozen knobs, like MALLOC_ARENA_MAX, that can deeply impact real-life workloads). musl's malloc is completely awful in terms of performance at every level; in a multithreaded program, using musl's allocator will destroy your performance so badly that it nearly qualifies as malpractice, in my experience.

      musl doesn't even have things like SIMD-optimized string comparison routines. You would be shocked at how many CPU cycles in a non-trivial program are spent on those tasks; yes, it absolutely shows up in non-trivial profiles, and yes, improving this improves all programs nearly universally. glibc's optimized routines are good, but they can always, it seems, become faster.

      These specific things aren't "oh, they're hyper specific optimizations for one architecture that don't generalize". These two things in particular -- we're talking 2-5x wall clock reduction, and drastically improved long-term working set utilization, in nearly all workloads for any given program. These are well explored and understood spaces with good known approaches. So why didn't they take them? Because, as always, they probably had other things to do (or conflicting priorities like musl prioritizing simplicity over peak performance, even when that philosophy is actively detrimental to users.)

      I'm not blaming these projects or anything. Nobody sets out and says "My program is slow as shit and does nothing right, and I designed it that way and I'm proud of it." But the idea that the people working on them have made only the perfect Pareto frontier of design choices just isn't realistic in the slightest and doesn't capture the actual dynamics of how most of these projects are run.

      • jart 2 days ago |
        > musl doesn't even have things like SIMD optimized string comparison routines. You would be shocked at how many CPU cycles in a non-trivial program are spent on those tasks

        Building GNU Make with Cosmo or glibc makes cold startup go 2x faster for me on large repos compared to building it with Musl, due to vectorized strlen() alone (since SIMD is 2x faster than SWAR). I sent Rich a patch last decade adding SSE to strlen(), since I love Musl, and Cosmo is based on it. But alas, he didn't want it. Even though he seems perfectly comfortable using ARM's strlen() assembly.
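
        For reference, the SWAR approach is the classic zero-byte-in-a-word test (a simplified sketch; musl's actual code differs in the details):

          #include <stddef.h>
          #include <stdint.h>

          // scan 8 bytes per iteration; (v - 0x01...) & ~v & 0x80... is
          // nonzero iff the word v contains a zero byte
          size_t swar_strlen(const char *s) {
              const char *p = s;
              while ((uintptr_t)p % sizeof(uint64_t)) {  // align to a word
                  if (!*p) return p - s;
                  p++;
              }
              const uint64_t *w = (const uint64_t *)p;
              uint64_t v;
              while (!(((v = *w) - 0x0101010101010101ull) &
                       ~v & 0x8080808080808080ull))
                  w++;
              for (p = (const char *)w; *p; p++) {}      // locate the exact byte
              return p - s;
          }

        An SSE version instead compares 16 bytes per instruction with _mm_cmpeq_epi8 plus _mm_movemask_epi8, which is where the roughly 2x over SWAR comes from.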

        > glibc's malloc is tolerable but fails handily to more modern alternatives in overall speed and scalability

        The focus and attention I put into cosmo mutexes isn't unique. I put that care into everything else too, and malloc() is no exception. Cosmo does very well at multi-threaded memory allocation. I can pick benchmark parameters where it outperforms glibc and jemalloc by 100x. I can also pick params where jemalloc wins by 100x. But I'm reasonably certain cosmo can go faster than glibc and musl in most cases while using less memory too. You have Doug Lea to thank for that.

        Every day is a good day working on cosmo, because I can always find an opportunity to dive into another rabbit hole. Even ones as seemingly unimportant as clocks: https://github.com/jart/cosmopolitan/commit/dd8544c3bd7899ad...

        • senderista a day ago |
          > You have Doug Lea to thank for that.

          Wait, you don't mean your allocator is based on dlmalloc, do you?

          • jart a day ago |
            Yes. Is there something wrong with that? If your program links pthread_create(), then cosmo creates a dlmalloc arena for each core on your computer and uses sched_getcpu() to index into them, combined with raw nsync locks.
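
            In other words, something like this (a hypothetical sketch with made-up arena_* helpers, not cosmo's actual code):

              #include <sched.h>   // sched_getcpu(); needs _GNU_SOURCE on glibc
              #include <stddef.h>

              struct arena;                               // opaque dlmalloc-style heap
              void arena_lock(struct arena *);            // hypothetical helpers
              void arena_unlock(struct arena *);
              void *arena_alloc(struct arena *, size_t);

              #define MAX_ARENAS 64
              static struct arena *g_arenas[MAX_ARENAS];  // one heap per core
              static int g_narenas;                       // set to core count at init

              void *arena_malloc(size_t n) {
                  int cpu = sched_getcpu();               // which core are we on?
                  if (cpu < 0) cpu = 0;                   // fallback if unsupported
                  struct arena *a = g_arenas[cpu % g_narenas];
                  arena_lock(a);                          // raw per-arena lock
                  void *p = arena_alloc(a, n);            // dlmalloc within this arena
                  arena_unlock(a);
                  return p;
              }

            Threads on different cores then mostly take different locks, so allocation scales with core count instead of serializing on one heap.
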
      • wrsh07 2 days ago |
        Another example is hash maps: all the large companies have built better maps in C++ (folly, absl), but the number of apps that are performance-sensitive and still use std::unordered_map will be astounding forever.

        (Why not upstream? ABI compatibility, which is apparently a sufficient veto reason for anything in C++.)

        • menaerus a day ago |
          No, that's rubbish that became a meme. There is no universally best hash-map design. Each design comes with its own trade-offs, as with open addressing vs. separate chaining for collision resolution -- iterator instability and memory-bandwidth-intensive rehashing, to name a few.

          The design of your hash-map, or of any other algorithm or data structure, is always and only dictated by your workload. A microbenchmark suite from some company or person will hardly represent that reality.

          So the more generous answer would be that it's not because of the "ABI" meme but because std::unordered_map is just fine for most cases out there. And that is what a general-purpose standard library ought to be able to support.

          • wrsh07 a day ago |
            Lol this is so incorrect it's almost funny. Google saved something like a single-digit % of CPU and memory fleet-wide by switching to this map. That's not a microbenchmark.
            • menaerus a day ago |
              Learn to read with understanding and learn to have some respect as well. Never have I said that it cannot make a difference but that it's not universal and that it depends on the workload.
              • wrsh07 44 minutes ago |
                A microbenchmark suite is very different from fleet-wide profiling.

                The whole point of switching every single use of unordered_map to node/flat hash map is that it is always better. It always uses less memory (much less memory per node).

                Edit: what did I say that was disrespectful? I called you out for being wrong because you're very wrong

    • Const-me 2 days ago |
      My guess is: because what’s in the current standard libraries and OSes is good enough.

      Synchronizing multiple CPU cores is fundamentally slow; there’s no way around it. They are far apart on the chip, and sometimes even on different chips with some link in between. Measured in CPU cycles, that latency is substantial.

      It’s possible to avoid with good old software engineering, and over time, people who wanted to extract performance from their multi-core CPUs became good at it.

      When you’re computing something parallel which takes minutes, you’ll do great if you update the progress bar at a laughable 5 Hz. Synchronizing cores 5 times each second costs nothing regardless of how efficient is the mutex.

      When you’re computing something interactive like a videogame, it’s often enough to synchronize cores once per rendered frame, which often happens at 60Hz.

      Another notable example is multimedia frameworks. These handle realtime data arriving at high frequencies, like 48 kHz for audio, and they do non-trivial compute in their effect transforms and codecs, so they need multiple cores. But they can tolerate a bit of latency, so they just batch the samples. This dramatically cuts IPC costs because you only need to lock these mutexes at 100 Hz when batching 480 samples = 10 ms of audio.
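
      Sketched out, the batching amortization looks like this (fill_samples and ring_push are hypothetical helpers, not any real framework's API):

        #include <pthread.h>

        #define BATCH 480  // 10 ms of 48 kHz audio

        void fill_samples(float *buf, int n);      // hypothetical producer
        void ring_push(const float *buf, int n);   // hypothetical shared queue

        static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;

        void produce_block(void) {
            float block[BATCH];
            fill_samples(block, BATCH);            // produce locally, no lock held
            pthread_mutex_lock(&q_lock);
            ring_push(block, BATCH);               // one synchronized hand-off
            pthread_mutex_unlock(&q_lock);
            // 48000 samples/s / 480 samples/batch = 100 lock acquisitions per second
        }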

      • astrange 2 days ago |
        > They are far apart on the chip, and sometimes even on different chips with some link between.

        They aren't always. On a NUMA machine some are closer than others; M-series is an example. The cores are in clusters where some of the cache is shared, so atomics are cheap as long as it doesn't leave the cluster.

      • monocasa 2 days ago |
        > When you’re computing something interactive like a videogame, it’s often enough to synchronize cores once per rendered frame, which often happens at 60Hz.

        That's not really the right way to do it. If you're actually using multiple cores in your code, you want to build a data dependency graph and let the cores walk that graph. Max node size ends up being something loosely like 10k items to be processed. You'll typically see hundreds of synchronization points per frame.

        This is the base architecture of stuff like Threading Building Blocks and Rust's rayon library.
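
        The core of that scheduling style is an atomic dependency count per node (a rough sketch; submit_to_pool() is a hypothetical thread-pool enqueue):

          #include <stdatomic.h>

          struct node {
              atomic_int deps_left;          // unfinished dependencies
              void (*run)(struct node *);    // processes this node's ~10k items
              struct node **out;             // nodes that depend on us
              int nout;
          };

          void submit_to_pool(struct node *);  // hypothetical enqueue to workers

          // called by a worker after n->run(n) completes
          void finish(struct node *n) {
              for (int i = 0; i < n->nout; i++)
                  // whoever drops a successor's count to zero submits it,
                  // so each "synchronization point" is one atomic op per edge
                  if (atomic_fetch_sub(&n->out[i]->deps_left, 1) == 1)
                      submit_to_pool(n->out[i]);
          }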

    • leni536 2 days ago |
      Are there ABI considerations for changing a pthread mutex implementation?
    • pnt12 2 days ago |
      > If it's so good, why haven't all C libraries adopted the same tricks?

      A man and a statistician are walking down the street when they see a 50€ bill. The statistician keeps walking, but the man stops and says "hey, look at this cash on the floor!". The statistician, unimpressed, says "must be fake, or someone would have already picked it up", and continues walking. The other man grabs the cash and keeps it.

  • Uehreka 2 days ago |
    So on the one hand, all this Cosmo/ape/redbean stuff sounds incredible, and the comments on these articles are usually pretty positive and don’t generally debunk the concepts. But on the other hand, I never hear mention of anyone else using these things (I get that not everyone shares what they’re doing in a big way, but after so many years I’d expect to have seen a couple project writeups talk about them). Every mention of Cosmo/ape/redbean I’ve ever seen is from Justine’s site.

    So I’ve gotta ask: Is there a catch? Are these tools doing something evil to achieve what they’re achieving? Is the whole thing a tom7-esque joke/troll that I don’t get because I’m not as deep into compilers/runtimes? Or are these really just ingenious tools that haven’t caught on yet?

    • almostgotcaught 2 days ago |
      > So I’ve gotta ask: Is there a catch? Are these tools doing something evil to achieve what they’re achieving?

      it's not that complicated; they're fat binaries (plus i guess a lot of papering over the differences between the platforms) that exploit a quirk of the Thompson shell:

      > One day, while studying old code, I found out that it's possible to encode Windows Portable Executable files as a UNIX Sixth Edition shell script, due to the fact that the Thompson Shell didn't use a shebang line.

      (https://justine.lol/ape.html)

      so the answer is simple: i can't think of anyone that wants to ship fat binaries.

      • fragmede 2 days ago |
        Anyone focused on customer experience wants to ship a binary that just works, without customers having to know what a fat binary even is.
        • thefaux 2 days ago |
          I don't think macOS will allow you to run a downloaded cosmo binary without going into security settings and enabling it. That's not an experience I'd personally want my customers to have, which means if you care about Mac normies, this isn't a viable approach.
          • fragmede 2 days ago |
            I don't know what magic is involved, but I was able to download e.g. https://cosmo.zip/pub/cosmos/bin/md5sum and run it in a terminal without having to run 'xattr -d com.apple.quarantine' on it.
        • bluGill 2 days ago |
          Unless the customer is one who has issues with large downloads. Not everyone has fiber to the home and massive storage on a modern computer. Some people have entry-level systems and settle for slow internet. Often they are in poor or "third world" places, which means maybe you don't care about them, but they exist.
          • 01HNNWZ0MV43FF 2 days ago |
            Fat binaries are fine. Electron is a fat binary in a different sense.
          • 0cf8612b2e1e 2 days ago |
            Given how fat modern websites are, compassion for the bandwidth deprived seems rare.
            • bluGill 2 days ago |
              Very much. Which is really a problem for those who do have bandwidth limitations. Or even low end phones / old computers. Or even new computers that are not very powerful.
      • cycomanic 2 days ago |
        Correct me if I'm wrong, but I always thought Apple universal binaries are fat binaries? So why did Apple build this ability if nobody wants it?
        • almostgotcaught 2 days ago |
          > So why did Apple build this ability if nobody wants it?

          did you miss the part where they transitioned from x86 to arm64 in 2020?

          • cycomanic 2 days ago |
            I don't get your point. People were arguing that one doesn't need fat binaries because cross compiling and having different binaries is fine. Apple clearly thought differently when they transitioned from x86 to arm (not their first architecture transition either).

            So now you're saying because apple finished the transition fat binaries are useless again? What about other platforms?

            • almostgotcaught 2 days ago |
              > finished the transition fat binaries are useless again?

              yup that's exactly what i'm saying.

            • oriolid a day ago |
              > apple finished the transition fat binaries are useless again?

              Is the transition really finished? I'm writing this on an x86_64 MacBook with a browser that is distributed as an x86_64/arm64 universal binary.

          • E39M5S62 a day ago |
            And PowerPC to x86 before that.
        • yencabulator a day ago |
          Apple's fat binaries were never very "universal". They put in a codepath where macOS detects and handles macOS-arm64-and-x86_64 binaries. That is something where they can support both variants, and are themselves in control. Cosmopolitan libc is a pile of hacks by comparison -- more like somebody making Linux boot on a 4004 CPU than what Apple did.
    • amiga386 2 days ago |
      APE works through cunning trickery that might get patched out any day now (and in OpenBSD, it has been).

      Most people producing cross-platform software don't want a single executable that runs on every platform, they want a single codebase that works correctly on each platform they support.

      With that in mind, languages like Go letting you cross-compile for all your targets (provided you avoid CGO) are delightful... but the runs-three-ways magic trick of APE, while really clever, doesn't inspire confidence that it'll work forever, and for the most part it doesn't gain you anything. Each platform has its own packaging/signing requirements. You might as well compile a different target for each platform.

      • 0x1ceb00da 2 days ago |
        Wasn't the ELF format modified upstream to accommodate cosmo? That makes it kinda official. Still, it's hard to see a use case for it. If you want everyone to be able to run your program, just write a web app, a Win32 program, or a Java applet. 20-year-old Java applets still run on modern JVMs.
        • bcoates 2 days ago |
          You can't reasonably assume the end user system has a system JVM installed (they probably don't, in fact) so they're not really an alternative to a fat binary -- if you can install dependencies, you can just pick a single-target binary while you're at it.
          • bpfrh a day ago |
            But you can provide that JVM for different platforms in a zip file/installer.

            The more difficult problems with cross-platform apps are always native look and feel and calling native APIs in an idiomatic and performant way.

        • bmacho 2 days ago |
          Justine has similar claims about POSIX allowing binary in shell scripts now

          > This is an idea whose time has come; POSIX even changed their rules about binary in shell scripts specifically to let us do it.

          https://justine.lol/cosmo3/

          > Jilles Tjoelker from the FreeBSD project played an instrumental role in helping me to get the POSIX rules changed to allow binary in shell scripts, which is what made this project possible.

          https://justine.lol/ape.html

          but that's not true, as recently discussed here: https://news.ycombinator.com/item?id=41636569

        • int_19h 2 days ago |
          A web app is, well, a web app. Many things don't fit this format, e.g. command line tools.

          A Win32 program will not run out of the box on either Linux or macOS. Neither will a Java app.

          The nice thing about Cosmopolitan is that it "just works" as far as end user is concerned. But without firm support from the OSes involved, it is inevitably a hack with questionable long-term stability prospects.

          What we really need is some kind of standardized low-level (think LLVM bitcode or wasm) architecture- and platform-agnostic binary format that can be JIT-compiled to the same native code that a C compiler would have produced from the source. And that is supported by all major OSes out of the box.

          • jart 2 days ago |
            Yes that's exactly what we need. And we'll be laughing all the way to the bank when we bundle a browser toolbar with adware into its installer ten years down the road. Oh wait, Java already did this. OTOH Cosmopolitan gives you complete autonomy. You don't need a JVM to run a simple native command line program on multiple OSes, and I proved that.
            • int_19h 2 days ago |
              JVM bytecode is obviously not what I was talking about - it is neither low-level (if you can't efficiently compile the entirety of C into it, it's not good enough for this purpose), nor did it ever have universal first-class OS support.

              And stuff like https://github.com/jart/cosmopolitan/issues/1263 shows that just because it works today, it doesn't mean that it'll work tomorrow. What's the guarantee that the same won't happen on Linux or macOS eventually?

              Don't get me wrong, APE is a hell of a hack, and I love it for that. But it can't be a truly stable platform without guarantees from the underlying OSes to keep all of that machinery working the same way. And if we could actually get them to agree on something like that, why not get them to agree on an actual portable executable format that is future-proof also wrt new CPU architectures?

              (Note, by the way, that this is orthogonal to the whole notion of fat binaries that contain arch-specific code. That is still a useful optimization to have in practice even if one has portable bytecode, alongside said portable bytecode.)

              • jart 2 days ago |
                I agree. As a user you should have those guarantees. Here's how I'm helping you get them.

                - With Windows, APE is just PE and WIN32 so those are already guaranteed by Microsoft.

                - With Linux, I've been working with kernel maintainers to build consensus around a patch https://justine.lol/ape.patch that will let your APE binaries be loaded by binfmt_elf. This will provide you with a guarantee that they'll still work, even if someone has, for instance, registered a binfmt_misc entry like WINE that's intercepting MZ executables.

                - With FreeBSD, I'm also working to build consensus around another patch that will help the kernel load APE natively as a normal ELF executable. That means you won't need to rely on the system shell to load your program into memory.

                - I've been writing a specification for APE that documents this file format, its ABI, and its historical encodings. https://github.com/jart/cosmopolitan/blob/master/ape/specifi...

                One day in the future, when we have the support of major OSes for the native loading of portable binaries, we'll look back at the APE shell script bootstrapping process similar to how we look back at self-extracting zip archives. When Phil Katz first invented PKZIP, it was useful to prepend the software to each archive. But nowadays it isn't necessary, since all OSes come with support for PKZIP built in.

                • int_19h 2 days ago |
                  I rather suspect that Apple will not play along.

                  That said, a fat binary format that works for both Windows and Linux is already very useful.

                  • jart 2 days ago |
                    If we earn the support of Linux and FreeBSD, then we might stand a chance. Apple might ask for a better way to secure ape executables, which could be potentially exciting, since it'd be nice to have a vendor-neutral solution to code signing.
          • matheusmoreira 2 days ago |
            > But without firm support from the OSes involved, it is inevitably a hack with questionable long-term stability prospects.

            Case in point:

            https://github.com/jart/cosmopolitan/blob/master/libc/sysv/s...

            The system call numbers of all the unixlikes are bitwise packed into a single number. There is exactly one of those columns which is stable: the Linux one. Everything else is not part of the binary interface of their respective operating systems.

            I've written about how Linux is special in this regard:

            https://www.matheusmoreira.com/articles/linux-system-calls

            It's a neat hack but I'm afraid it's in grave danger of falling victim to the Darth Vader of OS ABI stability.

            https://lwn.net/Articles/806870/

              Program to the API rather than the ABI.
              When we see benefits, we change the ABI
              more often than the API.
            
              I have altered the ABI.
              Pray I do not alter it further.
              
              -- Theo de Raadt
      • fragmede 2 days ago |
        > Most people producing cross-platform software don't want a single executable that runs on every platform

        They don't? Having one file to download instead of a maze of "okay so what do you have" is way easier than the current mess. It would be very nice not to have to ask users what platform they're on, they just click a link and get the thing.

        • bluGill 2 days ago |
          Trade-offs. While there is a point to that, memory, bandwidth, and storage are not free. Users with constraints will want a smaller executable. Of course, all three are pretty cheap these days, so more and more you can get by with just one does-it-all executable and nobody will care about the downsides, but let's not lose track of them.
        • eptcyka 2 days ago |
          Having the browser get the right URL for the download button is by far the most trivial issue when it comes to multi-platform support.
      • hyperion2010 2 days ago |
        > Most people

        Well, I'm used to not being most people, but I'd much rather be able to produce a single identical binary for my users that works everywhere than go through the platform-specific nonsense I have to deal with right now. Having to maintain different special build processes for different platforms is a stupid waste of time.

        Frankly this is how it always should have worked except for the monopolistic behavior of various platforms in the past.

        • bjourne 2 days ago |
          The binary is only one part of the puzzle (and largely solved by WSL). Installation/uninstallation and desktop integration is just as much of a hassle.
          • cycomanic 2 days ago |
            I don't get the argument: if the binary part is solved by WSL, it isn't useless, is it? Otherwise why would MS invest so many resources into it?
            • bjourne 2 days ago |
              WSL can do a lot more than just run Linux binaries.
          • int_19h 2 days ago |
            I don't think you can reasonably assume that people have WSL set up on Windows for the purposes of shipping desktop software. Nor does it cover Mac.
            • duped 2 days ago |
              Well, there's orb on mac which is in some ways better/faster than WSL
        • yencabulator a day ago |
          > that works everywhere

          This whole discussion is about pointing out that it kinda doesn't.

      • cyberax 2 days ago |
        > With that in mind that respect, languages like go letting you cross compile for all your targets (provided you avoid CGO)

        Even that is not a big deal in most of cases, if you use zig to wrap CC: https://dev.to/kristoff/zig-makes-go-cross-compilation-just-...

        • jrockway 2 days ago |
          Does this still work? The article is from 2021 but when I last tried it this year, Go appeared to (newly) depend on headers that Zig doesn't need and thus it doesn't work. The Github issue was something like "yeah, we don't need those, so I guess Go doesn't work anymore". Without the actual error message I can't find the issue, however, so maybe I imagined this.
      • karel-3d 2 days ago |
        what worked for me for cross-compiling go:

        https://github.com/elastic/golang-crossbuild docker images for macos and windows (not linux though)

        https://github.com/rust-cross/rust-musl-cross docker images for linux (yes, it's possible to use the same one for go, you just need to put go binary there) so it's static with musl libc and sure to not have some weird issues with old/new libc on old/new/whatever linuxes

        (it might be over the top and the zig thing might effectively do the same thing, but I never got it to compile)

      oh, and the apple-codesign Rust crate to get through the signing nonsense

    • jkachmar 2 days ago |
      reposting my comment from another time this discussion came up:

      "Cosmopolitan has basically always felt like the interesting sort of technical loophole that makes for a fun blog post which is almost guaranteed to make it to the front page of HN (or similar) purely based in ingenuity & dedication to the bit.

      as a piece of foundational technology, in the way that `libc` necessarily is, it seems primarily useful for fun toys and small personal projects.

      with that context, it always feels a little strange to see it presented as a serious alternative to something like `glibc`, `musl`, or `msvcrt`; it’s a very cute hack, but if i were to find it in something i seriously depend on i think i’d be a little taken aback."

      • wmf 2 days ago |
        But what's actually bad about it?
      • sublimefire 2 days ago |
        The problem with this take is that it is not grounded in the facts that should determine whether the thing is good or bad.

        Logically, having one thing instead of multiple makes sense, as it simplifies distribution and the bytes stored/transmitted. I think the issue here is that it is not time-tested. I believe multiple people in various places around the globe are thinking about it and will try to test it. With time this will either bubble up or go stale.

    • blenderob 2 days ago |
      > Is there a catch?

      I am only speaking for myself here. While cosmo and ape do seem very clever, I do not need this type of clever stuff in my work if the ordinary stuff already works fine.

      Like for example if I can already cross-compile my project to other OSes and platforms or if I've got the infra to build my project for other OSes and platforms, I've no reason to look for a solution that lets me build one binary that works everywhere.

      There's also the thing that ape uses clever hacks to be able to run on multiple OSes. What if those hacks break someday due to how executable formats evolve? What if nobody has the time to patch APE to make it compatible with those changes?

      But my boring tools like gcc, clang, go, rust, etc. will continue to get updated and they will continue to work with evolving OSes. So I just tend to stick with the boring. That's why I don't bother with the clever because the boring just works for me.

      • sublimefire 2 days ago |
        If the executable format changes then it would affect more than just ape/cosmopolitan. All of the old binaries would just stop working.
        • blenderob 2 days ago |
          I am not worried about the scenario where executable format changes break old binaries. That is never going to happen. No OS worth its name is going to do that.

          I am worried about the scenario where the executable format changes just enough to break APE binaries. Like I said, APE uses very clever hacks to make itself work on multiple OSes. It is conceivable that executable formats change just enough that it breaks the clever hacks in APE binaries without breaking old ordinary binaries.

    • tangus 2 days ago |
      Last year I needed to make a small webapp to be hosted on a Windows server, and I thought RedBean would be ideal for it. Unfortunately it was too buggy (at least on Windows).

      I don't know whether RedBean is production-ready now, but a year and a half ago, that was the catch.

      • jart 2 days ago |
        Give the latest nightly build a try: https://cosmo.zip/pub/cosmos/bin/ Windows has been a long hard march, but we've recently hit near feature completion. As of last month, the final major missing piece of the puzzle was implemented, which is the ability to send UNIX signals between processes. Cosmopolitan does such a good job on Windows now that it's not only been sturdy for redbean, but much more mature and complicated software as well, like Emacs, GNU Make, clang, Qt, and more.
    • dundarious 2 days ago |
      Mozilla llamafile uses it. Bundles model weights and an executable into a single file, that can be run from any cosmo/ape platform, and spawns a redbean http server for you to interact with the LLM. Can also run it without the integrated weights, and read weights from the filesystem. It's the easiest "get up and go" for local LLMs you could possibly create.
      • cowsandmilk 2 days ago |
        Mozilla’s llamafile is primarily developed by jart. I wouldn’t view it as anyone else actually using cosmo/ape
        • dundarious 2 days ago |
          Yes, I should have mentioned that, even though I originally interpreted the OP as asking about other projects using it, not necessarily other people. But they almost certainly meant other people, not other projects, so my point is kind of moot.

          Although in fairness, the Mozilla org choosing Justine/cosmo is basically the kind of sign-off from a 3rd party that OP was looking for.

    • sfn42 2 days ago |
      Most people aren't writing C as far as I know. We use Java, C#, Go, Python etc, some lunatics even use Node.

      We generally don't care if some mutex is 3x faster than some other mutex. Most of the time if I'm even using a mutex which is rare in itself, the performance of the mutex is generally the least of my concerns.

      I'm sure it matters to someone, but most people couldn't give two shits if they know what they're doing. We're not writing code where it's going to make a noticeable difference. There are thousands of things in our code we could optimize that would make a greater impact than a faster mutex, but we're not looking at those either because it's fast enough the way it is.

      • secondcoming 2 days ago |
        What a pointless contribution
        • yas_hmaheshwari 2 days ago |
          I came here to write this. In such a good discussion thread, this one sticks out like a sore thumb.
    • int_19h 2 days ago |
      Mozilla has a project called Llamafile (https://github.com/Mozilla-Ocho/llamafile) that's based on Cosmopolitan libc. And they do regularly publish popular models repackaged in that format on Hugging Face: https://huggingface.co/models?search=llamafile.

      Whether that in turn has any practical use beyond quickly trying out small models is another question.

    • gavindean90 2 days ago |
      Tbh I don’t know that there is a catch except that there is one core person behind whole project. I like that there are fresh ideas in the C space.
    • fragmede 2 days ago |
      > Is the whole thing a tom7-esque joke/troll that I don’t get because I’m not as deep into compilers/runtimes? Or are these really just ingenious tools that haven’t caught on yet?

      If I went up to you in 2008 and said, hey, let's build a database that doesn't do schemas, isn't relational, doesn't do SQL, isn't ACID compliant, doesn't do joins, has transactions as an afterthought, and only does indexing sometimes, you'd think I was trolling you. And then in 2009, MongoDB came out and caught on in various places. So only time will tell if these are ingenious tools that haven't caught on yet. There's definitely a good amount of genius behind it, though only time will tell if it's remembered in the vein of tom7's Harder Drives or if it sees wider production use. I'll say that if golang supported it as a platform, I'd switch my build pipelines at work over to it, since it makes their output less complicated to manage as there's only a single binary to deal with instead of 3-4.

  • 1st1 2 days ago |
    > I've even had the opportunity to make upstream contributions. For example, I found and fixed a bug in his mutex unlock function that had gone undiscovered for years.

    I see a stream of improvements to the vendored in nsync inside the cosmopolitan project [1]. Are you planning on upstreaming most of those too?

    A separate question -- is using the upstream nsync as safe as using your fork?

    [1] https://github.com/jart/cosmopolitan/commits/master/third_pa...

    • jart 2 days ago |
      If Burrows wants my C11 atomics refactoring then he shall have it. Beyond that, my changes mostly concern libc integration, systems integration, and portability. Our projects have different goals in those areas, so I'm not sure he needs them.
  • yshui 2 days ago |
    I had the pleasure of reverse-engineering Win32 SRWLOCKs, and based on the author's description of nsync, it is very close to how SRWLOCK works internally. I'm kind of surprised how much faster nsync is compared to SRWLOCK.
    • TinkersW 2 days ago |
      The post doesn't include any benchmarks for the uncontended case, where I've found SRWLock is also very fast; I'm curious why this wasn't included.

      At least for what I use locks for, the uncontended case is like 10000x more common. I actually don't think I have any super-heavy contention such as the case shown in the post, as this is simply something to be avoided -- no matter how good the mutex, this won't play well with the memory system.

  • betimsl 2 days ago |
    And here I thought musl was better than libc sigh
    • favorited 2 days ago |
      musl is a libc, and while it is superior in some ways, it is inferior in others. If you want a statically linked libc, or a permissively licensed libc, musl is a fantastic choice. If you want a libc with the fastest malloc implementation, you'll probably want to look elsewhere.
      • EasyMark 2 days ago |
        Can’t you sub in your own malloc like the one in the article or jemalloc if you don’t like the performance of the original in your app?
        • favorited 2 days ago |
          Sure, and lots of projects do. I was just trying to make the point that there isn't one "good" libc that is universally better – each project has costs and benefits that are tied to its priorities.
  • akira2501 2 days ago |
    Production isn't about speed, efficiency, or obviously "clever hacks."

    If I have to sacrifice 50% of my efficiency to ensure that I never get called on Sunday at 3am to fix a broken system, no kidding, I'll make that trade every time.

    Production is about _reliability_. And writing reliable code is 10x harder than writing "fast" code.

  • kgeist 2 days ago |
    >Contention is where mutex implementations show their inequality. Mark was so impressed by Microsoft's SRWLOCK that he went on to recommend Linux and FreeBSD users consider targeting Windows if mutex contention is an issue.

    Interesting, I remember reading a detailed article where they found that there's a lot of severe contention in the Windows kernel, compared to Linux. I think it was when they were trying to parallelize Chrome builds?

    • TinkersW 2 days ago |
      Maybe they weren't using SRWLock; at least last time I checked, std::mutex didn't use it in MS STL (they were stuck with CRITICAL_SECTION because of binary compatibility).
      • markdoubleyou 2 days ago |
        I'm the Mark who's referenced there. When I did that original benchmark I discovered that the underlying mutex used by MSVCRT did change between versions. For example, in Visual C++ 2013, they used the Windows Concurrency Runtime, which was awful under heavy contention. Newer MSVCRT versions use SRWLOCK.

        (And I wouldn't characterize myself as being overly impressed... for my particular scenario I wrote, "if you have a poorly written app that's bottlenecked on a lock, then consider targeting Windows to make the best of a bad situation." A better approach, of course, would be to just improve your code!)

  • dumdood 2 days ago |
    Threads and mutexes are the most complicated things in computer science. I am always skeptical of new implementations until they've been used for several years at scale. Bugs in these threading mechanisms often elude even the most intense scrutiny. When Java hit the scene in the mid-'90s, it exposed all manner of thread and mutex bugs in Solaris. I don't want the fastest mutex implementation - I want a reliable one.
    • xxs 2 days ago |
      Mutexes are far from the most 'complicated'. There are really not that many ways to implement them (efficiently). In most cases they are best avoided, esp. on the read paths.
  • flymasterv 2 days ago |
    Mutices.
  • Akhilmurali 2 days ago |
    Just curious: how is the wall time less than the sum of system time and user time?
    • yencabulator a day ago |
      More than one core. User and system time are summed across all CPUs, so e.g. four cores each busy for one second accrue four seconds of CPU time in one second of wall time.
  • scudsworth 2 days ago |
    Is it weird to compare a 96- and a 24-core Threadripper? I didn't see anything about spawning more threads to make up for the additional cores or anything like that...
  • SleepyMyroslav 2 days ago |
    Completely tangential:

    As a gamedev, I came to love slow mutexes that do a lot of debug things in all 'developer' builds: they have debug names/IDs, track owners, report time spent in contention to the profiler, report ownership changes to the profiler...

    People tend to structure concurrency differently, and games have arrived at some patterns to avoid locks. But those are hard to use and require the programmer to restructure things. Most code starts as 'let's slap a lock here and try to pass the milestone'. Even fast locks will be unpredictably slow and will destroy realtime guarantees, if there were any. They can be fast on average, but the tail end is never going to go away. I don't want to be the guy who comes back to it chasing 'oh, our game has hitches', but I am usually that guy.

    Use slow locks, people. The ones that show big red in the profiler. Refactor them out when you see them being hit.

    I know it's a tall order. I can count the people on a AAA production who know how to use a profiler on my fingers. And it's always like that, no matter how many productions I see :)

    ps. sorry for a rant. Please, continue good research into fast concurrency primitives and algorithms.

    • simonask 2 days ago |
      Even more tangentially: This is one of the reasons I'm having a great time developing a game in Rust.

      You never want lock contention in a game if at all avoidable, and in a lot of cases taking a lock is provably unnecessary. For example: each frame is divided into phases, and mutable access to some shared resource only needs to happen in one specific phase (like the `update()` function before `render()`, or when hot-reloading assets). With scoped threads and Rust's borrowing rules, you can structure things to not even need a mutex, and be completely certain that you will receive a stern compiler error if the code changes in a way such that you do.

      When possible, I'd always love to take a compiler error over a spike in the profiler.

    • loeg a day ago |
      Totally agree. Debugging features -- e.g. deadlock detection / introspection -- easily pay for themselves. If you're actually acquiring locks so frequently that they matter, you should revisit your design. Sharing mutable state between threads should be avoided.
  • FooBarWidget 2 days ago |
    I don't get it. Aren't most of these nsync tricks also implemented in glibc mutexes? The only thing in that list that's new to me is long waiting. But even then I don't get it: futexes are already supposed to handle contended waiting.
  • mareko 2 days ago |
    If anyone's interested in this general subject matter, a while back I did some academic research on highly scalable locks where we came up with some very high performance reader-writer locks:

    https://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.p...

  • Hendrikto 2 days ago |
    > It's still a new C library and it's a little rough around the edges. But it's getting so good, so fast, that I'm starting to view not using it in production as an abandonment of professional responsibility.

    A bit over the top…

  • c1b 2 days ago |
    It's simple -- you see a justine.lol URL, you click immediately.
  • deviantbit 2 days ago |
    This code benchmarks mutex contention, not mutex lock performance. If you're locking like this, you should reevaluate your code. Each thread locks and unlocks the mutex for every increment of g_chores. This creates an overhead of acquiring and releasing the mutex frequently (100,000 times per thread). This overhead masks the real performance differences between locking mechanisms because the benchmark is dominated by lock contention rather than actual work. Benchmarks such as this one are useless.
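
    The pattern in question looks roughly like this (a reconstruction from the description above, not the article's exact code):

      #include <pthread.h>

      #define ITERATIONS 100000
      static int g_chores;
      static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

      // every increment takes and drops the lock, so run time is dominated
      // by lock hand-off between threads rather than by the "work" itself
      static void *worker(void *arg) {
          (void)arg;
          for (int i = 0; i < ITERATIONS; i++) {
              pthread_mutex_lock(&g_lock);
              g_chores++;
              pthread_mutex_unlock(&g_lock);
          }
          return NULL;
      }
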
  • raggi a day ago |
    I went in half expecting to see another spin-heavy solution, so I'm glad to find futexes used properly, cache-line sharding, and so on. Yay, practically scalable mutexes rather than microbenchmark-winning hacks!