Why is Apple Rosetta 2 fast? (2022)
172 points by fanf2 5 days ago | 76 comments
  • Syonyk 5 days ago |
    Post got the big one: Total Store Ordering (TSO).

    The rest are all techniques in reasonably common use, but unless you have hardware support for x86's strong memory ordering, you cannot get very good x86-on-ARM performance, because it's by no means clear from inspecting existing code when strong memory ordering matters and when it doesn't - so you have to liberally sprinkle memory barriers around, which really kills performance.

    The huge and fast L1I/L1D cache doesn't hurt things either... emulation tends to be cache-intensive.
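
    To make the barrier point concrete, here's a minimal C++ sketch (not what Rosetta actually emits, just the shape of the problem): under x86/TSO two plain stores are observed in program order for free, but a translator targeting weakly ordered ARM that can't prove the stores are thread-private has to order them explicitly, e.g. with a fence that becomes a DMB:

      #include <atomic>
      #include <cstdint>

      uint64_t payload;                  // ordinary data the guest writes
      std::atomic<uint64_t> ready{0};    // flag another thread polls

      // x86 guest semantics: two plain MOVs; TSO keeps them in program order.
      void producer_x86_semantics() {
          payload = 42;
          ready.store(1, std::memory_order_relaxed);
      }

      // Conservative translation for weakly ordered ARM without a TSO mode:
      // every such store pair needs explicit ordering (a DMB on AArch64).
      void producer_translated_weak() {
          payload = 42;
          std::atomic_thread_fence(std::memory_order_release);
          ready.store(1, std::memory_order_relaxed);
      }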

    • jsheard 5 days ago |
      It's surprising that (AFAIK) Qualcomm didn't implement TSO in the chips they made for the recent-ish Windows ARM machines. If anything they need fast x86 emulation even more than Apple does since Windows has a much longer tail of software support than macOS, there's going to be important Windows apps that stubbornly refuse to support native ARM basically forever.
      • scottlamb 5 days ago |
        Does Windows's translation take advantage of those where they exist? E.g. if I launch an aarch64 Windows VM on my M2, does it use the M2's support for TSO when running x86_64 .exes or does it insert these memory barriers?

        If not, it makes sense that Qualcomm didn't bother adding them.

        • Syonyk 5 days ago |
          I would expect it to not use TSO, because the toggle for it isn't, to the best of my knowledge, a general userspace toggle. It's something the kernel has to toggle, and so a VM may or may not (probably does not) even have access to the SCRs (system control registers) to change it.
          • zeusk 5 days ago |
            TSO toggle on Apple Silicon is a user-space accessible/writable register.

            It is used when you install rosetta2 for Linux VMs

            https://developer.apple.com/documentation/virtualization/run...

            • Syonyk 5 days ago |
              Are you sure it's userspace accessible?

              Based on https://github.com/saagarjha/TSOEnabler/blob/master/TSOEnabl..., it's a field in ACTLR_EL1, which is explicitly (per the ARMv8 spec, at least...) not accessible to userspace (EL0) execution.

              There may be some kernel interface to allow userspace to toggle that, but that's not the same as being a userspace-accessible SCR (and I also wouldn't expect it to be passed through to a VM - you'd likely need a hypercall to toggle it, unless the hypervisor emulated that, though admittedly I'm not quite as deep in the weeds on ARMv8 virtualization as I would prefer at the moment).
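
              For reference, a rough C++/inline-asm sketch of what the kernel-side toggle looks like (EL1 only - this will fault at EL0; the bit position is an assumption taken from the TSOEnabler source, since it isn't architecturally defined):

                #include <cstdint>

                // Assumed Apple-specific TSO-enable bit in ACTLR_EL1 (per TSOEnabler).
                constexpr uint64_t ACTLR_EL1_TSO = 1ULL << 1;

                static inline uint64_t read_actlr_el1() {
                    uint64_t v;
                    asm volatile("mrs %0, actlr_el1" : "=r"(v));
                    return v;
                }

                static inline void write_actlr_el1(uint64_t v) {
                    asm volatile("msr actlr_el1, %0" :: "r"(v));
                    asm volatile("isb");  // make the change visible to later instructions
                }

                // Must run at EL1 (i.e. in the kernel), per CPU.
                void set_tso(bool on) {
                    uint64_t v = read_actlr_el1();
                    write_actlr_el1(on ? (v | ACTLR_EL1_TSO) : (v & ~ACTLR_EL1_TSO));
                }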

              • zeusk 5 days ago |
                Hmm, you're right - maybe my memory serves me incorrectly, but yeah, it seems it is privileged access, though the interface is open to all processes to toggle the bit.
            • shadowfacts 5 days ago |
              It is not directly accessible from user-space. Making it so requires kernel support. Apple published a set of patches for doing this on Linux: https://developer.apple.com/documentation/virtualization/acc...

              Without that kernel support, all processes in the VM (not just Rosetta-translated ones) are opted-in to TSO:

              > Without selective enablement, the system opts all processes into this memory mode [TSO], which degrades performance for native ARM processes that don’t need it.

              • mrpippy 4 days ago |
                Before Sequoia, a Linux VM using Rosetta would have TSO enabled all the time.

                With Sequoia, TSO is not enabled for Linux VMs, and that kernel patch (posted in the last few weeks) is required for Rosetta to be able to enable TSO for itself. If the kernel patch isn't present, Rosetta has a non-TSO fallback mode.
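
                A sketch of what that selective opt-in looks like from inside the guest, assuming the prctl-style interface from the posted patches; the constant names and values below are placeholders, not the real ABI (those come from the patched kernel's headers):

                  #include <sys/prctl.h>
                  #include <cstdio>

                  // Placeholder definitions for illustration only.
                  #ifndef PR_SET_MEM_MODEL
                  #define PR_SET_MEM_MODEL     1000  /* hypothetical option number */
                  #define PR_SET_MEM_MODEL_TSO 1     /* hypothetical argument */
                  #endif

                  int main() {
                      // Ask the kernel to run this process with TSO semantics.
                      // On an unpatched kernel this fails, and the caller (like
                      // Rosetta) falls back to a non-TSO mode.
                      if (prctl(PR_SET_MEM_MODEL, PR_SET_MEM_MODEL_TSO, 0, 0, 0) != 0) {
                          perror("TSO opt-in unavailable");
                          return 1;
                      }
                      std::puts("running with TSO enabled");
                      return 0;
                  }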

          • saagarjha 4 days ago |
            This is exposed to guest kernels as of Sequoia (and maybe earlier?).
        • zeusk 5 days ago |
          The OS can use what the hardware supports. macOS does because SEG is a tightly integrated group at Apple, whereas Microsoft keeps hardware vendors at arm's length (pun unintended). There is roadmap sharing and there are planning events through leadership, but it is not as cohesive as it is at Apple.
        • saagarjha 4 days ago |
          No because Windows is not aware of how Apple does it. There exist Linux patches documenting how to do so, though.
          • scottlamb 4 days ago |
            The article says the following:

            > As far as I know this is not part of the ARM standard, but it also isn’t Apple specific: Nvidia Denver/Carmel and Fujitsu A64fx are other 64-bit ARM processors that also implement TSO (thanks to marcan for these details).

            I'm not sure how to interpret that—do these other processors have distinct/proprietary TSO extensions? Are they referring to a single published (optional) extension that all three implement? The linked tweet has been deleted so no clues there, and I stopped digging.

            • saagarjha 4 days ago |
              Those are just TSO all the time I think. So they are stronger than the ARM requirement
      • deaddodo 5 days ago |
        Microsoft's AOT+JIT techniques still pull off impressive performance (90+% in almost every case, 96-99% in the majority).

        But yes, if they were actually serious about Windows on ARM, they would have implemented TSO in their "custom" Qualcomm SQ1/SQ2 chips.

        • wtallis 4 days ago |
          Last time I checked, the default behavior for Microsoft's translation was to pretend that the hardware is doing TSO, and hope it works out. So that should obviously be fast, but occasionally wrong.
          • saagarjha 4 days ago |
            They're a decent bit smarter than that but yes their emulation is not quite correct.
        • 486sx33 4 days ago |
          It's funny: Microsoft's ARM emulation is good because of Qualcomm, or rather in spite of Qualcomm's limitations.

          If Qualcomm had done better, then the software wouldn't have to be so good, and they'd likely have maintained more market share.

          Instead, Microsoft had to make their x86-on-ARM emulation good enough to work on Qualcomm's crap, and now it works really nicely on Apple ARM.

      • Syonyk 5 days ago |
        My guess is that the sort of "legacy x86-forever" apps for Windows don't really need much in the way of performance. Think your classic Visual Basic 6 sort of thing that a business relies on for decades.

        I'm also fairly certain that the TSO changes to the memory system are non-trivial, and it's possible that Qualcomm doesn't see it as a value-add in their chips - and they're probably right. Windows machines are such a hot mess that outside a relatively small group of users (who probably run Linux anyway, so aren't anyone's target market), nobody would know or care what TSO is. If it adds cost and power and doesn't matter, why bother?

        • jsheard 5 days ago |
          > My guess is that the sort of "legacy x86-forever" apps for Windows don't really need much in the way of performance.

          Games are a pretty notable exception that demand high performance and for the most part will be stuck on x86 forever. Brand new games might start shipping native ARM Windows binaries if the platform gets enough momentum, but games have very limited support lifecycles so it's unlikely that many released before that point will ever be updated to ARM native.

          • doctorpangloss 4 days ago |
            > Brand new games might start shipping native ARM Windows binaries if the platform gets enough momentum, but games have very limited support lifecycles so it's unlikely that many released before that point will ever be updated to ARM native.

            Unity supports Windows ARM. Unreal: probably never. IMO, the PC gaming market is so fragmented that, short of Microsoft funding games for the platform at the multi-million pre-sale scale that EGS did, games on ARM will only happen by complete accident, not because it makes sense.

        • tiagod 4 days ago |
          > My guess is that the sort of "legacy x86-forever" apps for Windows don't really need much in the way of performance. Think your classic Visual Basic 6 sort of thing that a business relies on for decades.

          In my experience, there's a lot of that kind of software around that was initially designed for a much simpler use case and has had decades of badly coded features bolted on, with questionable algorithmic choices. It can be unreasonably slow on modern hardware.

          Old government database sites are the worst examples in my experience. Clearly tested with a few hundred records, but 15 years later there are a few million, and nobody bothered to create a bunch of indexes, so searches take a couple of minutes. I guess this way they can just charge to upgrade the hardware once in a while instead.

        • adrian_b 4 days ago |
          TSO only matters for programs that are internally multithreaded or which run multiple processes that have shared memory segments.

          Most legacy programs like Visual Basic 6 are not of this kind.

          For any other kinds of applications, the operating system handles the concurrency and it does this in the correct way for the native platform.

          Nevertheless, the few programs for which TSO matters are also those where performance must have mattered if the developers bothered to implement concurrent code. Therefore low performance of the emulated application would be noticeable.
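
          For the shared-memory-segment case, a small sketch of the pattern that actually cares about ordering (error handling and object construction elided; writer and reader are separate processes mapping the same segment):

            #include <fcntl.h>
            #include <sys/mman.h>
            #include <unistd.h>
            #include <atomic>
            #include <cstdint>

            struct Shared {
                uint64_t data;
                std::atomic<uint32_t> ready;
            };

            Shared* map_segment() {
                int fd = shm_open("/tso-demo", O_CREAT | O_RDWR, 0600);
                ftruncate(fd, sizeof(Shared));
                void* p = mmap(nullptr, sizeof(Shared), PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                return static_cast<Shared*>(p);
            }

            // Writer process: publish data, then the flag. On x86 two plain
            // stores already appear in this order; on weakly ordered ARM the
            // release/acquire pairing (or an emulator-inserted barrier) is needed.
            void writer(Shared* s) {
                s->data = 123;
                s->ready.store(1, std::memory_order_release);
            }

            uint64_t reader(Shared* s) {
                while (s->ready.load(std::memory_order_acquire) == 0) { /* spin */ }
                return s->data;
            }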

      • p_l 5 days ago |
        Qualcomm has been phoning it in, in various forms, for over a decade, including forcing MS to ship machines that don't really pass Windows requirements (like broken firmware support). Maybe it got fixed with the recent Snapdragon X, but I won't hold my breath.

        We're talking about a company that, if certain personal sources are to be believed, started the Snapdragon brand by deciding to cheap out on memory bandwidth despite feedback that increasing it was critical, leaving the client to find out too late in the integration stage.

        Deciding that they make better money by not spending on implementing TSO, or not spending transistors on bigger caches, and getting more volume at lower cost, is perfectly normal.

      • mdasen 5 days ago |
        It's definitely surprising that Qualcomm didn't. Not only does Windows have a longer tail of software to support, but given that the vast majority of Windows machines will continue to be x86-64, there's little incentive to do work to support ARM.

        With the Mac, Apple told everyone "we're moving to ARM and that's final." With Windows, Microsoft is saying, "these ARM chips could be cool, what do you think?" On the Mac, you either got on board or were left behind. Users knew that the future was ARM and bought machines even if there might be some short-term growing pains. Developers knew that the future was ARM and worked hard to support it.

        But with Windows, there isn't a huge incentive for users to switch to ARM and there isn't an incentive for developers to encourage it. You can say there's some incentive if the ARM chips are better. While Qualcomm's chips are good, the benchmarks aren't really ahead of Intel/AMD and they aren't the power-sipping processors that Apple is putting out.

        If Apple hadn't implemented TSO, Mac users/developers would still switch because Apple told them to. Qualcomm has to convince users that their chips are worth the short-term pain - and that users shouldn't wait a few years to make the switch when the ecosystem is more mature. That's a much bigger hill to climb.

        Still, for Qualcomm, they might not even care about losing a little money for 5-10 years if it means they become one of the largest desktop processor vendors for the following 20+ years. As long as they can keep Microsoft's interest in ARM as a platform, they can bide their time.

        • bee_rider 4 days ago |
          I wonder if it's possible that Qualcomm doesn't super care about the long tail of software? Like maybe MS has some stats indicating that a very large percentage of the software that they think will be used on these devices is first party, or stuff that can reasonably be expected to be compiled for ARM.

          How does the Windows app store work, anyway - can they guarantee that all the stuff there gets compiled for ARM?

          Anyway, it is Windows not MacOS. The users expect some rough edges and poor craftsmanship, right?

          • RockRobotRock 4 days ago |
            I switched completely to macOS and trust me there are plenty of rough edges too. They're just in different places.
          • kalleboo 4 days ago |
            The Qualcomm chips come from their acquisition of Nuvia, who were originally designing them as server chips, where the workload would presumably be Linux stuff compiled for the right arch. They probably didn't have time to redesign the original server-oriented part to add TSO.
          • rodgerd 4 days ago |
            > I wonder if possible Qualcomm doesn’t super care about the long tail of software?

            Qualcomm's success is based on its patent portfolio, and how well it uses it, more than on any other single factor. It doesn't really have to compete on quality, and their support has long been terrible - they're one of the main drivers of Android's poor reputation for hardware end-of-life. It doesn't matter though, because they have no meaningful competition in many areas.

            • bigfatkitten 5 hours ago |
              They are also the main reason ARM Chromebooks are a relatively recent development. Google wanted 5-10 years of support, Qualcomm preferred 5-10 minutes.
        • xmodem 4 days ago |
          > With the Mac, Apple told everyone "we're moving to ARM and that's final."

          In ~mid 2020, when Macs were all but confirmed to be moving to Apple-designed chips, but before we had any software details, some commentators speculated that Apple wouldn't build a compatibility layer at all this time around.

      • saagarjha 4 days ago |
        You can use RCpc atomics which are part of the standard architecture
      • dundarious 4 days ago |
        On a first order analysis, Qualcomm doesn't want good x64 support, because good x64 support furthers the lifetime of x64, and delays the "transition" to ARM. In the final analysis, I doubt that is an economically rational strategy, because even if there is to be a transition away from x64, you need a good legacy and migration story. And I doubt such a transition will happen in the next 10 years, and certainly not spurred by anything in Microsoft land.

        So maybe it's rational after all, because they know these Windows ARM products will never succeed, so they're just saving themselves the cost/effort of good support.

        • wolpoli 4 days ago |
          > On a first order analysis, Qualcomm doesn't want good x64 support, because good x64 support furthers the lifetime of x64, and delays the "transition" to ARM.

          The logical thing for Qualcomm to do, given their current market share, is to implement TSO now, then after they get momentum, create high-end/low-end tiers and disable TSO for the low-end tier to force vendors to target both ARM and x86.

          What Qualcomm is doing now makes them look like they just don't care.

          • Someone 4 days ago |
            > create a high-end/low-end tier, and disable TSO for the low-end tier

            Wouldn’t that make the low-end tier run faster than the high-end tier, or force them to leave some performance on the table there?

            Also, would a per-process flag that controls TSO be possible? Ignoring whether it’s easy to do in the hardware, the only problem I can think of with that is that the OS would have to set that on processes when they start using shared memory, or forbid using shared memory by processes that do not have it set.

        • rowanG077 4 days ago |
          I would disagree that this is first order. First order is making the transition as smooth as possible, which obviously means having a very good translation layer. Only then should you even think about competing on compatibility.
          • ddingus 4 days ago |
            Does this not depend on how one sees the ARM transition playing out?

            It is at least conceivable, and IMHO plausible, for Qualcomm to see Apple, phones on ARM, and aging demographics all pointing to a certain ARM transition.

            • rowanG077 4 days ago |
              I wouldn't be so sure. Windows on ARM has existed for more than a decade with almost zero adoption. Phones, both Apple and Android, have been ARM since forever. The only new factor is that Apple has moved their Macs to ARM. To me this means it would be pretty stupid for them to just throw up their hands and say "they will come", because it didn't happen in the decade prior.
              • ddingus 4 days ago |
                Maybe. Just trying to see it from other points of view.

                A decade ago, Apple was on Intel and Microsoft had not yet advanced many of the plans in play today. Depending on the smoke they are blowing people's way, one could get the impression ARM is a sure thing.

                Frankly, I have no desire to run Windows on ARM.

                Linux? Yep.

                And I am already on a Mac M1.

                I sort of hope it fails personally. I want to see the Intel PC continue in some basic form.

      • Const-me 4 days ago |
        > Qualcomm didn't implement TSO in the chips they made

        I’m not sure they can do that.

        Under its Technology License Agreement, Qualcomm can build chips using ARM-designed CPU cores. Specifically, the Qualcomm SQ1 uses ARM Cortex-A76/A55 for the fast/slow CPU cores.

        I don't think using ARM-designed cores is enough to implement TSO; you need custom ARM cores instead of the stock ones. To design custom ARM cores, Qualcomm needs an architecture license from ARM, which has recently been cancelled.

        • jsheard 4 days ago |
          SQ1/SQ2 was their previous attempt, the recently released Snapdragon Elite chips use the fully custom Oryon ARM cores they got from the Nuvia acquisition. That acquisition is what caused the current licensing drama with ARM, but for the time being they do have custom cores.
          • Const-me 4 days ago |
            > custom Oryon ARM cores they got from the Nuvia acquisition

            Nuvia was developing server CPUs. By now, I believe backward compatibility with x86 and AMD64 is rather unimportant for servers. Hosting providers and public clouds have been offering ARM64 Linux servers for quite a few years now, all important server-running software already has native ARM64 builds.

    • ant6n 5 days ago |
      Perhaps you could keep each process on one core, but that would kill multi-threaded performance.
    • saagarjha 4 days ago |
      TSO is nice to have but it's definitely not necessary. Rosetta doesn't even require TSO on Linux anymore by default. It performs fine for typical applications.
    • vlovich123 4 days ago |
      Is TSO something other than doing atomics with seq_cst?
      • j16sdiz 4 days ago |
        TSO is what x86 does when you are _not_ using atomics.
        • adrian_b 4 days ago |
          True, and it is a little more relaxed than sequential consistency.

          For simple loads and stores, the x86 CPUs do not reorder the loads between themselves or the stores between themselves. Also the stores are not done before previous loads.

          Only some special kinds of stores can be reordered, i.e. those caused by string instructions or the stores of vector registers that are marked as NT (non-temporal).

          So x86 does not need release stores, any simple store is suitable for this. Also store barriers are not normally needed. Acquire fences a.k.a. acquire barriers are sometimes needed, but much less often than on CPUs with weaker ordering for the memory accesses (for acquire fences both x86 and Arm Aarch64 have confusing mnemonics, i.e. LFENCE on x86 and DMB/DSB of the LD kind on Aarch64; in both cases these instructions are not load fences as suggested by the mnemonics, but acquire fences).

          When converting x86 code to AArch64 code, there are many cases where simple stores must be replaced with release stores (a.k.a. Store-Release instructions in the Arm documentation), and there are many places where acquire barriers must be inserted or, less frequently, store barriers (for non-optimally written concurrent code it may also be necessary to replace some simple loads with the Load-Acquire instructions of AArch64).
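
          To see the asymmetry in a compact form, here is the same C++ source with the typical instructions each target needs noted in comments (exact codegen varies by compiler and version):

            #include <atomic>

            std::atomic<int> flag{0};

            void publish() {
                // x86-64:  plain  mov [flag], 1   (release semantics come free under TSO)
                // AArch64: stlr   (Store-Release) instead of a plain  str
                flag.store(1, std::memory_order_release);
            }

            int consume() {
                // x86-64:  plain  mov eax, [flag]
                // AArch64: ldar   (Load-Acquire), or a plain  ldr  plus  dmb ishld
                return flag.load(std::memory_order_acquire);
            }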

    • wbl 4 days ago |
      Barrier injection isn't the issue as much as the barriers becoming expensive. There's no reason a CPU can't use the techniques of TSO support to support lesser barriers just as cheaply.
    • commandersaki 4 days ago |
      I get pretty close to native performance with Rosetta 2 for Linux and I don't believe TSO is being used or taken advantage of. I wonder how important it really is.
  • brycewray 5 days ago |
    (2022)
  • leshokunin 5 days ago |
    Super interesting. Putting my PM hat on, I wonder: how many x86 apps on Apple still benefit from this much performance? What's the coverage? The switch to M1 happened 4 years ago, so the software was designed for hardware nearly half a decade old.

    Excellent engineering and nice that it was built properly. Is this something that Linux / Wine / the Steam compatibility layer already benefit from?

    • spockz 5 days ago |
      I think it is less of a numbers game and more of a guarantee thing. As a user of a new Apple silicon machine you do not have to worry about running x86 software. (Aside from maybe specific audio software and such that are a pain to run on any other hardware and software combination.)

      As such it may very well be a loss leader and that is fine. Probably most development has been done and there is little maintenance needed.

      Also, while most native macOS apps that I encounter have an Apple silicon version now, I still find docker images for amd64 without an arm64 version present. Rosetta2 also helps with these applications.

    • aaomidi 5 days ago |
      Games. So many games.

      Also, x86 containers.

      • jsheard 5 days ago |
        Then again games didn't stop Apple from dropping x86-32 support, which nuked half of the Mac Steam library. It wouldn't be out of character for them to drop x86-64 support and nuke the rest which haven't been updated to native ARM.
        • p_l 5 days ago |
          For games on Intel Macs they had the fallback of Boot Camp, so combined with not really caring about games outside of random bursts, like support for Unity, they were fine telling people to run Windows. (Ironically, the only Mac I owned ran faster under Windows than under macOS...)
        • darknavi 5 days ago |
          Or OpenGL support
          • rdsnsca 4 days ago |
            OpenGL was deprecated, not removed from macOS.
            • adrian_b 4 days ago |
              But Apple has never implemented the final specification of OpenGL.

              So even if they have kept the old OpenGL version that they had, many newer OpenGL-based applications cannot run on MacOS.

              Since OpenGL is no longer evolving, it would not have been a great effort to bring the OpenGL support to the last version, and only then freeze it.

        • astrange 5 days ago |
          Developers had something like 15 years of warning before x86-32 was dropped, which was enough for everyone except Carbon apps and games.

          Btw, Rosetta 2 actually supports x86-32. Which means you can run 32-bit Windows binaries through WINE, just not Mac 32-bit binaries.

          • Gigachad 4 days ago |
            The problem with games is that most of them are completed projects that are no longer actively worked on, unlike other software, which sees never-ending development on a single project.

            So if you kill support for an old game, it will probably never be updated since it's no longer commercially relevant. Publishers are probably almost happy when old games get broken since they can sell you newer ones easier.

          • albumen 4 days ago |
            Re games, I've recently tried running Black Mesa 'directly' using CrossOver on an M2 Max running Sequoia and the GPTK2 beta libraries, vs. running it in a Win11 VM in VMware Fusion 13. Performance of the latter is 2-3x better (70-120 fps). I'll be playing old Steam games inside the Win11 VM going forward.
        • saagarjha 4 days ago |
          Rosetta supports it at least. You can run Linux 32-bit games!
      • redwall_hp 4 days ago |
        VSTs. Just as people use vintage physical synthesizers, old guitars, and recording equipment, people use old software that might not ever be updated.
    • Syonyk 5 days ago |
      "Apple M-series chips emulating x86," in certain benchmarks and behaviors, was right up there with the fastest x86 chips at the time - I'd guess largely in stuff that benefited from the huge L1I/L1D cache (compared to x86).

      I had a M1 Mini for a while, and it played Kerbal Space Program (x86) far better than my previous Intel Mini, which had Intel Integrated Graphics that could barely manage a 4k monitor, much less actual gaming.

      I believe there's a way to use Rosetta with Linux VMs, too (to translate x86 VM applications to ARM and run them natively) - but I no longer have any Macs, so I've not had a chance to play with it.

    • throw16180339 4 days ago |
      There can't be that many other people using it, but I'm using Rosetta2 to run the SML/NJ (https://www.smlnj.org/) implementation of Standard ML. The development team is working on a macOS/ARM64 backend, though.
    • 7k5kyrty45 4 days ago |
      Does the Apple ecosystem have any production software that any enterprise or industry uses? I recall maintaining Windows Server 2000-related production software just last week. Working software can be really, really old.
    • izacus 4 days ago |
      The more important question is how many people buy a MacBook because they know they can run an x86 app if they have to.

      Even if only 0.1% of apps need a feature, the lack of it won't translate into only 0.1% of lost sales. People don't behave like that.

      So the more important question is: how many people moved to ARM because they felt they don't need to worry about compatibility with existing use cases?

  • dhosek 5 days ago |
    I wonder if these lessons might be applied to Wasm runtimes where the Wasm could be JIT compiled into native code. Of course this does raise the possibility of security concerns if the Wasm compilation has some bug, and then of course there’s also the question of whether Wasm’s requirements might mean native compilation doesn’t give much of a performance boost (as seems to be the case with e.g., Java byte code).
  • kccqzy 5 days ago |
    One other thing that is not mentioned is that Apple has an extension to compute rarely used x86 flags such as the parity flag in hardware rather than in software.
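
    For context, without such an extension the translator has to materialize the flag in software whenever it might later be consumed - something like this per flag-producing result (a sketch, not Rosetta's actual code):

      #include <cstdint>

      // x86 PF is set when the low 8 bits of the result contain an even
      // number of 1 bits; computing it in software costs extra instructions
      // after every flag-producing operation.
      inline bool x86_parity_flag(uint64_t result) {
          return (__builtin_popcount(static_cast<unsigned>(result & 0xff)) & 1) == 0;
      }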
    • benchess 5 days ago |
      It’s mentioned
      • kccqzy 4 days ago |
        Ah I see it now. Sorry for the noise.
  • NL807 4 days ago |
    Good article.
  • transpute 4 days ago |
    Standardization of future Arm PCs, https://news.ycombinator.com/item?id=42182442

      The Arm PC Base System Architecture 1.0 (PC-BSA) specifies a standard hardware system architecture for Personal Computers (PCs) that are based on the Arm 64-bit Architecture. PC system software, for example operating systems, hypervisors, and firmware can rely on this standard system architecture. PC-BSA extends the requirements specified in the Arm BSA.
  • emmanueloga_ 4 days ago |
    Tangent: also why orbstack, a Docker replacement on Mac, is fast [1] (I'm not affiliated in any way, just a fan and happy user :-).

    --

    1: https://docs.orbstack.dev/features

    • commandersaki 4 days ago |
      OrbStack doesn't use Rosetta 2. It does support Rosetta 2 for Linux, but it's something you would have to go out of your way to use.
      • emmanueloga_ 3 days ago |
        I confess I don't really know the technical details of how they use Rosetta, but according to their docs [1]:

        "We use Rosetta to emulate x86 programs on Apple Silicon, which is much faster than the commonly-used QEMU."

        What I get is that Rosetta is used when you run something in Docker that uses the x86 architecture (I'm guessing x86_64), which for me is pretty often.

        --

        1: https://docs.orbstack.dev/architecture#low-level-vm-optimiza...