The rest are all techniques in reasonably common use, but unless you have hardware support for x86's strong memory ordering, you cannot get very good x86-on-ARM performance. From inspecting existing code, it's by no means clear when strong memory ordering matters and when it doesn't, so you have to liberally sprinkle memory barriers around, which really kills performance.
The huge and fast L1I/L1D cache doesn't hurt things either... emulation tends to be cache-intensive.
With the Mac, Apple told everyone "we're moving to ARM and that's final." With Windows, Microsoft is saying, "these ARM chips could be cool, what do you think?" On the Mac, you either got on board or were left behind. Users knew that the future was ARM and bought machines even if there might be some short-term growing pains. Developers knew that the future was ARM and worked hard to support it.
But with Windows, there isn't a huge incentive for users to switch to ARM and there isn't an incentive for developers to encourage it. You can say there's some incentive if the ARM chips are better. While Qualcomm's chips are good, the benchmarks aren't really ahead of Intel/AMD and they aren't the power-sipping processors that Apple is putting out.
If Apple hadn't implemented TSO, Mac users/developers would still switch because Apple told them to. Qualcomm has to convince users that their chips are worth the short-term pain - and that users shouldn't wait a few years to make the switch when the ecosystem is more mature. That's a much bigger hill to climb.
Still, for Qualcomm, they might not even care about losing a little money for 5-10 years if it means they become one of the largest desktop processor vendors for the following 20+ years. As long as they can keep Microsoft's interest in ARM as a platform, they can bide their time.
Around mid-2020, when Macs were all-but-confirmed to be moving to Apple-designed chips, but before we had any software details, some commentators speculated that Apple wouldn't build a compatibility layer at all this time around.
How does the Windows app store work anyway? Can they guarantee that all the stuff there gets compiled for ARM?
Anyway, it is Windows not MacOS. The users expect some rough edges and poor craftsmanship, right?
Qualcomm's success is based more on its patent portfolio and how well it uses it, more than any other single factor. It doesn't really have to compete on quality, and their support has long been terrible - they're one of the main drivers of Android's poor reputation for hardware end-of-life. It doesn't matter though, because they have no meaningful competition in many areas.
We're talking about a company that, if certain personal sources are to be believed, started the Snapdragon brand by deciding to cheapen out on memory bandwidth despite feedback that increasing it was critical and leaving the client to find out too late in the integration stage.
Deciding that they make better money by not spending on implementing TSO, or not spending transistors on bigger caches, and getting more volume at lower cost, is perfectly normal.
But yes, if they were actually serious about Windows on ARM, they would have implemented TSO in their "custom" Qualcomm SQ1/SQ2 chips.
If Qualcomm had done better, then the software wouldn't have to be so good, and they'd likely have maintained more market share.
Instead, Microsoft had to make their x86-on-ARM emulator good enough to work on Qualcomm's crap, which now works really nicely on Apple's ARM chips.
I'm also fairly certain that the TSO changes to the memory system are non-trivial, and it's possible that Qualcomm doesn't see it as a value-add in their chips - and they're probably right. Windows machines are such a hot mess that outside a relatively small group of users (who probably run Linux anyway, so aren't anyone's target market), nobody would know or care what TSO is. If it adds cost and power and doesn't matter, why bother?
Games are a pretty notable exception that demand high performance and for the most part will be stuck on x86 forever. Brand new games might start shipping native ARM Windows binaries if the platform gets enough momentum, but games have very limited support lifecycles so it's unlikely that many released before that point will ever be updated to ARM native.
Unity supports Windows ARM. Unreal: probably never. IMO, the PC gaming market is so fragmented that, short of Microsoft developing games for the platform on the scale of the multi-million pre-sales deals that EGS did, games on ARM will only happen by complete accident, not because it makes sense.
In my experience, there's a lot of that kind of software around that was initially designed for a much simpler use-case, and has decades of badly coded features bolted on, with questionable algorithmic choices. It can be unreasonably slow on modern hardware.
Old government database sites are the worst examples in my experience. Clearly tested with a few hundred records, but 15 years later there's a few million and nobody bothered to create a bunch of indexes so searches take a couple minutes. I guess this way they can just charge to upgrade the hardware once in a while instead.
Most legacy programs like Visual Basic 6 are not of this kind.
For any other kinds of applications, the operating system handles the concurrency and it does this in the correct way for the native platform.
Nevertheless, the few programs for which TSO matters are also those where performance must have mattered if the developers bothered to implement concurrent code. Therefore low performance of the emulated application would be noticeable.
I’m not sure they can do that.
Under a Technology License Agreement, Qualcomm can build chips using ARM-designed CPU cores. Specifically, the Qualcomm SQ1 uses ARM Cortex A76/A55 for the fast/slow CPU cores.
I don’t think using ARM-designed cores is enough to implement TSO; you need custom ARM cores instead of the stock ones. To design custom ARM cores, Qualcomm needs an architecture license from ARM, which has recently been cancelled.
Nuvia was developing server CPUs. By now, I believe backward compatibility with x86 and AMD64 is rather unimportant for servers. Hosting providers and public clouds have been offering ARM64 Linux servers for quite a few years now, all important server-running software already has native ARM64 builds.
So maybe it's rational after all, because they know these Windows ARM products will never succeed, so they're just saving themselves the cost/effort of good support.
The logical thing for Qualcomm to do with their current market share is to implement TSO now, then after they get momentum, create a high-end/low-end tier, and disable TSO for the low-end tier to force vendors to target both ARM/x86.
What Qualcomm is doing now makes them look like they just don't care.
Wouldn’t that make the low-end tier run faster than the high-end tier, or force them to leave some performance on the table there?
Also, would a per-process flag that controls TSO be possible? Ignoring whether it’s easy to do in the hardware, the only problem I can think of with that is that the OS would have to set that on processes when they start using shared memory, or forbid using shared memory by processes that do not have it set.
It's at least conceivable and, IMHO, plausible for Qualcomm to see Apple, phones on ARM and aging demographics all speaking to a certain ARM transition.
A decade ago, Apple was on Intel and Microsoft had not advanced many plans in play today. Depending on the smoke they are blowing people's way, one could get an impression ARM is a sure thing.
Frankly, I have no desire to run Windows on ARM.
Linux? Yep.
And I am already on a Mac M1.
I sort of hope it fails personally. I want to see the Intel PC continue in some basic form.
If not, it makes sense that Qualcomm didn't bother adding them.
It is used when you install Rosetta 2 for Linux VMs:
https://developer.apple.com/documentation/virtualization/run...
Based on https://github.com/saagarjha/TSOEnabler/blob/master/TSOEnabl..., it's a field in ACTLR_EL1, which is explicitly (per the ARMv8 spec, at least...) not accessible to userspace (EL0) execution.
There may be some kernel interface to allow userspace to toggle that, but that's not the same as being a userspace-accessible SCR (and I also wouldn't expect it to be passed through to a VM - you'd likely need a hypercall to toggle it, unless the hypervisor emulated that), though admittedly I'm not quite as deep in the weeds on ARMv8 virtualization as I would prefer at the moment.
Without that kernel support, all processes in the VM (not just Rosetta-translated ones) are opted-in to TSO:
> Without selective enablement, the system opts all processes into this memory mode [TSO], which degrades performance for native ARM processes that don’t need it.
With Sequoia, TSO is not enabled for Linux VMs, and that kernel patch (posted in the last few weeks) is required for Rosetta to be able to enable TSO for itself. If the kernel patch isn't present, Rosetta has a non-TSO fallback mode.
> As far as I know this is not part of the ARM standard, but it also isn’t Apple specific: Nvidia Denver/Carmel and Fujitsu A64fx are other 64-bit ARM processors that also implement TSO (thanks to marcan for these details).
I'm not sure how to interpret that: do these other processors have distinct/proprietary TSO extensions? Are they referring to a single published (optional) extension that all three implement? The linked tweet has been deleted so no clues there, and I stopped digging.
For simple loads and stores, the x86 CPUs do not reorder the loads between themselves or the stores between themselves. Also the stores are not done before previous loads.
Only some special kinds of stores can be reordered, i.e. those caused by string instructions or the stores of vector registers that are marked as NT (non-temporal).
So x86 does not need release stores; any simple store is suitable for this. Also, store barriers are not normally needed. Acquire fences a.k.a. acquire barriers are sometimes needed, but much less often than on CPUs with weaker ordering for the memory accesses (for acquire fences both x86 and Arm AArch64 have confusing mnemonics, i.e. LFENCE on x86 and DMB/DSB of the LD kind on AArch64; in both cases these instructions are not load fences as suggested by the mnemonics, but acquire fences).
When converting x86 code to AArch64 code, there are many cases when simple stores must be replaced with release stores (a.k.a. Store-Release instructions in the Arm documentation) and there are many places where acquire barriers must be inserted, or, less frequently, store barriers must be inserted (for non-optimally written concurrent code it may also be necessary to replace some simple loads with Load-Acquire instructions of AArch64).
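To make the release/acquire point concrete, here is a minimal C++ sketch of the classic "publish a value, then set a flag" pattern. The function name message_passing_demo and the value 42 are made up for illustration, and the instruction names in the comments are what compilers typically emit, not a guarantee. In source code the programmer marks the flag accesses explicitly; a binary translator only sees plain x86 stores and loads, and cannot tell which of them play this role, so without hardware TSO it has to treat them conservatively.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo: returns the payload value observed by the consumer thread.
int message_passing_demo() {
    int payload = 0;                  // plain, non-atomic data
    std::atomic<bool> ready{false};   // publication flag
    int observed = -1;

    std::thread producer([&] {
        payload = 42;  // ordinary store
        // Release store: on x86 this is typically a plain MOV, because simple
        // stores are already ordered. On AArch64 it is typically a
        // Store-Release (STLR) instruction.
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&] {
        // Acquire load: typically a plain MOV on x86, and a Load-Acquire
        // (LDAR or similar) on AArch64.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        // The release/acquire pair orders the payload store before this read.
        observed = payload;
    });

    producer.join();
    consumer.join();
    assert(ready.load(std::memory_order_relaxed));  // flag was published
    return observed;
}
```

Calling message_passing_demo() from a test or from main returns the published value. If both flag accesses were relaxed (or plain), the code would still behave this way on x86 hardware in practice, but on a weakly ordered ARM core the consumer could see the flag set and still read a stale payload. That gap is what the translator has to paper over with Store-Release/Load-Acquire or barriers.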
Excellent engineering and nice that it was built properly. Is this something that Linux / Wine / the Steam compatibility layer already benefit from?
As such it may very well be a loss leader and that is fine. Probably most development has been done and there is little maintenance needed.
Also, while most native macOS apps that I encounter have an Apple silicon version now, I still find docker images for amd64 without an arm64 version present. Rosetta2 also helps with these applications.
I had a M1 Mini for a while, and it played Kerbal Space Program (x86) far better than my previous Intel Mini, which had Intel Integrated Graphics that could barely manage a 4k monitor, much less actual gaming.
I believe there's a way to use Rosetta with Linux VMs, too (to translate x86 VM applications to ARM and run them natively) - but I no longer have any Macs, so I've not had a chance to play with it.
Just because 0.1% of apps need a feature, the lack of it won't translate into only 0.1% of lost sales. People don't behave like that.
So the more important question is: how many people moved to ARM because they felt they don't need to worry about compatibility with existing use cases?
Also, x86 containers.
Btw, Rosetta 2 actually supports x86-32. Which means you can run 32-bit Windows binaries through WINE, just not Mac 32-bit binaries.
So if you kill support for an old game, it will probably never be updated since it's no longer commercially relevant. Publishers are probably almost happy when old games get broken since they can sell you newer ones easier.
So even if they have kept the old OpenGL version that they had, many newer OpenGL-based applications cannot run on MacOS.
Since OpenGL is no longer evolving, it would not have been a great effort to bring the OpenGL support to the last version, and only then freeze it.
The Arm PC Base System Architecture 1.0 (PC-BSA) specifies a standard hardware system architecture for Personal Computers (PCs) that are based on the Arm 64-bit Architecture. PC system software, for example operating systems, hypervisors, and firmware can rely on this standard system architecture. PC-BSA extends the requirements specified in the Arm BSA.
--
"We use Rosetta to emulate x86 programs on Apple Silicon, which is much faster than the commonly-used QEMU."
What I get is that Rosetta is used when you run something on Docker that uses the x86 architecture (I'm guessing x86_64), which for me is pretty often.
--
1: https://docs.orbstack.dev/architecture#low-level-vm-optimiza...