C++26 Shipped a SIMD Library Nobody Asked For
46 points | 2 days ago | 3 comments | lucisqr.substack.com | HN
jandrewrogers
1 hour ago
I have written a lot of SIMD for both x86 and ARM over many years and many microarchitectures. Every abstraction, including autovectorization, is universally pretty poor outside of narrow cases because they don’t (and mostly can’t) capture what is possible with intrinsics and their rather extreme variation across microarchitectures. If I want good results, I have to write intrinsics. No library can optimally generate non-trivial SIMD code. Neither can the compiler. Portability just amplifies this gap.

I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?

The intrinsics are not difficult but you do have to learn how the hardware works. This is true even if you are using a library. A good software engineer should have a rough understanding of this regardless.

reply
mgaunard
1 hour ago
For me the main issue is that if you're serious about SIMD, you need to use a state-of-the-art library and can't rely on some standard library whose quality is variable and unreliable, and which is by design always behind.
reply
jandrewrogers
49 minutes ago
For some algorithms you have to compromise the data layout for compatibility across the widest number of microarchitectures by nerfing the performance on advanced SIMD microarchitectures working on the same data structures. There really isn’t a way to square that circle. You can make it portable or you can make it optimal, and the performance gap across those two implementations can be vast.
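
One common illustration of how data layout interacts with SIMD is array-of-structures versus structure-of-arrays. The sketch below is only an assumed example of layout sensitivity, not necessarily the case described above; the type and function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures: natural to write, but x, y and z are interleaved in
// memory, so picking up several consecutive x values needs strided access.
struct PointAoS {
    float x, y, z;
};

inline float sum_x_aos(const std::vector<PointAoS>& pts) {
    float sum = 0.0f;
    for (const PointAoS& p : pts) sum += p.x;
    return sum;
}

// Structure-of-arrays: each field is contiguous, so a single vector load can
// pick up consecutive x values whatever the register width is.
struct PointsSoA {
    std::vector<float> x, y, z;
};

inline float sum_x_soa(const PointsSoA& pts) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < pts.x.size(); ++i) sum += pts.x[i];
    return sum;
}
```

Both functions compute the same result; what differs is the memory access pattern a vectorized version would have to deal with.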

In the 15-20 years I’ve been doing it, I’ve seen zero evidence that there is a solution to this tradeoff. And people that are using SIMD are people that care about state-of-the-art performance, so portability takes a distant back seat.

reply
mattip
11 minutes ago
NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?
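
For readers unfamiliar with runtime dispatch, here is a minimal sketch of the general idea. It is not NumPy's mechanism: the kernel names are hypothetical, both kernels are scalar stand-ins, and the CPU check assumes GCC or Clang on x86 (`__builtin_cpu_supports`), falling back to the baseline elsewhere.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical kernels. In a real library each would be compiled for a
// different instruction set; here they are identical scalar stand-ins.
namespace kernels {
inline std::uint64_t sum_baseline(const std::uint8_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
inline std::uint64_t sum_avx2_placeholder(const std::uint8_t* p, std::size_t n) {
    return sum_baseline(p, n);  // placeholder body, not real AVX2 code
}
}  // namespace kernels

using SumFn = std::uint64_t (*)(const std::uint8_t*, std::size_t);

// Pick an implementation based on what the running CPU reports.
inline SumFn select_sum() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2")) return kernels::sum_avx2_placeholder;
#endif
    return kernels::sum_baseline;
}

inline std::uint64_t sum_bytes(const std::uint8_t* p, std::size_t n) {
    static const SumFn fn = select_sum();  // resolved once, on first call
    return fn(p, n);
}
```

Every additional target means another compiled copy of each kernel, which is where the code bloat comes from.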
reply
cortesoft
27 minutes ago
Is this a technical impossibility, or has it just not been done yet? Could a library support generating intrinsics for a large set of architectures?
reply
mattip
13 minutes ago
There is Google’s Highway, which provides an abstraction layer. It is used by NumPy.
reply
loeg
18 minutes ago
Google Highway gets mentioned in the article.
reply
mpyne
45 minutes ago
> I think a legitimate criticism is that it is unclear who std::simd is for.

I think it's for people like me, who recognize that, depending on the dataset, a lot of performance is left on the table when you don't take advantage of SIMD, but who are not interested in becoming experts on intrinsics for a multitude of processor combinations.

Having a way to be able to say "flag bytes in this buffer matching one of these five characters, choose the appropriate stride for the actual CPU" and then "OR those flags together and do a popcount" (as I needed to do writing my own wc(1) as an exercise), and have that at least come close to optimal performance with intrinsics would be great.
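
As a rough picture of that operation, here is a scalar sketch (not the wc(1) code mentioned above). It assumes an arbitrary set of five delimiter bytes and a fixed 64-byte block, which stands in for whatever stride a SIMD library would choose for the actual CPU; all names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// True if c is one of five delimiter bytes (an arbitrary, illustrative set).
inline bool is_delim(unsigned char c) {
    return c == ' ' || c == '\n' || c == '\t' || c == '\r' || c == '\f';
}

// Portable popcount (Kernighan's loop); C++20 std::popcount would also do.
inline unsigned popcount64(std::uint64_t v) {
    unsigned n = 0;
    for (; v != 0; v &= v - 1) ++n;
    return n;
}

// Count delimiter bytes by building one flag bit per byte, 64 bytes at a
// time, then popcounting each mask.
inline std::size_t count_delims(const unsigned char* buf, std::size_t len) {
    std::size_t total = 0;
    std::size_t i = 0;
    for (; i + 64 <= len; i += 64) {
        std::uint64_t mask = 0;
        for (std::size_t j = 0; j < 64; ++j)
            mask |= static_cast<std::uint64_t>(is_delim(buf[i + j])) << j;
        total += popcount64(mask);
    }
    for (; i < len; ++i) total += is_delim(buf[i]);  // scalar tail
    return total;
}
```

A SIMD version would replace the inner loop with a handful of vector compares ORed together, followed by a move-mask and a hardware popcount.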

Just like I'd rather use a range-based for than hand-count an index against a size.

> People that don’t use SIMD today are unlikely to use std::simd tomorrow.

I mean, why not? That's exactly my use case. I don't use SIMD today as it's a PITA to do properly, despite advancements in glibc and binutils that make it easier to load CPU-specific code. And it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions. But it is legitimately important for improving performance for many workloads, so I don't want to miss it where it will help.

And even gaining 60-70% of the "optimal" SIMD performance still puts you much closer to peak performance than the alternative.

In the end I did have to write some SIMD intrinsics directly. I forget what issue I'd run into when starting off with std::simd, but std::simd was what had made that problem seem approachable for the first time.

reply
jandrewrogers
8 minutes ago
You raise some good points. I think a lot about how to make SIMD more accessible, and spend an inordinate amount of time experimenting with abstractions, because I’ve experienced its many inadequacies.

The design of the intrinsics libraries does them no favors, and there are many inconsistencies. Basic things could be made more accessible but are somewhat limited by a requirement for C compatibility. This is something a C++ standard can actually address: it can be C++ native, which can hide many things. Hell, I have my own libraries that clean this up by thinly wrapping the existing intrinsics, improving their conciseness and expressiveness for common use cases. It significantly improves the ergonomics.
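
To make "thinly wrapping the existing intrinsics" concrete, here is a minimal sketch of what such a wrapper might look like. It is not the library described above: the `f32x4` type and `mul_add` function are hypothetical, the SSE path assumes an x86 compiler that defines `__SSE__`, and a plain-array fallback keeps it compiling elsewhere.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical wrapper: four floats with operator syntax instead of
// _mm_add_ps / _mm_mul_ps call chains.
struct f32x4 {
#if defined(__SSE__)
    __m128 v;
    static f32x4 load(const float* p) { return {_mm_loadu_ps(p)}; }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    friend f32x4 operator+(f32x4 a, f32x4 b) { return {_mm_add_ps(a.v, b.v)}; }
    friend f32x4 operator*(f32x4 a, f32x4 b) { return {_mm_mul_ps(a.v, b.v)}; }
#else
    // Scalar fallback so the sketch builds on non-x86 targets.
    float v[4];
    static f32x4 load(const float* p) { return {{p[0], p[1], p[2], p[3]}}; }
    void store(float* p) const { for (int i = 0; i < 4; ++i) p[i] = v[i]; }
    friend f32x4 operator+(f32x4 a, f32x4 b) {
        return {{a.v[0] + b.v[0], a.v[1] + b.v[1], a.v[2] + b.v[2], a.v[3] + b.v[3]}};
    }
    friend f32x4 operator*(f32x4 a, f32x4 b) {
        return {{a.v[0] * b.v[0], a.v[1] * b.v[1], a.v[2] * b.v[2], a.v[3] * b.v[3]}};
    }
#endif
};

// out[i] = a[i] * b[i] + c[i], four lanes at a time with a scalar tail.
inline void mul_add(const float* a, const float* b, const float* c,
                    float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        f32x4 r = f32x4::load(a + i) * f32x4::load(b + i) + f32x4::load(c + i);
        r.store(out + i);
    }
    for (; i < n; ++i) out[i] = a[i] * b[i] + c[i];
}
```

The wrapper adds no abstraction over the hardware; it only swaps C-style function calls for operators, which is the ergonomic gain being described.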

An argument I would make, though, is that the lowest-common-denominator cases that are actually portable are almost exactly the cases that auto-vectorization should be able to address. Auto-vectorization may not be good enough to consistently address all of those cases today, but you can see a future where std::simd is essentially vestigial: auto-vectorization subsumes what it can do, yet portability requirements prevent it from being leveled up to express more than what auto-vectorization can see.
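
A typical example of such a lowest-common-denominator case is a plain element-wise loop. The sketch below is an assumed illustration (the `saxpy` name is just the conventional one for this operation); whether a given compiler actually vectorizes it depends on the compiler version and optimization flags.

```cpp
#include <cstddef>

// y[i] += a * x[i]. Unit stride, no branches, no dependency between
// iterations: the kind of loop mainstream compilers can usually vectorize
// when optimization is enabled, with no SIMD-specific source code at all.
inline void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
}
```

Anything a portable SIMD type can express about this loop is already visible to the compiler in the scalar form.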

The other argument is that SIMD is the wrong level of abstraction for a library. Depending on the microarchitecture, the optimal code using SIMD may be an entirely different data structure and algorithm, so you are swapping out SIMD details at a very abstract macro level, not at the level of abstraction that intrinsics and auto-vectorization provide. You miss a lot of optimization if you don’t work a couple levels up.

SIMD abstraction and optimization is deeply challenging within programming languages designed around scalar ALU operators. We can’t even fully abstract the expressiveness of modern scalar ALUs across microarchitectures because programming languages don’t define a concept that maps to the capabilities of some modern ALUs.

That said, I love that silicon has become so much more expressive.

reply
paulddraper
28 minutes ago
> I think a legitimate criticism is that it is unclear who std::simd is for

It's for people that don't use SIMD today.

SIMD is hard, or at least nuanced and platform-dependent. To say that std::simd doesn't lower the learning curve is intellectually dishonest.

---

Despite the title, the primary criticism of the article is that compilers' auto-vectorizers have improved more than the currently shipped stdlib version.

reply
jandrewrogers
1 minute ago
My criticism could mostly be summarized similarly. The scope of what a portable std::simd can do is almost exactly the scope that you would expect auto-vectorization to subsume over time. SIMD, to the extent it is covered by std::simd, is the part of SIMD that should be pretty simple to learn.

There isn’t an obvious path to elevate it above what auto-vectorization should theoretically be capable of in a portable way. This leads to a potential long-term outcome where std::simd is essentially a no-op because scalar code is automagically converted into the equivalent and it is incapable of supporting more sophisticated SIMD code.

reply
mgaunard
53 minutes ago
I made the first proposal to the C++ standard committee to introduce SIMD in 2011, before Matthias Kretz got involved with his own version (which is what became std::simd). This was based on what eventually became Eve (mentioned in the article).

Back then, it was rejected on the same arguments that people are making today, such as not mapping well to SVE, needing a separate way to express control flow, etc.

There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language. Then that died out (I'm not sure why), and SIMD became trendy, so the committee was more open to doing something to show that they were keeping up with the times.

reply
rbanffy
11 minutes ago
To me it’s clear adding the ability to express intent to parallelise is the Right Thing. This is the only way the compiler can actually know what you want it to do.
reply
AlotOfReading
21 minutes ago
Trying to abstract over SVE with a SIMD library is a bit of a fool's errand. The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it. All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.

Frankly, the length agnostic stuff is a mistake that I hope hardware designers will eventually see the light on, like delay slots.

reply
magicalhippo
1 day ago
The linked[1] "six reasons to use std::simd" was just what I needed after a long week. Hilarious!

[1]: https://github.com/NoNaeAbC/std_simd

reply
AlotOfReading
1 hour ago
That certainly convinced me. When I was doing my taxes recently and had to watch those forced loading animations, I kept asking myself "why can't my compiler do this?" Thanks to std::simd, now it can!
reply
mgaunard
1 hour ago
Isn't that just a QoI (quality of implementation) issue? There's a reason why the libstdc++ folks labelled their implementation as experimental.
reply