https://blog.timhutt.co.uk/riscv-vector/
It has a visualisation of the element selection stuff at the end.
You can implement both regular SIMD ISAs and scalable SIMD/Vector ISAs in a "Vector processor" style, and both in a regular SIMD style.
My point was about the underlying hardware implementation, specifically:
> "As shown in Figure 1-3, array processors scale performance spatially by replicating processing elements, while vector processors scale performance temporally by streaming data through pipelined functional units"
This applies to the hardware implementation, not the ISA, which the text doesn't make clear.
You can implement AVX-512 with a smaller data path than the register width and "scale performance temporally by streaming data through pipelined functional units". Zen4 is a simple example of this, but there is nothing stopping you from implementing AVX-512 on top of heavily temporally pipelined 64-bit wide execution units.
Similarly, you can implement RVV with a smaller data path than VLEN, but you can also implement it as a bog-standard SIMD processor. The only thing that slightly complicates the comparison is LMUL, but it is fundamentally equivalent to unrolling.
The substantial difference between Vector and SIMD ISAs is imo only the existence of a vl-based predication mechanism. Whether a SIMD ISA has a fixed register width or not, i.e. whether it lets you write vector-length-agnostic code, is an independent dimension of the ISA design. E.g. the Cray-1 was without a doubt a Vector processor, but the vector registers on all compatible platforms had the exact same length. It did, however, have the mentioned vl-based predication mechanism.

You could take AVX10/128, AVX10/256 and AVX10/512, overlap their instruction encodings, and end up with a scalable SIMD ISA for which you can write vector-length-agnostic code, but that doesn't make it a Vector ISA any more than it was before.
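To make "vl-based predication" concrete, here's a minimal sketch in C, assuming the standard RVV C intrinsics (the __riscv_* names from the v1.0 intrinsics spec): vsetvl reports how many elements this iteration will handle, so the final partial iteration needs no mask and no scalar tail.

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Add 1 to every element; vl is chosen by the hardware each iteration. */
    void add1_i32_rvv(int32_t *x, size_t n) {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m1(n);          /* vl <= min(n, VLMAX) */
            vint32m1_t v = __riscv_vle32_v_i32m1(x, vl);  /* load vl elements    */
            v = __riscv_vadd_vx_i32m1(v, 1, vl);
            __riscv_vse32_v_i32m1(x, v, vl);              /* store vl elements   */
            x += vl;
            n -= vl;
        }
    }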
RISC-V Vector is definitely tricky to get a handle on, especially if you just read the architecture documentation (which is to be expected really, a good specification for an architecture isn't compatible with being a useful beginner's guide). I found I needed to look at some presentations given by various members of the vector working group to get a good grasp of the principles.
There's been precious little material beyond the specification and some now slightly old slide decks, so this is a great contribution.
The specification for an architecture is meant to be useful to anyone writing assembly, not just to people implementing the spec. Case in point: x86 manuals aren't meant for Intel, they're meant for Intel's customers.
There is a lot of cope re the fact that RISC-V's spec is particularly hard to use for writing assembly or for understanding the software model.
If the spec isn't a 'manual', then where's the manual? If there's just no manual, that's a deficiency. If we only have 'tutorials', that's bad as well: a manual is a good reference for an experienced user and approachable to a slightly aware beginner (or a fresh beginner with experience in other archs); a tutorial is too verbose to be useful as a regular reference.
Either the spec should have read (and still could read) more like a useful manual, or a useful manual needs to be provided.
Not quite. It still is the same “process whatever number of items you can in parallel, decrease count by that, repeat if necessary” loop.
RISC-V decided to move the “decrease count by that, repeat if necessary” part into hardware, making the entire phrase “how the hardware works”.
Makes for shorter and nicer assembly. SIMD without it first has to query the CPU to find out how much parallelization it can handle (once) and do the “decrease count by that, repeat if necessary” part on the main CPU.
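For comparison, a rough sketch of the same loop on a fixed-width SIMD ISA (SSE2 intrinsics here, chosen purely for illustration): the per-iteration element count is baked into the code, and the decrement-and-repeat bookkeeping plus the tail run as ordinary scalar instructions.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Add 1 to every element, 4 x int32 at a time; count management and the
       leftover tail are handled by the scalar core. */
    void add1_i32_sse2(int32_t *x, size_t n) {
        const __m128i one = _mm_set1_epi32(1);
        while (n >= 4) {
            __m128i v = _mm_loadu_si128((__m128i *)x);
            _mm_storeu_si128((__m128i *)x, _mm_add_epi32(v, one));
            x += 4;
            n -= 4;
        }
        while (n > 0) {                 /* scalar tail, at most 3 elements */
            *x++ += 1;
            n--;
        }
    }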
IIRC libc for x64 has several implementations of memcpy/memmove/strlen/etc. for different SSE/AVX extensions, which all get compiled in and shipped to your system; when libc is loaded for the first time, it figures out which is the latest extension the CPU it's running on actually supports and then patches its exports to point to the fastest working implementations.
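A simplified sketch of that dispatch pattern, using GCC's __builtin_cpu_supports for feature detection (glibc's real mechanism is IFUNC relocations resolved by the dynamic loader; the function names and placeholder bodies below are made up for illustration):

    #include <string.h>
    #include <stddef.h>

    /* Placeholder variants; the real ones would contain SSE2/AVX2/AVX-512 code. */
    static void *memcpy_sse2(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }
    static void *memcpy_avx2(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }
    static void *memcpy_avx512(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

    static void *(*memcpy_impl)(void *, const void *, size_t) = memcpy_sse2;

    /* Run once at load time: pick the widest implementation the CPU supports. */
    void memcpy_init(void) {
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx512f"))
            memcpy_impl = memcpy_avx512;
        else if (__builtin_cpu_supports("avx2"))
            memcpy_impl = memcpy_avx2;
    }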
This kinda (though admittedly not entirely) balances out the x86 problem - sure, you have to write a new loop to take advantage of wider vector registers, but you often want to do that anyway - going from SSE to AVX(2) you get to take advantage of non-destructive ops, all inline loads being allowed to be unaligned, and a couple of nice new instrs; going from AVX2 to AVX-512 you get a ton of masking stuff, non-awful blends, among others.
RVV gets an advantage here largely due to simply being a newer ISA, arriving at a time when it is actually reasonably possible for even baseline hardware to support expensive compute instrs, complex shuffles, all unaligned memory ops (though, actually, with RISC-V/RVV not mandating unaligned support (and allowing it to be extremely slow even when supported), this is another thing you may want to write multiple loops for), and whatnot; whereas x86 SSE2 had to work on whatever could exist 20 years ago, and made the respective compromises.
In some edge cases the x86 approach can even be better - if you have some code that benefits from having different versions depending on the hardware vector size (e.g. it needs to use vrgather, or processes some fixed-size data that'd be really bad to write in a scalable way), on RVV you may end up needing to write a loop for each combination of VLEN and extension set (i.e. a quadratic number of cases), whereas on x86 you only need a version of the loop for each desired extension set.
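A rough illustration of that combinatorial point (everything here is hypothetical: the kernel names, the have_zvkb flag, and the assumption that only two VLEN classes matter) - the dispatch table grows as extension-set x VLEN:

    #include <riscv_vector.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef void (*kernel_fn)(void);

    /* One specialized kernel per (VLEN class, extension set) combination. */
    static void k_v128_base(void) { puts("VLEN=128, base V"); }
    static void k_v128_zvkb(void) { puts("VLEN=128, V + Zvkb"); }
    static void k_v256_base(void) { puts("VLEN>=256, base V"); }
    static void k_v256_zvkb(void) { puts("VLEN>=256, V + Zvkb"); }

    kernel_fn pick_kernel(bool have_zvkb) {
        size_t vlen_bytes = __riscv_vsetvlmax_e8m1();   /* VLEN/8 at LMUL=1 */
        if (vlen_bytes >= 32)
            return have_zvkb ? k_v256_zvkb : k_v256_base;
        return have_zvkb ? k_v128_zvkb : k_v128_base;
    }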
I’d love a similar document for ARM NEON as well.
Also, where does that 38-byte stride even come from? That number is not even divisible by 4, never mind by 8!