This misses two key issues.
1. If you want to really trust that your crypto code has no timing side channels, then you've gotta write it in assembly. Otherwise, you're at the compiler's whims to turn code that seems like it really should be constant-time into code that isn't. There's no thorough mechanism in compilers like LLVM to prevent this from happening.
2. If you look at the CVEs in OpenSSL, they are generally in the C code, not the assembly code. If you look at OpenSSL CVEs going back to the beginning of 2023, there is not a single vulnerability in Linux/X86_64 assembly. There are some in the Windows port of the X86_64 assembly (because Windows has a different calling conv and the perlasm mishandled it). There are some on other arches. But almost all of the CVEs are in C, not asm.
If you want to know a lot more about how I think about this, see https://fil-c.org/constant_time_crypto
I do think it's a good idea to have crypto libraries implemented in memory safe languages, and that may mean writing them in Rust. But the actual kernels that do the cryptographic computations that involve secrets should be written in asm for maximum security so that you can be sure that sidechannels are avoided and because empirically, the memory safety bugs are not in that asm code.
Something I recently learnt: the actual number of physical registers in modern x86 CPUs are significantly larger, even for 512-bit SIMD. Zen 5 CPUs actually have 384 vectors registers, 384*512b = 24KB!
And the critical matrix tiling size is often SRAM, so L3 unified cache.
Not related, but I often want to see the next or previous element when I'm iterating. When that happens, I always have to switch to an index-based loop. Is there a function that returns Iter<Item=(T, Option<T>)> where the second element is a lookahead?
I suppose it saves on the decoding portion a little but it's ultimately no more effective than just issuing the 2 256 instructions yourself.
AVX512 has 2048 bytes of named registers; AVX2 has 512 bytes. AVX512 uses out of band registers for masking, AVX2 uses in band mask registers. AVX512 has better options for swizzling values around. All (almost all?) AVX512 instructions have masked variants, allowing you to combine an operation and a subsequent mask operation into a single operation.
Often times I'll write the AVX512 version first, and go to write the AVX2 version, and a lot of the special sauce that made the AVX512 version good doesn't work in AVX2 and it's real awkward to get the same thing done.
If you benchmark it, it will be slower about half the time.
[0] https://godbolt.org/z/85nx44zcE
edited: I tested gcc/clang and they just straight up fail to compile without -msse3. The generated code without optimizations is also pretty bonkers!
This is an overstatement so gross that it can be considered false. On Skylake-X, for mixed workloads that only had a few AVX-512 instructions, a net performance loss could have happened. On Ice Lake and later this statement was not true in any way. For code like ChaCha20 it was not true even on Skylake-X.