SIMD programming in pure Rust (kerkour.com)
78 points | 2 days ago | 9 comments
pizlonator
3 hours ago
[-]
This article references the fact that security issues in crypto libs are memory safety issues, and I think this is meant to be a motivator for writing the crypto using SIMD intrinsics.

This misses two key issues.

1. If you want to really trust that your crypto code has no timing side channels, then you've gotta write it in assembly. Otherwise, you're at the compiler's whims to turn code that seems like it really should be constant-time into code that isn't. There's no thorough mechanism in compilers like LLVM to prevent this from happening.
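To make the concern concrete, here is a hypothetical sketch of "looks constant-time" Rust; nothing binds the optimizer to keep it branch-free on every target:

    // Byte-wise comparison written to avoid data-dependent branches.
    // LLVM is free to turn the accumulated OR back into an early-exit
    // branch, which is exactly the regression that asm rules out.
    fn ct_eq(a: &[u8; 32], b: &[u8; 32]) -> bool {
        let mut diff = 0u8;
        for i in 0..32 {
            diff |= a[i] ^ b[i];
        }
        diff == 0
    }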

2. If you look at the CVEs in OpenSSL, they are generally in the C code, not the assembly code. Going back to the beginning of 2023, there is not a single vulnerability in the Linux/x86_64 assembly. There are some in the Windows port of the x86_64 assembly (because Windows has a different calling convention and the perlasm mishandled it). There are some on other arches. But almost all of the CVEs are in C, not asm.

If you want to know a lot more about how I think about this, see https://fil-c.org/constant_time_crypto

I do think it's a good idea to have crypto libraries implemented in memory safe languages, and that may mean writing them in Rust. But the actual kernels that do the cryptographic computations that involve secrets should be written in asm for maximum security so that you can be sure that sidechannels are avoided and because empirically, the memory safety bugs are not in that asm code.

reply
johnisgood
36 minutes ago
[-]
So there are more bugs in a more readable and understandable programming language (C) than in asm? What gives? I am asking because intuition would say the opposite, since asm is much lower-level than C.
reply
itemize123
29 minutes ago
[-]
Compiler optimization is a black box. Shortcuts it takes in crypto routines can allow side-channel attacks.
reply
johnisgood
24 minutes ago
[-]
Yes, I know; I am referring to memory safety bugs, which Rust largely eliminates, according to everyone.
reply
shihab
6 hours ago
[-]
> For example, NEON ... can hold up to 32 128-bit vectors to perform your operations without having to touch the "slow" memory.

Something I recently learnt: the actual number of physical registers in modern x86 CPUs is significantly larger, even for 512-bit SIMD. Zen 5 CPUs actually have 384 vector registers: 384 * 512 bits = 24 KB!

reply
cmovq
5 hours ago
[-]
This is true, but if you run out of the 32 register names you’ll still need to spill to memory. The large register file is to allow for multiple instructions to execute in parallel among other things.
reply
zeusk
3 hours ago
[-]
They’re used by the internal register renamer/allocator: if it sees you storing a result to memory and then reusing the named register for a new result, it will allocate a new physical register so your instruction doesn’t stall waiting for the previous write to go through.
reply
dapperdrake
5 hours ago
[-]
In the register file or named registers?

And the critical matrix tiling size is often set by SRAM, i.e. the unified L3 cache.

reply
rwaksmunski
2 days ago
[-]
Every Rust SIMD article should be required by law to mention the .chunks_exact() auto-vectorization trick.
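For readers who haven't seen it, a minimal sketch of the trick (an illustrative sum, not from the article): fixed-width chunks give LLVM a loop body it can prove in-bounds, and per-lane accumulators avoid float reassociation, so it vectorizes.

    // Sum f32s with 8 independent lane accumulators; the exact-width
    // inner loop auto-vectorizes, the remainder is handled as scalar.
    fn sum(data: &[f32]) -> f32 {
        let mut lanes = [0.0f32; 8];
        let chunks = data.chunks_exact(8);
        let tail = chunks.remainder();
        for chunk in chunks {
            for i in 0..8 {
                lanes[i] += chunk[i];
            }
        }
        lanes.iter().sum::<f32>() + tail.iter().sum::<f32>()
    }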
reply
ChadNauseam
7 hours ago
[-]
Didn't know about this. Thanks!

Not related, but I often want to see the next or previous element when I'm iterating. When that happens, I always have to switch to an index-based loop. Is there a function that returns Iter<Item=(T, Option<T>)> where the second element is a lookahead?

reply
tyilo
6 hours ago
[-]
You probably just want to use `.peekable()`: https://doc.rust-lang.org/stable/std/iter/trait.Iterator.htm...
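A minimal sketch of the lookahead with `.peekable()` (`items` is just a placeholder):

    fn main() {
        let items = [1, 2, 3];
        let mut it = items.iter().peekable();
        while let Some(cur) = it.next() {
            // peek() returns Option<&next> without consuming it
            match it.peek() {
                Some(next) => println!("{cur}, next: {next}"),
                None => println!("{cur}, next: none"),
            }
        }
    }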
reply
dfajgljsldkjag
6 hours ago
[-]
The benchmarks on Zen 5 are absolutely insane for just a bit of extra work. I really hope the portable SIMD module stabilizes soon, so we do not have to keep rewriting the same logic for NEON and AVX every time we want to optimize something. That example about implementing ChaCha20 twice really hit home for me.
reply
crote
2 days ago
[-]
What is the "nasty surprise" of Zen 4 AVX512? Sure, it's not quite the twice as fast you might initially assume, but (unlike Intel's downclocking) it's still a strict upgrade over AVX2, is it not?
reply
cogman10
7 hours ago
[-]
It's splitting a 512-bit instruction into two 256-bit instructions internally. That's the main nasty surprise.

I suppose it saves a little on the decoding portion, but it's ultimately no more effective than just issuing the two 256-bit instructions yourself.

reply
adgjlsfhk1
8 minutes ago
[-]
Predicated instructions are incredibly useful (and AVX-512-only). They let you get rid of the usual tail handling at the end of the loop.
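For example, a minimal sketch of a masked tail (assumes AVX-512F and a toolchain where these intrinsics are available; `add_tail` is a hypothetical helper for a tail of fewer than 16 i32s):

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx512f")]
    unsafe fn add_tail(dst: *mut i32, src: *const i32, n: usize) {
        use std::arch::x86_64::*;
        // Mask with the low n lanes active; masked-off lanes are never
        // loaded or stored, so no scalar cleanup loop is needed.
        let mask = ((1u32 << n) - 1) as __mmask16;
        let a = _mm512_maskz_loadu_epi32(mask, dst);
        let b = _mm512_maskz_loadu_epi32(mask, src);
        _mm512_mask_storeu_epi32(dst, mask, _mm512_add_epi32(a, b));
    }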
reply
nwallin
1 hour ago
[-]
Single-pumped AVX512 can still be a lot more effective than double-pumped AVX2.

AVX512 has 2048 bytes of named registers; AVX2 has 512 bytes. AVX512 uses out-of-band registers for masking; AVX2 uses in-band mask registers. AVX512 has better options for swizzling values around. All (almost all?) AVX512 instructions have masked variants, allowing you to combine an operation and a subsequent mask operation into a single operation.

Oftentimes I'll write the AVX512 version first, then go to write the AVX2 version, and a lot of the special sauce that made the AVX512 version good doesn't work in AVX2, and it's real awkward to get the same thing done.

reply
MobiusHorizons
7 hours ago
[-]
The benefit seems to be that we are one step closer to not needing the fallback path. This was probably a lot more relevant before Intel shit the bed on consumer AVX-512 by shipping E-cores without the feature.
reply
convolvatron
7 hours ago
[-]
AVX-512 on Zen 4 also includes a bunch of features that weren't in AVX2: enhanced masking, 16-bit floats, bit-manipulation instructions, and a register file with twice as many registers at twice the width.
reply
fooker
5 hours ago
[-]
> it's still a strict upgrade over AVX2

If you benchmark it, it will be slower about half the time.

reply
adgjlsfhk1
12 minutes ago
[-]
For the simplest cases it will be about the same speed as AVX2, but if you're trying to do anything fancy, the extra registers and instructions are a godsend.
reply
wyldfire
4 hours ago
[-]
Does Rust provide architecture-specific intrinsics like C/C++ toolchains usually do? That's a popular way to do SIMD.
reply
steveklabnik
3 hours ago
[-]
reply
m-hilgendorf
46 minutes ago
[-]
People should be aware, though, that without `-C target-feature=+<feature>` in your rustc flags the compiler may emit function calls to stubs for the intrinsic [0]. So people should make sure they're passing the appropriate target features, especially when benchmarking.

[0] https://godbolt.org/z/85nx44zcE

edited: I tested gcc/clang and they just straight up fail to compile without -msse3. The generated code without optimizations is also pretty bonkers!
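The per-function alternative is the `#[target_feature]` attribute, which avoids setting the feature globally; a minimal sketch with an SSE3 intrinsic (my example, not necessarily the one in the godbolt link):

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "sse3")]
    unsafe fn hadd(
        a: std::arch::x86_64::__m128,
        b: std::arch::x86_64::__m128,
    ) -> std::arch::x86_64::__m128 {
        // With the feature enabled this compiles to a single haddps
        // instead of a call to a stub.
        std::arch::x86_64::_mm_hadd_ps(a, b)
    }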

reply
karavelov
3 hours ago
[-]
Yes, in the submodules of `std::arch`
reply
jeffbee
6 hours ago
[-]
"Intel CPUs were downclocking their frequency when using AVX-512 instructions due to excessive energy usage (and thus heat generation) which led to performance worse than when not using AVX-512 acceleration."

This is an overstatement so gross that it can be considered false. On Skylake-X, for mixed workloads that only had a few AVX-512 instructions, a net performance loss could have happened. On Ice Lake and later this statement was not true in any way. For code like ChaCha20 it was not true even on Skylake-X.

reply
rurban
41 minutes ago
[-]
This was written in the past tense, and it was true in the last decade. Only recently has Intel come up with proper AVX-512.
reply
jeffbee
28 minutes ago
[-]
It wasn't. My comment covers the entire history of the ISA extension on Intel Xeon CPUs.
reply
celrod
26 minutes ago
[-]
I netted huge performance wins out of AVX512 on my Skylake-X chips all the time. I'm excited about less downclocking and smarter throttling algorithms, but AVX512 was great even without them -- mostly just hampered by poor hardware availability, poor adoption in software, and some FUD.
reply
cl0ckt0wer
3 hours ago
[-]
Yeah, I would have loved to see benchmarks across generations and vendors.
reply
nice_byte
2 hours ago
[-]
it's hard to believe that using simd extensions in rust is still as much of a chaotic clusterfudge as it was the first time I looked into it. no support in standard library? 3 different crates? might as well write inline assembly...
reply
steveklabnik
2 hours ago
[-]
The standard library offers intrinsics but not a portable high level API just yet.
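The portable API lives on nightly in `std::simd` behind a feature gate; a minimal sketch (names may still shift before stabilization):

    #![feature(portable_simd)]
    use std::simd::f32x8;

    // One portable kernel instead of separate NEON and AVX paths; the
    // backend lowers f32x8 to whatever vector width the target offers.
    fn add8(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
        (f32x8::from_array(a) + f32x8::from_array(b)).to_array()
    }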
reply
formerly_proven
7 hours ago
[-]
Lazy man's "kinda good enough for some cases" SIMD in pure Rust is to simply target x86-64-v3 (RUSTFLAGS=-Ctarget-cpu=x86-64-v3), which is supported by all AMD Zen CPUs and by Intel CPUs since Haswell. For floating-point code, which cannot be auto-vectorized due to the accuracy implications, "simply" write it with explicit four- or eight-way lanes and LLVM will do the rest. Usually. Loops may need explicit handling of the head or tail to auto-vectorize (chunks_exact helps with this; it hands you the tail).
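A minimal sketch of the explicit-lanes idea (an illustrative dot product, my example rather than the commenter's): with -Ctarget-cpu=x86-64-v3, LLVM can lower the fixed-width body to AVX2/FMA code.

    // Four independent accumulators preserve float semantics while
    // giving LLVM a vectorizable shape; the tail stays scalar.
    fn dot(a: &[f32], b: &[f32]) -> f32 {
        let mut lanes = [0.0f32; 4];
        for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
            for i in 0..4 {
                lanes[i] += ca[i] * cb[i];
            }
        }
        let tail: f32 = a
            .chunks_exact(4)
            .remainder()
            .iter()
            .zip(b.chunks_exact(4).remainder())
            .map(|(x, y)| x * y)
            .sum();
        lanes.iter().sum::<f32>() + tail
    }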
reply