FilterHN

24 days ago

Yes, unaligned loads/stores are a niche feature with huge implications for processor design: a single access can straddle two cache lines with different residency, or two pages where only one of them faults, etc.
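
To make the cache-line/page point concrete, here is a minimal sketch of the check the hardware effectively has to make for every access: does it touch two different blocks? The 64-byte line and 4 KiB page sizes are assumptions for illustration (they vary by core), and the function names are hypothetical. An 8-byte load at offset 60 spans two lines, so the core may need two tag lookups; if it also crosses a page boundary, it needs two translations and can fault on only the second half.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes for illustration only; real values depend on the core. */
#define CACHE_LINE_SIZE 64u
#define PAGE_SIZE_4K    4096u

/* True if an access of `size` bytes at `addr` touches two different
 * `granule`-sized blocks. `granule` must be a nonzero power of two. */
static bool crosses_granule(uintptr_t addr, size_t size, uintptr_t granule)
{
    assert(size > 0 && granule != 0 && (granule & (granule - 1)) == 0);
    uintptr_t first = addr & ~(granule - 1);             /* block of first byte */
    uintptr_t last  = (addr + size - 1) & ~(granule - 1); /* block of last byte */
    return first != last;
}

/* Access needs two cache-line lookups. */
static bool splits_cache_line(uintptr_t addr, size_t size)
{
    return crosses_granule(addr, size, CACHE_LINE_SIZE);
}

/* Access needs two page translations (and either half may fault). */
static bool splits_page(uintptr_t addr, size_t size)
{
    return crosses_granule(addr, size, PAGE_SIZE_4K);
}
```

Aligned accesses of power-of-two size can never hit either case, which is exactly why hardware designers like them.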

This is the classic conundrum of legacy system redesign: if customers keep demanding that every feature of the old system be present and work exactly the same way, the new system will take on the baggage it was designed to get rid of.

Judged by that standard, the new implementation will be slow and buggy, and nobody will use it.

0x000xca0xfe

24 days ago

Unaligned loads/stores are crucial for zero-copy handling of mmap'ed data, network streams and all other kinds of space-optimized data structures.
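
A minimal sketch of what zero-copy reading of packed data looks like in portable C, assuming a hypothetical 7-byte record with no padding, written in host byte order (e.g. a file produced by the same program and mmap'ed back in). The `id` and `length` fields sit at misaligned offsets; `memcpy` of a fixed size is the standard-conforming way to express an unaligned load, and compilers typically turn it into a single load instruction on cores with fast unaligned access.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packed record (no padding, host byte order):
 *   offset 0: uint8_t  kind
 *   offset 1: uint32_t id      <- misaligned
 *   offset 5: uint16_t length  <- misaligned
 */
enum { REC_KIND = 0, REC_ID = 1, REC_LEN = 5, REC_SIZE = 7 };

/* Unaligned loads expressed with fixed-size memcpy. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static uint16_t load_u16(const unsigned char *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

struct record_view {
    uint8_t  kind;
    uint32_t id;
    uint16_t length;
};

/* Reads the fields straight out of the (e.g. mmap'ed) buffer,
 * without first copying the whole record into an aligned struct. */
static struct record_view read_record(const unsigned char *rec)
{
    struct record_view r;
    r.kind   = rec[REC_KIND];
    r.id     = load_u32(rec + REC_ID);
    r.length = load_u16(rec + REC_LEN);
    return r;
}
```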

If the CPU doesn't handle them, software must make many tiny conditional copies, which is bad for branch prediction.
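
For contrast, a sketch of the software fallback when the core can't (cheaply) do unaligned loads: each 4-byte field becomes four byte loads plus shifts and ORs instead of one load. The little-endian field format and the function names are assumptions for illustration. The other common fallback, testing the pointer's alignment and branching to a fast or slow path, is where the branch-prediction cost shows up, since the branch outcome depends on the data layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a little-endian uint32 from four single-byte loads.
 * Works at any alignment, at the cost of ~10 instructions per field. */
static uint32_t load_le32_bytewise(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Sum `count` consecutive little-endian uint32 values starting at p,
 * which may have any alignment. */
static uint64_t sum_le32_stream(const unsigned char *p, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += load_le32_bytewise(p + 4 * i);
    return sum;
}
```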

This hurts twice as much with variable-length vector operations... IMO fast unaligned memory accesses should have been mandatory, without exceptions, for all application-level profiles and for everything with vector support.

torginus

24 days ago

I think you can do this fairly efficiently with SSE on x86, since SSE/AVX have shift and shuffle instructions. Encoding/decoding packed data might even be faster this way.

I'm not familiar with RISC-V, but from what I've seen here they're trying to solve this in a similar way, with vector or bit-extraction instructions.
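
The shift-and-mask idea behind both approaches, as a scalar sketch: decode the i-th field from a stream of fixed-width bitfields by doing one (possibly unaligned) 64-bit load at the field's first byte, then shifting and masking. The LSB-first layout, the 57-bit width limit, and the requirement that the buffer be padded with 8 readable bytes past the last field are assumptions of this sketch; a SIMD version would do the same thing per lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unaligned little-endian 64-bit load (byte-swapped on big-endian hosts;
 * __builtin_bswap64 is a GCC/Clang builtin). */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = __builtin_bswap64(v);
#endif
    return v;
}

/* Extract field `i` from an LSB-first stream of `bits`-wide fields.
 * Assumes 1 <= bits <= 57 (so shift + bits <= 64) and at least 8
 * readable bytes starting at the field's first byte. */
static uint64_t get_field(const unsigned char *buf, size_t i, unsigned bits)
{
    assert(bits >= 1 && bits <= 57);
    size_t bit_off = i * bits;
    uint64_t word = load_le64(buf + bit_off / 8); /* usually misaligned */
    uint64_t mask = (UINT64_C(1) << bits) - 1;
    return (word >> (bit_off % 8)) & mask;
}
```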

0x000xca0xfe

24 days ago

Yes, because unaligned loads are no problem with SSE/AVX. On my RISC-V OrangePi, unaligned vector loads with elements wider than a byte fault, so you have to take extra care.
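
One form that "extra care" usually takes is alignment peeling: process a byte-wise head until the pointer is aligned, then only ever issue naturally aligned wide loads, then finish with a byte-wise tail. Below is a scalar sketch of the pattern (XOR of all bytes in a buffer, chosen only as a simple example); with vectors the same structure applies, or you fall back to byte-element loads, which don't fault. `__builtin_assume_aligned` is a GCC/Clang extension used here to tell the compiler the body loads are aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR of all bytes in [p, p + n). Every 8-byte load in the body is
 * naturally aligned, so it is safe on cores that fault on misaligned
 * wide accesses. */
static uint8_t xor_bytes(const unsigned char *p, size_t n)
{
    uint64_t acc = 0;  /* XOR of aligned 8-byte words */
    uint8_t bytes = 0; /* XOR of head and tail bytes */

    /* Head: byte by byte until p is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 7u) != 0) {
        bytes ^= *p++;
        n--;
    }
    /* Body: aligned 64-bit words. */
    for (; n >= 8; p += 8, n -= 8) {
        uint64_t w;
        memcpy(&w, __builtin_assume_aligned(p, 8), sizeof w);
        acc ^= w;
    }
    /* Tail: remaining bytes. */
    while (n > 0) {
        bytes ^= *p++;
        n--;
    }
    /* Fold the 8 byte lanes of acc into one byte (order-independent). */
    acc ^= acc >> 32;
    acc ^= acc >> 16;
    acc ^= acc >> 8;
    return (uint8_t)(acc ^ bytes);
}
```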

Unfortunately, for historical reasons AVX shifts and shuffles are mostly limited to 128-bit lanes (even the 256-bit instructions), and hardware support for AVX512/AVX10, where they fixed that, is a complete mess. That makes it hard to rely on when you care about backwards compatibility on consumer devices, e.g. in game development.
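
To illustrate the 128-bit lane limitation, here is a scalar reference model of the AVX2 256-bit byte shuffle (`_mm256_shuffle_epi8` / `vpshufb` on ymm registers): each output byte can only select from the 16 bytes of its own lane, using the low 4 bits of its index, and an index with the high bit set produces zero. The function name is hypothetical; it models the semantics only, not performance. A consequence: a full 32-byte reversal cannot be done with this shuffle alone and needs an extra cross-lane permute (e.g. AVX2's `vpermq`).

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of _mm256_shuffle_epi8: the two 128-bit lanes are
 * shuffled independently; bytes never cross from one lane to the other. */
static void shuffle_epi8_256_model(uint8_t out[32], const uint8_t src[32],
                                   const uint8_t idx[32])
{
    for (int i = 0; i < 32; i++) {
        int lane_base = (i / 16) * 16;   /* 0 for low lane, 16 for high lane */
        if (idx[i] & 0x80)
            out[i] = 0;                  /* high bit set: zero this byte */
        else
            out[i] = src[lane_base + (idx[i] & 0x0F)]; /* only 4 index bits used */
    }
}
```

Feeding it the "reverse everything" indices 31, 30, ..., 0 only reverses each 16-byte half in place.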

RISC-V vector has excellent mask/shuffle/permute instructions, but their performance on real silicon can be... questionable. See, for example, the vrgather timings here: https://camel-cdr.github.io/rvv-bench-results/spacemit_a100/...
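
For comparison with the AVX lane restriction, a scalar model of what `vrgather.vv` computes for 8-bit elements, based on the RVV specification's definition `vd[i] = (vs1[i] >= VLMAX) ? 0 : vs2[vs1[i]]`: any output element may come from any source element across the whole register group. Masking and tail/mask-agnostic policies are deliberately ignored, and the function name is hypothetical. This says nothing about timing; that is what the linked benchmark measures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of RVV vrgather.vv with 8-bit elements (unmasked).
 * vs2 holds vlmax source elements; vs1 holds vl indices. */
static void vrgather_vv_u8_model(uint8_t *vd, const uint8_t *vs2,
                                 const uint8_t *vs1, size_t vl, size_t vlmax)
{
    for (size_t i = 0; i < vl; i++)
        vd[i] = (vs1[i] >= vlmax) ? 0 : vs2[vs1[i]]; /* out-of-range index -> 0 */
}
```

Unlike the lane-restricted model of the AVX shuffle, the "reverse everything" index vector here really does reverse all 32 bytes.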

For working with packed data structures where fields are irregular, unpredictable, or dependent on previous fields, unaligned loads/stores are a godsend. The last time I worked on a custom DB engine that used these patterns, the generated x86 code was so much nicer than the code for our embedded ARM cores.
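
A sketch of the "fields dependent on previous fields" case, using a hypothetical record format `[u16 len][len payload bytes][u32 value]` (little-endian). Where the `value` field lands depends on the preceding length, so its alignment is effectively random; with cheap unaligned loads each field read is a single load, with no alignment checks. This is an illustrative format, not the engine's actual layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unaligned little-endian loads via memcpy (byte-swapped on big-endian
 * hosts; __builtin_bswap16/32 are GCC/Clang builtins). */
static uint16_t load_le16(const unsigned char *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = __builtin_bswap16(v);
#endif
    return v;
}

static uint32_t load_le32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = __builtin_bswap32(v);
#endif
    return v;
}

/* Walk records of the hypothetical form [u16 len][payload][u32 value],
 * summing each `value`. Returns the number of complete records parsed
 * and stops at the first truncated one. */
static size_t sum_record_values(const unsigned char *buf, size_t size,
                                uint64_t *sum)
{
    size_t pos = 0, count = 0;
    *sum = 0;
    while (size - pos >= 2) {
        size_t len = load_le16(buf + pos);
        if (size - pos - 2 < len + 4)
            break;                              /* truncated record */
        *sum += load_le32(buf + pos + 2 + len); /* offset depends on len */
        pos += 2 + len + 4;
        count++;
    }
    return count;
}
```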