Are the benchmark results in the README real? (The README itself feels very AI-generated)
Looking at the makefile, it tries to link the x86 SSE "baseline" implementation and the NEON version into the same binary. A real headscratcher!
Edit: The SSE impl gets shimmed via simd-everywhere, and the benchmark results do seem legit (aside from being slightly apples-to-oranges, but that's unavoidable)
For the baseline you need the SIMDe headers: https://github.com/simd-everywhere/simde/tree/master/simde. These alias x86 intrinsics to ARM intrinsics. The baseline is based on the previous state of the art (https://arxiv.org/abs/1209.2137), which happens to be x86-based; compiling it via SIMDe was the highest-integrity way I could think of to compare against the previous SOTA.
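Roughly, the mechanism looks like this (a minimal sketch assuming SIMDe's native-alias mode; the function is a made-up example, not the repo's code):

    #include <stdint.h>

    /* With native aliases enabled, SIMDe maps unmodified x86 intrinsics
       to NEON equivalents, so SSE code compiles on AArch64 as-is. */
    #define SIMDE_ENABLE_NATIVE_ALIASES
    #include <simde/x86/sse2.h>

    /* Made-up example: plain SSE2 code, buildable on ARM unmodified. */
    void add_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out) {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
    }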
Note: M1 chips specifically have notoriously bad small-shift performance, so the benchmark results will be very bad on your machine. M3 partially fixed this; M4 fixed it completely. My primary target is server-class rather than consumer-class hardware, so I'm not too worried about this.
The benchmark results were copy-pasted from the terminal. The README prose was AI-generated from my rough notes (I'm confident when communicating with other experts/researchers, less so when communicating with a general audience).
$ ./out/bytepack_eval
Bytepack Bench — 16 KiB, reps=20000 (pinned if available)
Throughput (GB/s)
K    NEON pack    NEON unpack    Baseline pack    Baseline unpack
1        94.77          84.05            45.01              63.12
2       123.63          94.74            52.70              66.63
3        94.62          83.89            45.32              68.43
4       112.68          77.91            58.10              78.20
5        86.96          80.02            44.32              60.77
6        93.50          92.08            51.22              67.20
7        87.10          79.53            43.94              57.95
8        90.49          92.36            68.99              83.88
I used a geometric mean to calculate the top-line "86 GB/s" for NEON pack/unpack; so that's 91 GB/s for the C4A repro. Probably going to leave the title unmodified.
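If you want to check the arithmetic, here's a throwaway snippet (mine, not part of the repo) that reproduces that number from the table above, pooling pack and unpack into one mean:

    #include <math.h>
    #include <stdio.h>

    /* Geometric mean over the 16 NEON results above (pack + unpack pooled).
       Prints ~91.1 GB/s for this C4A run. Link with -lm. */
    int main(void) {
        const double gbps[16] = {
            94.77, 123.63, 94.62, 112.68, 86.96, 93.50, 87.10, 90.49,  /* pack */
            84.05,  94.74, 83.89,  77.91, 80.02, 92.08, 79.53, 92.36,  /* unpack */
        };
        double log_sum = 0.0;
        for (int i = 0; i < 16; i++) log_sum += log(gbps[i]);
        printf("geomean: %.1f GB/s\n", exp(log_sum / 16));
        return 0;
    }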
Pretty sure anyone going into this kind of post about SIMD would prefer your writing to an LLM's.
There's a popular narrative that NEON has no movemask alternative. Some time ago I published an article on simulating popular bit-packing use cases with NEON in 1-2 instructions. It doesn't cover unpacking, but it works great for real-world patterns like compare+find, compare+iterate, and compare+test.
https://community.arm.com/arm-community-blogs/b/servers-and-...
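The best-known flavour of the trick, sketched from memory rather than copied from the article: one narrowing shift collapses a byte-compare into a 64-bit nibble mask.

    #include <arm_neon.h>
    #include <stdint.h>

    /* 'cmp' holds 0x00/0xFF per byte (e.g. from vceqq_u8). The result packs
       one nibble per lane, so mask != 0 tests for any match and
       __builtin_ctzll(mask) >> 2 gives the index of the first one. */
    static inline uint64_t neon_movemask(uint8x16_t cmp) {
        uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4);
        return vget_lane_u64(vreinterpret_u64_u8(nibbles), 0);
    }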
NEON in general is a bit sad, really; it's built around the idea of being implementable with a 64-bit ALU, and it shows. And SVE support is pretty much non-existent on the client.
I personally think calibrating ARM's ISA around a smaller VL was a good choice: you get much better IPC. And x86 ISAs have an almost-complete absence of support for 8-bit elements, so elements per instruction ends up tied anyway. NEON also kind-of-ish makes up for its small VL with multi-register TBL/TBX and LDP/STP.
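To illustrate the TBL point (hypothetical helper, not from the library): a single vqtbl4q_u8 indexes a 64-byte in-register table, so one shuffle reaches well past the 128-bit VL.

    #include <arm_neon.h>

    /* 64-byte in-register lookup in one instruction. Out-of-range
       indices (>= 64) return 0, which doubles as free masking. */
    static inline uint8x16_t lookup64(uint8x16x4_t table, uint8x16_t idx) {
        return vqtbl4q_u8(table, idx);
    }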
Also: AVX512 is just as non-existent on clients as SVE2, although that's not really relevant for the server-side targets I'm optimising for (mostly OLAP).
Also, AVX-512 is way more common than SVE2: all Zen 4 and Zen 5 parts support it.
E.g. in RVV, instead of vand.vv(a, vadd.vi(vsll.vv(1,k),-1)) you could do vandn.vv(a, vsll.vv(-1,k))
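In C intrinsics that would be something like this (untested sketch; assumes the Zvbb vandn intrinsics are available):

    #include <riscv_vector.h>

    /* Low-k-bits mask as a & ~((-1) << k): one instruction fewer than
       building (1 << k) - 1 first. Requires Zvbb for vandn. */
    vuint32m1_t low_bits(vuint32m1_t a, vuint32m1_t k, size_t vl) {
        vuint32m1_t ones = __riscv_vmv_v_x_u32m1(~0u, vl);      /* all-ones  */
        vuint32m1_t hi   = __riscv_vsll_vv_u32m1(ones, k, vl);  /* (-1) << k */
        return __riscv_vandn_vv_u32m1(a, hi, vl);               /* a & ~hi   */
    }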
AVX-512 can do this with any binary or ternary bitwise logic function via vpternlog
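For instance (my sketch; 0x30 should be the A AND NOT B truth table, worth double-checking against the manual):

    #include <immintrin.h>

    /* Same low-k-bits mask with AVX-512F: vpternlog folds the and-not
       into one op. imm 0x30 = A & ~B (bits 4 and 5 of the truth table). */
    static inline __m512i low_bits(__m512i a, __m512i k) {
        __m512i hi = _mm512_sllv_epi32(_mm512_set1_epi32(-1), k); /* (-1) << k per lane */
        return _mm512_ternarylogic_epi32(a, hi, hi, 0x30);        /* a & ~hi            */
    }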
I'm also skeptical of the "unified" paradigm: performance improvements are often realised by stepping away from generalisation and exploiting the specifics of a given problem; under a unified paradigm there's definite room for improvement vs Arrow, but that's very unlikely to bring you all the way to theoretically optimal performance.
If you loop through an array once and then iterate through it again, you can figure out where it will be cached based on the array size. (E.g. the 16 KiB buffer here fits comfortably in L1D, so after the first rep the remaining ones run entirely out of L1.)
The core I developed on (Neoverse V2) has 4 SIMD ports and 6 scalar integer ports, but only 2 of those scalar ports support multicycle integer operations like the insert variant of BFM (essential for scalar packing). More importantly, NEON processes 16 elements per instruction instead of 1.
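For a sense of the contrast, a generic scalar packer (my sketch, not the repo's code); the shift+or in the inner loop is what compilers lower to BFM/BFI on AArch64:

    #include <stddef.h>
    #include <stdint.h>

    /* Packs n k-bit values (k <= 8) into 64-bit words, one element per
       iteration. The acc |= v << bits is the BFI candidate. */
    static void pack_scalar(const uint8_t *in, uint64_t *out, size_t n, unsigned k) {
        uint64_t acc = 0;
        unsigned bits = 0;                 /* bits currently held in acc */
        size_t w = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t v = in[i] & ((1u << k) - 1);
            acc |= v << bits;
            bits += k;
            if (bits >= 64) {              /* flush a full output word */
                out[w++] = acc;
                bits -= 64;
                acc = bits ? v >> (k - bits) : 0;  /* straddling high bits */
            }
        }
        if (bits) out[w] = acc;            /* flush the tail */
    }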