BLIS even has mixed-precision interfaces, though they might not cover more exotic formats like low-precision integers. So this paper could have had a chance to “put some points on the board” against a real top-tier competitor.
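For concreteness, here is roughly what that looks like through BLIS's object API. This is a minimal sketch, assuming a BLIS build with mixed-datatype support enabled; the matrix sizes and operand types are just illustrative.

```c
// Hedged sketch: mixed-precision GEMM via BLIS's object API.
// float A and B accumulated into a double C in a single bli_gemm call.
#include "blis.h"

int main(void) {
    bli_init();

    dim_t m = 256, n = 256, k = 256;
    obj_t a, b, c;

    // Strides of 0,0 let BLIS pick its default (column-major) layout.
    bli_obj_create(BLIS_FLOAT,  m, k, 0, 0, &a);
    bli_obj_create(BLIS_FLOAT,  k, n, 0, 0, &b);
    bli_obj_create(BLIS_DOUBLE, m, n, 0, 0, &c);
    bli_randm(&a); bli_randm(&b); bli_randm(&c);

    // C := 1*A*B + 0*C with mixed operand precisions
    // (requires a BLIS build configured with mixed-datatype support).
    bli_gemm(&BLIS_ONE, &a, &b, &BLIS_ZERO, &c);

    bli_obj_free(&a); bli_obj_free(&b); bli_obj_free(&c);
    bli_finalize();
    return 0;
}
```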
> Libraries such as BLIS [19] lack SME support and are therefore excluded from comparison.
Maybe you want a comparison anyway, but it won't be competitive. On Apple CPUs, SME is ~8x faster than a single regular CPU core running a good BLAS library.
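For a rough idea of how that kind of number gets measured, here is a single-threaded GFLOP/s sketch you could link once against Accelerate (whose sgemm dispatches to the matrix unit on Apple silicon) and once against BLIS or OpenBLAS. The matrix size and iteration count are arbitrary, and the exact header and link flags depend on the library.

```c
// Rough single-core SGEMM throughput check: time repeated cblas_sgemm calls
// and report GFLOP/s (2*n^3 flops per call). Not a rigorous benchmark.
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int n = 2048, iters = 20;
    float *A = malloc(sizeof(float) * n * n);
    float *B = malloc(sizeof(float) * n * n);
    float *C = malloc(sizeof(float) * n * n);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0f; B[i] = 0.5f; C[i] = 0.0f; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < iters; it++)
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.1f GFLOP/s\n", 2.0 * n * n * n * iters / secs / 1e9);

    free(A); free(B); free(C);
    return 0;
}
```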
The Apple Silicon CPU Optimization Guide has a lot of great information on SME and SSVE, along with more general guidance on optimizing for Apple's CPUs.
A few quotes from Apple's guide that are particularly relevant to SSVE, from "SSVE Vector Execution Unit Optimization":
> Broadly, this unit is designed to support long vector and matrix operations performed on ZA storage _in the SME Processing Grid_.
> Recommendation: Use SSVE in a supporting role to enable high throughput SME grid computation.
> [Magnitude: High | Applicability: High] SSVE offers wide 64B vectors. While the ISA includes instructions that can operate on multi-vectors, the throughput is often only one 64B vector per cycle. Use SSVE to enable SME, which offers higher parallelism.
> Because of non-speculative execution, communication latencies, and in some cases long memory and computation latencies, SME engine instructions trail execution in the core by dozens to thousands of cycles. Any core compute instructions that consume data produced by the SME engine may have to wait an indeterminate (but long) amount of time for the data to arrive.
Since SSVE is slow, I was hoping that SME instructions could be used in a vector-like fashion (e.g. adding two matrices with high throughput, or a Hadamard/element-wise product), but it seems most matrix accelerator ISAs don't offer that.
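For contrast, here is a minimal sketch of what that element-wise work looks like with ordinary SVE ACLE intrinsics; the same loop shape applies in streaming mode, which additionally assumes an SME-capable toolchain and the `__arm_streaming` attribute. Per the quote above, a loop like this tops out around one 64-byte vector per cycle instead of using the ZA grid's parallelism.

```c
// Element-wise matrix add C = A + B over flattened storage, using SVE
// predicated loads/stores so the tail needs no special-casing.
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

void matrix_add(float *c, const float *a, const float *b, size_t n_elems) {
    for (size_t i = 0; i < n_elems; i += svcntw()) {
        // Predicate covers the remaining elements (all-true except on the tail).
        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)n_elems);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, c + i, svadd_f32_x(pg, va, vb));
    }
}
```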
GEMMs are dense operations with O(N^3) work and roughly the same access pattern and data-reuse properties across all matrices. Of course, I'm simplifying things a lot here; tall-skinny and short-fat shapes are much harder to get performance out of, but the spirit of the approach is the same as for big square matrices.
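The loop nest I have in mind is just the textbook one. This naive sketch is not how a BLAS library actually implements GEMM, but it shows the 2*M*N*K flops and the reuse that blocked kernels exploit.

```c
// Naive row-major GEMM: C += A * B. O(M*N*K) work on O(M*K + K*N + M*N) data,
// so every element is touched many times; tiling turns that reuse into
// cache and register locality.
void gemm_naive(int M, int N, int K,
                const float *A, const float *B, float *C) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float acc = C[i * N + j];
            for (int k = 0; k < K; k++)
                // A[i][k] is reused for all N columns, B[k][j] for all M rows.
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}
```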
Sparse LU solves have a different character. There is nowhere near O(N^3) work; you typically expect something closer to O(N^2), but getting performance out of these operations is notoriously difficult because it depends heavily on the sparsity pattern of the linear system. Making matters worse, you may commonly have a sparse A whose fill-in during factorisation produces dense L and/or U factors.
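A minimal sparse forward-substitution sketch (CSR storage, with the diagonal assumed to be stored last in each row; both are just conventions picked for illustration) shows why: the work and memory access pattern are dictated entirely by row_ptr/col_idx, and if fill-in makes L dense you get O(N^2) data traffic with none of GEMM's reuse.

```c
// Solve L x = b where L is lower-triangular in CSR format.
// Work per row is row_ptr[i+1] - row_ptr[i]; the col_idx indirection makes the
// reads from x irregular gathers, so performance is set by the sparsity pattern.
void sparse_lower_solve(int n, const int *row_ptr, const int *col_idx,
                        const double *val, const double *b, double *x) {
    for (int i = 0; i < n; i++) {
        double sum = b[i];
        int j = row_ptr[i];
        // Off-diagonal entries of row i (diagonal assumed stored last in the row).
        for (; j < row_ptr[i + 1] - 1; j++)
            sum -= val[j] * x[col_idx[j]];
        x[i] = sum / val[j];  // divide by the diagonal entry L(i,i)
    }
}
```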