As an HPC developer, it breaks my heart how much worse academic software performance is compared to vendor libraries (from Intel or Nvidia). We need to start aiming much higher.
It did make my defense a lot easier because I could just point at the graphs and say “see I beat MKL, whatever I did must work.” But I did a lot of little MPI tricks and tuning, which doesn’t add much to the scientific record. It was fun though.
I don’t know. Mixed feelings. To some extent I don’t really see how somebody could put all the effort into getting a PhD and not go on a little “I want to tune the heck out of these MPI routines” jaunt.
The lack of a “blas/lapack/sparse equivalents that can dispatch to GPU or CPU” is really annoying. You’d think this would be somewhat “easy” (lol, nothing is easy), in the sense that we’ve got a bunch of big chunky operations…
[1] https://github.com/PASSIONLab/OpenEquivariance
They're optimising for different things really.
Intel/Nvidia have the resources to (a) optimise across a wide range of hardware in their libraries, (b) use less-well-documented features, and (c) do it all without making their source code publicly accessible.
Take MKL for example: it's a great library, and implementing dynamic dispatch for all the different processor types is a big part of why it gets such good performance across x86-64 machines. It's not running the same code on each processor. No academic team can really compete with that.