Show HN: RunMat – runtime with auto CPU/GPU routing for dense math
13 points
2 hours ago
| 1 comment
| github.com
Hi, I’m Nabeel. In August I released RunMat, an open-source runtime for MATLAB code; it was already much faster than GNU Octave on the workloads I tried. https://news.ycombinator.com/item?id=44972919

Since then, I’ve taken it further with RunMat Accelerate: the runtime now automatically fuses operations and routes work between CPU and GPU. You write MATLAB-style code; RunMat decides where and how to run it. No CUDA, no kernel code.

Under the hood, it builds a graph of your array math, fuses long chains into a few kernels, keeps data on the GPU when that helps, and falls back to CPU JIT / BLAS for small cases.
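
To make that concrete, here’s the kind of chain that can fuse (an illustrative MATLAB-style snippet, not one of the benchmarks):

    x = rand(1e8, 1);            % large enough that GPU routing may kick in
    y = tanh(cos(exp(sin(x))));  % one elementwise chain -> candidate for a single fused kernel
    s = sum(y);                  % the reduction at the end materializes the result

Naively, each call above would allocate its own temporary array; fusing the chain means one pass over the data instead of four.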

On an Apple M2 Max (32 GB), here are some current benchmarks (median of several runs):

* 5M-path Monte Carlo: RunMat ≈ 0.61 s, PyTorch ≈ 1.70 s, NumPy ≈ 79.9 s → ~2.8× faster than PyTorch and ~130× faster than NumPy on this test (simplified sketch after this list).

* 64 × 4K image preprocessing pipeline (mean/std, normalize, gain/bias, gamma, MSE): RunMat ≈ 0.68 s, PyTorch ≈ 1.20 s, NumPy ≈ 7.0 s → ~1.8× faster than PyTorch and ~10× faster than NumPy.

* 1B-point elementwise chain (sin/exp/cos/tanh mix): RunMat ≈ 0.14 s, PyTorch ≈ 20.8 s, NumPy ≈ 11.9 s → ~140× faster than PyTorch and ~80× faster than NumPy.
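
For a sense of what the Monte Carlo case looks like, here’s a simplified sketch (I’m assuming a geometric-Brownian-motion option payoff here; the exact benchmark scripts are in the repo):

    % Simplified 5M-path Monte Carlo: price a European call via GBM.
    N = 5e6; S0 = 100; K = 105; r = 0.03; sigma = 0.2; T = 1;
    Z = randn(N, 1);                                   % one normal draw per path
    ST = S0 * exp((r - 0.5*sigma^2)*T + sigma*sqrt(T)*Z);
    payoff = max(ST - K, 0);                           % call payoff per path
    price = exp(-r*T) * mean(payoff);                  % discounted average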

If you want more detail on how the fusion and CPU/GPU routing work, I wrote up a longer post here: https://runmat.org/blog/runmat-accel-intro-blog

You can run the same benchmarks yourself from the GitHub repo in the main HN link. Feedback, bug reports, and “here’s where it breaks or is slow” examples are very welcome.

constantcrying
54 minutes ago
Writing a (somewhat?) MATLAB-compatible interpreter and runtime that targets GPU and CPU simultaneously is certainly impressive.

But, who is this for? Matlab users? Python users? Julia users? Do you have an aim with this project or is it just for fun?

salvesefu
39 minutes ago
From the website: "If you write math in MATLAB and hit performance walls on CPU, RunMat is built for you."
nallana
20 minutes ago
Thanks!! It was originally for Octave users whose scripts were running painfully slow.

The goal was to keep the MATLAB front-end syntax, but run it fast.

When we dug into why people were still using Octave, it was because it let them focus on their math and was easier for them to read. That was especially important for people who aren’t programmers, e.g. scientists and engineers.

I suppose this is also why we write in higher-level languages than assembly.

The goal of this project is now: build the fastest runtime in the world for running math.

It turned out the MATLAB syntax offers a large amount of compile-time hinting (it is meant for capturing math intent, after all).

As we built this, we found that if we take a domain-specific approach (i.e. make every optimization for what’s best for people who want to focus on the math), we can outperform general-purpose languages like Python by a wide margin on that math.

For example, internals like keeping tensor shapes and broadcasting intent in the AST, and having the computation graph available to decide when GPU offload is profitable, don’t make practical sense to build into a general-purpose runtime like Python. But they let RunMat speed up elementwise math by orders of magnitude (e.g. 1B points going through 5-6 elementwise ops like sin/cos/+/- run ~80x faster on my MBP vs Python/PyTorch).
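
To illustrate the size-based routing (illustrative code; the actual thresholds live in the planner, and "likely CPU/GPU" below is my shorthand, not a documented guarantee):

    % Same fused elementwise chain at two sizes. Because shapes are known
    % in the graph, the planner can keep the small case on CPU (a kernel
    % launch wouldn't pay for itself) and ship the large one to the GPU.
    f = @(m) m .* 2 + sin(m);
    small = f(rand(256, 1));       % likely CPU
    big   = f(rand(2e4, 2e4));     % likely GPU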

So, tl;dr: it started for Octave users. The goal now is to build the fastest runtime for math, for anyone who wants to use computers to do math.

Obligatory disclosure because we’re engineers: you can still go faster by writing your own CUDA / GPU code. We’re betting 99% of the people who are trying to run math on computers don’t want to do that (ML community notwithstanding).
