Optimizing the Ion compiler back end
159 points | 1 day ago | 9 comments | spidermonkey.dev | HN
rpearl
1 day ago
Wow, that dominator tree algorithm I wrote as an intern (https://bugzilla.mozilla.org/show_bug.cgi?id=666426) seems to have lasted 13 years! At the time we assumed that graphs would be small enough to not warrant the constant factor against Lengauer-Tarjan. ...And of course, browser-based apps have gotten much bigger since then.

Semi-NCA hadn't even been published yet and seems like the clear choice nowadays, so I'm glad to see it in place! Great work!

mananaysiempre
1 day ago
> Semi-NCA hadn't even been published yet and seems like the clear choice nowadays[.]

For those who are awkwardly lingering and casting longing glances at the entrance door of compiler engineering like I am, and who were just as dismayed by this sentence: it wasn’t “properly” published, but it looks to have been described in a thesis from 2005[1] and in an extended abstract (ugh) before that[2].

But also, the reduction of RMQ to NCA, really?.. Ouch. I’m having flashbacks to my (very brief) competitive programming days, and not the good kind.

[1] https://www.cs.princeton.edu/research/techreps/TR-737-05

[2] https://www.cse.uoi.gr/~loukas/index.files/dominators_soda04...
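
The reduction alluded to above can be sketched as toy code: build a min-rooted Cartesian tree over the array, and a range-minimum query becomes the nearest common ancestor of the two endpoint leaves. This is purely illustrative (all names are mine), with a deliberately naive NCA walk:

```python
# Toy sketch: build a min-rooted Cartesian tree over the array, then
# RMQ(i, j) is the value at the nearest common ancestor of leaves i and j.

def build_cartesian_tree(a):
    """Return parent[] of the min-rooted Cartesian tree (O(n) stack build)."""
    parent = [-1] * len(a)
    stack = []  # indices whose values increase from bottom to top
    for i in range(len(a)):
        last = -1
        while stack and a[stack[-1]] > a[i]:
            last = stack.pop()
        if stack:
            parent[i] = stack[-1]
        if last != -1:
            parent[last] = i
        stack.append(i)
    return parent

def rmq(a, parent, i, j):
    """Range minimum via a deliberately naive NCA walk (demo only)."""
    ancestors = set()
    while i != -1:
        ancestors.add(i)
        i = parent[i]
    while j not in ancestors:
        j = parent[j]
    return a[j]

a = [5, 2, 7, 1, 9, 4]
parent = build_cartesian_tree(a)
assert rmq(a, parent, 0, 2) == min(a[0:3])  # 2
assert rmq(a, parent, 2, 5) == min(a[2:6])  # 1
```

A real implementation would answer the NCA queries in constant time; the point here is just the shape of the reduction.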

DannyBee
1 day ago
It was published prior to that in a paper named "Finding dominators in practice", published in ESA 2004[1].

For once, the title is not an oversell: it covers that topic quite well for a conference paper.

[1] https://link.springer.com/chapter/10.1007/978-3-540-30140-0_...

jandem
1 day ago
Nice to hear from you! Yes, it was a very reasonable choice back then, before (enormous) asm.js and Wasm functions.

The compiler wasn't really designed for these huge graphs, but fortunately a lot of the issues can be fixed incrementally and it's still holding up well.

DannyBee
1 day ago
Amusingly, the LLVM implementation was also written by an intern at one point :)

Semi-NCA was actually published back then, just not in the cited paper.

See "Finding dominators in practice", published in 2004, section 2.3. So two decades ago.

The main advantage of Semi-NCA (and why LLVM uses it) is that it makes for a good incremental update algorithm. See https://reviews.llvm.org/D34258

Truthfully, as much as I love Cooper and friends (they were responsible for a lot of very thoughtful, well-engineered algorithms at a time when lots of stuff was just "here's some research I think is better"), the "simple" dataflow-based algorithm was never worth it.
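
For anyone who hasn't seen it, that "simple" iterative scheme (the Cooper–Harvey–Kennedy formulation) can be sketched in a few lines. This is a toy illustration, not code from any of the compilers discussed:

```python
# Toy version of the iterative dominance algorithm: process blocks in
# reverse postorder, intersecting the predecessors' idoms until a fixed
# point. `succ` maps block id -> successor ids; `entry` is the start block.

def dominators(succ, entry=0):
    post, order, visited = {}, [], set()

    def dfs(b):  # postorder numbering
        visited.add(b)
        for s in succ.get(b, []):
            if s not in visited:
                dfs(s)
        post[b] = len(order)
        order.append(b)

    dfs(entry)

    preds = {b: [] for b in visited}
    for b in visited:
        for s in succ.get(b, []):
            preds[s].append(b)

    idom = {entry: entry}

    def intersect(x, y):  # walk both chains up to the common dominator
        while x != y:
            while post[x] < post[y]:
                x = idom[x]
            while post[y] < post[x]:
                y = idom[y]
        return x

    changed = True
    while changed:
        changed = False
        for b in reversed(order):  # reverse postorder
            if b == entry:
                continue
            ps = [p for p in preds[b] if p in idom]
            new = ps[0]
            for p in ps[1:]:
                new = intersect(new, p)
            if idom.get(b) != new:
                idom[b] = new
                changed = True
    return idom

# Diamond CFG: 0 branches to 1 and 2, both rejoin at 3; 0 dominates all.
idom = dominators({0: [1, 2], 1: [3], 2: [3]})
assert idom[1] == idom[2] == idom[3] == 0
```

It is tiny and fast on typical graphs, which is its appeal; the thread's point is that Semi-NCA matches it while also supporting incremental updates.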

Part of the thing I was good at way back then was keeping track of all of the research in compiler opts and which of it was any good - most of their stuff is very good.

I used to go through every paper that came around and keep an up-to-date library of ones worth looking at (both now and in the future) that a bunch of folks used. This was harder back in the days of nobody really publishing code; I used to have to write a ton of prototype implementations to see which numbers were real and which were BS because they compared against crappy implementations or whatever.

Semi-NCA was an obvious win - it was simple enough to implement and test, as fast as what already existed, and could easily be extended to incremental updates.

If you want to see what it takes to do incremental updates with LT, take a look at GCC's dominator update code back around that time period (I think it's unchanged since then, actually, but I haven't looked in a few years). There were a fairly small number of people who could understand the data structures and algorithms involved.

koala_man
1 day ago
> extremely large functions

> quadratic behavior

*High five*

I fixed several of these during my time as a compiler engineer too.

It's true what they say: quadratic is the worst possible time complexity. Fast enough to work on all your test cases, slow enough to explode in prod.
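
A toy illustration of how such a bug typically looks (names are mine, not from the article):

```python
# A classic accidental quadratic: draining a work list from the front.
# list.pop(0) shifts every remaining element, so draining n items costs
# O(n^2) overall; collections.deque makes the identical loop linear.
from collections import deque

def drain_slow(items):
    work = list(items)
    out = []
    while work:
        out.append(work.pop(0))  # O(n) shift per pop -> O(n^2) total
    return out

def drain_fast(items):
    work = deque(items)
    out = []
    while work:
        out.append(work.popleft())  # O(1) per pop -> O(n) total
    return out

# Identical results; only the asymptotics differ.
assert drain_slow(range(1000)) == drain_fast(range(1000)) == list(range(1000))
```

Both versions pass any small test; only the slow one melts down on a million-element input.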

kibwen
1 day ago
What is MIR in this context? On the one hand, given the mention of Cranelift, it seems like it could be referring to the Rust compiler's intermediate representation, but given the context perhaps it's referring to an independent intermediate representation that also just happens to be called MIR?
throwaway17_17
1 day ago
MIR is the compiler intermediate representation used by Ion. I know the IonMonkey docs say the acronym stands for Mid-level Intermediate Representation.
pjmlp
23 hours ago
MIR is quite overloaded; besides that and Rust, it is also the new modern IR model used by LLVM languages going forward.
pshc
1 day ago
In modern times I seldom reach for a linked list... cache-friendly data structures almost always win.
dist1ll
1 day ago
Yup. Almost every DS in my compiler is either an array or a hashmap.
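
One common shape of that style, sketched as toy code (the names are illustrative, not from any real compiler): basic blocks live in one flat array and refer to each other by integer index rather than by pointer-chasing through linked nodes.

```python
# Index-based CFG sketch: blocks are stored contiguously and a "block
# reference" is just an int, which keeps traversals cache-friendly and
# makes the whole graph trivially serializable.
from dataclasses import dataclass, field

@dataclass
class Block:
    instrs: list = field(default_factory=list)
    succs: list = field(default_factory=list)  # indices into Cfg.blocks

@dataclass
class Cfg:
    blocks: list = field(default_factory=list)

    def add_block(self) -> int:
        self.blocks.append(Block())
        return len(self.blocks) - 1  # the handle is just an int

    def add_edge(self, src: int, dst: int):
        self.blocks[src].succs.append(dst)

cfg = Cfg()
entry, then, done = cfg.add_block(), cfg.add_block(), cfg.add_block()
cfg.add_edge(entry, then)
cfg.add_edge(entry, done)
cfg.add_edge(then, done)
assert cfg.blocks[entry].succs == [then, done]
```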
tlb
23 hours ago
> control flow graph contained 132856 basic blocks

That is a stunningly large function. I've looked around the onnxruntime sources and can't find anything like it. The largest C file is under 6000 lines. Does anyone know what function it's referring to?

flohofwoe
21 hours ago
IME Clang/LLVM can be extremely aggressive about inlining. Some of my home computer emulators are (almost) collapsed into a single massive function (the compiler can see all function bodies in my build setup even without LTO).

Also, the CPU emulator tick function, which is a single huge switch statement with several thousand case branches, was actually running into a slow path in Clang a couple of years ago (it took something like 30 seconds to build that one source file). This was fixed at some point, though.

I also seem to remember that Emscripten had a build setting to break up large WASM functions into smaller snippets to work around such quadratic outliers in browser WASM engines.

cjblomqvist
1 day ago
If anyone has a comparison with V8, that would be great!
emmanueloga_
1 day ago
Not sure what kind of comparison you mean, but you can compare desktop browsers with [1].

I just ran it on my mac M2 Max and got:

    (F)irefox 131.0.3
    (E)dge 129.0, V8 12.9.17.8
    (S)afari 18.0 (20619.1.26.31.6)

    Speedometer 3.0
    (F) 25.9  
    (E) 22.3
    (S) 30.9

    JetStream2
    (F) 251.787
    (E) 350.74
    (S) 360.568
Safari seems slightly faster in all benchmarks. I did not run motionmark because it takes forever :-/. The page says JetStream2 is what you want if you want to benchmark wasm.

How this relates to TFA, no idea... it's not really easy to tell which version of SpiderMonkey is running on the installed Firefox.

--

1: https://browserbench.org/

KwanEsq
1 day ago
Spidermonkey just follows Firefox version numbering, so far as I know, and the linked bugs in the article seem to have landed in a mix of the 132 and 133 milestones, so you'll have to wait a couple of release cycles for the full effect.
flohofwoe
21 hours ago
AFAIK Chrome already does the "function by function" compilation approach that's hinted at near the end of the article.
kevingadd
1 day ago
https://arewefastyet.com/ has various comparisons between Firefox and Chrome; the script-oriented ones are basically Spidermonkey vs V8.
zyedidia
1 day ago
Is there any AOT WebAssembly compiler that can compile Wasm used by websites? I tried locally compiling the Photoshop Wasm module mentioned in the article but the compilers I tried (Wasmtime, wasm2c, WAMR) all complained about some unsupported Wasm extension/proposal being required (exceptions seems like the blocker on wasmtime, and the others gave cryptic error messages).

Is it really the case that browsers have default-enabled all sorts of extensions that are not yet widely supported by the rest of the ecosystem?

aseipp
1 day ago
No, I think most of the AOT compilers are in practice a bit behind V8 and/or Spidermonkey for newer features. Realistically, most development driving new WASM features is motivated by website-ish use cases. Exception handling in particular is still not standardized, so I guess it's expected that the browser engines would be the ones to have the most evolving support (and the ability to test it thoroughly), because people want that platform as their inevitable target anyway.
Dylan16807
1 day ago
> Is it really the case that browsers have default-enabled all sorts of extensions that are not yet widely supported by the rest of the ecosystem?

I don't know the answer, but it would be hard to blame them for following normal browser development practices on the standard they created for the purpose of being in browsers.

zyedidia
1 day ago
Fair enough. I think it would be unfortunate if the WebAssembly language in browsers were a significantly different language than WebAssembly outside of browsers (just referring to language itself, not the overall runtime system). I don't think that has quite happened, and the outer ecosystem can probably catch up, but it worries me.
titzer
1 day ago
Non-browser environments are a little behind on the Wasm standard, but not by much. E.g. wasmtime has now landed support for Wasm GC. AFAIK they implement all phase 4 proposals. Wizard implements all the Phase 4 proposals as well. The Wasm 3.0 spec will be out soon, which will be a big milestone to motivate Wasm engines outside the Web to catch up.
pjmlp
22 hours ago
We have had plenty of bytecode formats outside the browser ever since UNCOL was proposed in 1958, including as a replacement for Assembly on microcoded CPUs.

Now we get a couple of startups trying to make WebAssembly outside of the browser happen as if it were a novel idea, never done before.

kevingadd
1 day ago
> Is there any AOT WebAssembly compiler that can compile Wasm used by websites?

This doesn't really make sense, given that the wasm used by websites is going to import a bunch of JS functions as dependencies. You're not going to have those available in any native environment.

> Is it really the case that browsers have default-enabled all sorts of extensions that are not yet widely supported by the rest of the ecosystem?

Yes

Photoshop in particular is a good example of a bleeding edge wasm app - browsers had to relax restrictions on things like function pointer table size in order for it to work. So I wouldn't expect it to build and run anywhere outside of v8 or spidermonkey.

zyedidia
1 day ago
I think one of the interesting aspects of WebAssembly compared to JavaScript is that it can be efficiently AOT-compiled. I've been interested in investigating AOT compilation for a browser (perhaps there is a distant/alternative future where web browsing does not require a JIT), but maybe Wasm AOT compilers aren't really there yet.
kevingadd
1 day ago
Functionally, what browsers do under the hood is AOT compilation, but not in the way that e.g. Wasmtime does it. The following is my knowledge and may no longer be accurate, but this is the sort of model we planned for when designing WASM to begin with:

Browsers do an on-demand 'first tier' compilation for fast startup, and in the background, using threads, they do a high-quality AOT compilation of the whole module. That high-quality AOT compilation is cached and used for future page loads.

It is possible to use a JIT model for this but AFAIK it is typically not done. In some configurations (e.g. lockdown mode) WASM is interpreted instead of AOT compiled.

hencq
1 day ago
> It is possible to use a JIT model for this but AFAIK it is typically not done.

Isn't this what the last line of the article is hinting at?

> our WebAssembly team is making great progress rearchitecting the Wasm compiler pipeline. This work will make it possible to Ion-compile individual Wasm functions as they warm up instead of compiling everything immediately.
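
That warm-up-based tier-up can be modeled with a toy sketch (the threshold, names, and string "tiers" here are stand-ins for illustration, not how SpiderMonkey actually implements it):

```python
# Toy model of warm-up-based tiering: every function starts on a cheap
# baseline tier, and call counts trigger a swap to an optimized tier.

HOT_THRESHOLD = 3  # illustrative warm-up count

class TieredFunction:
    def __init__(self, source):
        self.tier = "baseline"  # fast to produce, slower to run
        self.source = source
        self.calls = 0

    def call(self):
        self.calls += 1
        if self.tier == "baseline" and self.calls >= HOT_THRESHOLD:
            self.tier = "optimized"  # recompile once the function is warm
        return self.tier

f = TieredFunction("(func $hot ...)")
assert [f.call() for _ in range(4)] == [
    "baseline", "baseline", "optimized", "optimized"]
```

The payoff is that cold functions never pay for the expensive compile at all.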

tombl
1 day ago
I believe this is still true. Originally you could store a compiled wasm module in IndexedDB and cache it manually, but that was removed in favor of {instantiate,compile}Streaming, which take an HTTP response and hook into the existing HTTP caching layer, which was already storing the compilation output for JS.
icsa
1 day ago
Moral of the story:

First, use an array. If an array doesn't work, try something else.

kevingadd
1 day ago
It's definitely the case that in many scenarios the right data structure is an array, since your work will be dominated by gets and sets.

However, the OP has one scenario where the opposite was true - they were using a dense bitset that needed to be obscenely huge because of degenerate code in the wasm module, so swapping to a sparse container was a win.

In the end you just have to profile and understand your data.
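
A small sketch of the tradeoff being described (the IDs and sizes are invented for illustration):

```python
# Dense-vs-sparse: a dense bitset pays for the whole ID range, while a
# sparse container pays per member actually present.

members = {7, 1_000_000}  # two live IDs, one with a huge value

# Dense: one bit per possible ID up to the max -> ~125 KB for two set bits.
dense = bytearray(max(members) // 8 + 1)
for i in members:
    dense[i // 8] |= 1 << (i % 8)

def dense_contains(i):
    return bool(dense[i // 8] & (1 << (i % 8)))

# Sparse: storage proportional to membership, not to the ID range.
sparse = set(members)

assert dense_contains(7) and dense_contains(1_000_000) and not dense_contains(8)
assert 7 in sparse and 8 not in sparse
assert len(dense) == 125_001  # bytes spent just to cover the range
```

With dense IDs and high occupancy the bitset wins easily; degenerate inputs flip the balance, which is exactly the "profile and understand your data" point.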

icsa
1 day ago
To your point, profile your data as you would your code.

A sorted array of bit locations would represent a sparse bit set well enough to start, with O(N) storage and O(log N) access. Once the sets became large and/or dense, another data structure could be considered.
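
That suggestion can be sketched with the standard library (a toy version; the class name is mine):

```python
# Sparse bit set as a sorted array of set-bit positions: O(N) storage in
# the number of set bits, O(log N) membership via binary search.
import bisect

class SortedBitSet:
    def __init__(self):
        self.bits = []  # sorted positions of set bits

    def add(self, pos):
        i = bisect.bisect_left(self.bits, pos)
        if i == len(self.bits) or self.bits[i] != pos:
            self.bits.insert(i, pos)  # O(N) insert; fine while sets stay small

    def __contains__(self, pos):
        i = bisect.bisect_left(self.bits, pos)
        return i < len(self.bits) and self.bits[i] == pos

s = SortedBitSet()
for p in (70, 3, 1_000_000, 3):
    s.add(p)
assert s.bits == [3, 70, 1_000_000]
assert 70 in s and 4 not in s
```

Once the set grows large or dense, the O(N) inserts start to hurt and a different structure (a real bitset, a hash set, a B-tree) becomes worth considering, as the parent says.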

mgaudet
1 day ago
It's too bad the title prefix "75x Faster" got dropped.