Optimizing the Ion compiler back end
159 points | 1 day ago | 9 comments | spidermonkey.dev | HN
rpearl
1 day ago
Wow, that dominator tree algorithm I wrote as an intern (https://bugzilla.mozilla.org/show_bug.cgi?id=666426) seems to have lasted 13 years! At the time we assumed that graphs would be small enough to not warrant the constant factor against Lengauer-Tarjan. ...And of course, browser-based apps have gotten much bigger since then.

Semi-NCA hadn't even been published yet and seems like the clear choice nowadays, so I'm glad to see it in place! Great work!

mananaysiempre
1 day ago
> Semi-NCA hadn't even been published yet and seems like the clear choice nowadays[.]

For those who are awkwardly lingering and casting longing glances at the entrance door of compiler engineering like I am, and who were just as dismayed by this sentence: it wasn’t “properly” published, but it looks to have been described in a thesis from 2005[1] and in an extended abstract (ugh) before that[2].

But also, the reduction of RMQ to NCA, really?.. Ouch. I’m having flashbacks to my (very brief) competitive programming days, and not the good kind.

[1] https://www.cs.princeton.edu/research/techreps/TR-737-05

[2] https://www.cse.uoi.gr/~loukas/index.files/dominators_soda04...
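
The reduction alluded to above can be sketched as toy code: build a min-rooted Cartesian tree over the array, and a range-minimum query becomes the nearest common ancestor of the two endpoint leaves. This is purely illustrative (all names are mine), with a deliberately naive NCA walk:

```python
# Toy sketch: build a min-rooted Cartesian tree over the array, then
# RMQ(i, j) is the value at the nearest common ancestor of leaves i and j.

def build_cartesian_tree(a):
    """Return parent[] of the min-rooted Cartesian tree (O(n) stack build)."""
    parent = [-1] * len(a)
    stack = []  # indices whose values increase from bottom to top
    for i in range(len(a)):
        last = -1
        while stack and a[stack[-1]] > a[i]:
            last = stack.pop()
        if stack:
            parent[i] = stack[-1]
        if last != -1:
            parent[last] = i
        stack.append(i)
    return parent

def rmq(a, parent, i, j):
    """Range minimum via a deliberately naive NCA walk (demo only)."""
    ancestors = set()
    while i != -1:
        ancestors.add(i)
        i = parent[i]
    while j not in ancestors:
        j = parent[j]
    return a[j]

a = [5, 2, 7, 1, 9, 4]
parent = build_cartesian_tree(a)
assert rmq(a, parent, 0, 2) == min(a[0:3])  # 2
assert rmq(a, parent, 2, 5) == min(a[2:6])  # 1
```

A real implementation would answer the NCA queries in constant time; the point here is just the shape of the reduction.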

DannyBee
1 day ago
It was published prior to that in a paper named "Finding dominators in practice", published in ESA 2004[1].

For once, the title is not an oversell: it covers that topic quite well for a conference paper.

[1] https://link.springer.com/chapter/10.1007/978-3-540-30140-0_...

jandem
1 day ago
Nice to hear from you! Yes, it was a very reasonable choice back then, before (enormous) asm.js and Wasm functions.

The compiler wasn't really designed for these huge graphs, but fortunately a lot of the issues can be fixed incrementally and it's still holding up well.

DannyBee
1 day ago
Amusingly, the LLVM implementation was also written by an intern at one point :)

Semi-NCA was actually published back then, just not in the cited paper.

See "Finding dominators in practice", published in 2004, section 2.3. So two decades ago.

The main advantage of Semi-NCA (and why LLVM uses it) is that it makes for a good incremental update algorithm. See https://reviews.llvm.org/D34258

Truthfully, as much as I love Cooper and friends (they were responsible for a lot of very thoughtful, well-engineered algorithms at a time when lots of stuff was just "here's some research I think is better"), the "simple" dataflow-based algorithm was never worth it.
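
For anyone who hasn't seen it, that "simple" iterative scheme (the Cooper–Harvey–Kennedy formulation) can be sketched in a few lines. This is a toy illustration, not code from any of the compilers discussed:

```python
# Toy version of the iterative dominance algorithm: process blocks in
# reverse postorder, intersecting the predecessors' idoms until a fixed
# point. `succ` maps block id -> successor ids; `entry` is the start block.

def dominators(succ, entry=0):
    post, order, visited = {}, [], set()

    def dfs(b):  # postorder numbering
        visited.add(b)
        for s in succ.get(b, []):
            if s not in visited:
                dfs(s)
        post[b] = len(order)
        order.append(b)

    dfs(entry)

    preds = {b: [] for b in visited}
    for b in visited:
        for s in succ.get(b, []):
            preds[s].append(b)

    idom = {entry: entry}

    def intersect(x, y):  # walk both chains up to the common dominator
        while x != y:
            while post[x] < post[y]:
                x = idom[x]
            while post[y] < post[x]:
                y = idom[y]
        return x

    changed = True
    while changed:
        changed = False
        for b in reversed(order):  # reverse postorder
            if b == entry:
                continue
            ps = [p for p in preds[b] if p in idom]
            new = ps[0]
            for p in ps[1:]:
                new = intersect(new, p)
            if idom.get(b) != new:
                idom[b] = new
                changed = True
    return idom

# Diamond CFG: 0 branches to 1 and 2, both rejoin at 3; 0 dominates all.
idom = dominators({0: [1, 2], 1: [3], 2: [3]})
assert idom[1] == idom[2] == idom[3] == 0
```

It is tiny and fast on typical graphs, which is its appeal; the thread's point is that Semi-NCA matches it while also supporting incremental updates.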

Part of the thing I was good at way back then was keeping track of all of the research in compiler opts and which of it was any good - most of their stuff is very good.

I used to go through every paper that came around and keep an up-to-date library of ones worth looking at (both now and in the future) that a bunch of folks used. This was harder back in the days of nobody really publishing code; I used to have to write a ton of prototype implementations to see which numbers were real and which were BS because they compared against crappy implementations or whatever.

Semi-NCA was an obvious win - it was simple enough to implement and test, as fast as what already existed, and could easily be extended to incremental updates.

If you want to see what it takes to do incremental updates with LT, take a look at GCC's dominator update code back around that time period (I think it's unchanged since then, actually, but I haven't looked in a few years). There were a fairly small number of people who could understand the data structures and algorithms involved.

koala_man
1 day ago
> extremely large functions

> quadratic behavior

*High five*

I fixed several of these during my time as a compiler engineer too.

It's true what they say: quadratic is the worst possible time complexity. Fast enough to work on all your test cases, slow enough to explode in prod.
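
A toy illustration of how such a bug typically looks (names are mine, not from the article):

```python
# A classic accidental quadratic: draining a work list from the front.
# list.pop(0) shifts every remaining element, so draining n items costs
# O(n^2) overall; collections.deque makes the identical loop linear.
from collections import deque

def drain_slow(items):
    work = list(items)
    out = []
    while work:
        out.append(work.pop(0))  # O(n) shift per pop -> O(n^2) total
    return out

def drain_fast(items):
    work = deque(items)
    out = []
    while work:
        out.append(work.popleft())  # O(1) per pop -> O(n) total
    return out

# Identical results; only the asymptotics differ.
assert drain_slow(range(1000)) == drain_fast(range(1000)) == list(range(1000))
```

Both versions pass any small test; only the slow one melts down on a million-element input.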

kibwen
1 day ago
What is MIR in this context? On the one hand, given the mention of Cranelift, it seems like it could be referring to the Rust compiler's intermediate representation, but given the context perhaps it's referring to an independent intermediate representation that also just happens to be called MIR?
throwaway17_17
1 day ago
MIR is the compiler intermediate representation used by Ion. I know the IonMonkey docs say the acronym stands for Mid-level Intermediate Representation.
pjmlp
23 hours ago
MIR is quite overloaded; besides that and Rust, it is also the new modern IR model used by LLVM languages going forward.
pshc
1 day ago
In modern times I seldom reach for a linked list... cache-friendly data structures almost always win.
dist1ll
1 day ago
Yup. Almost every DS in my compiler is either an array or a hashmap.
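
One common shape of that style, sketched as toy code (the names are illustrative, not from any real compiler): basic blocks live in one flat array and refer to each other by integer index rather than by pointer-chasing through linked nodes.

```python
# Index-based CFG sketch: blocks are stored contiguously and a "block
# reference" is just an int, which keeps traversals cache-friendly and
# makes the whole graph trivially serializable.
from dataclasses import dataclass, field

@dataclass
class Block:
    instrs: list = field(default_factory=list)
    succs: list = field(default_factory=list)  # indices into Cfg.blocks

@dataclass
class Cfg:
    blocks: list = field(default_factory=list)

    def add_block(self) -> int:
        self.blocks.append(Block())
        return len(self.blocks) - 1  # the handle is just an int

    def add_edge(self, src: int, dst: int):
        self.blocks[src].succs.append(dst)

cfg = Cfg()
entry, then, done = cfg.add_block(), cfg.add_block(), cfg.add_block()
cfg.add_edge(entry, then)
cfg.add_edge(entry, done)
cfg.add_edge(then, done)
assert cfg.blocks[entry].succs == [then, done]
```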
tlb
23 hours ago
> control flow graph contained 132856 basic blocks

That is a stunningly large function. I've looked around the onnxruntime sources and can't find anything like it. The largest C file is under 6000 lines. Does anyone know what function it's referring to?

flohofwoe
21 hours ago
IME Clang/LLVM can be extremely aggressive about inlining. Some of my home computer emulators are (almost) collapsed into a single massive function (the compiler can see all function bodies in my build setup even without LTO).

Also, the CPU emulator tick function, which is a single huge switch statement with several thousand case branches, was actually running into a slow path in Clang a couple of years ago (it took something like 30 seconds to build that one source file). This was fixed at some point, though.

I also seem to remember that Emscripten had a build setting to break up large WASM functions into smaller snippets to work around such quadratic outliers in browser WASM engines.

cjblomqvist
1 day ago
If anyone has a comparison with V8, that would be great!
emmanueloga_
1 day ago
Not sure what kind of comparison you mean, but you can compare desktop browsers with [1].

I just ran it on my mac M2 Max and got:

    (F)irefox 131.0.3
    (E)dge 129.0, V8 12.9.17.8
    (S)afari 18.0 (20619.1.26.31.6)

    Speedometer 3.0
    (F) 25.9  
    (E) 22.3
    (S) 30.9

    JetStream2
    (F) 251.787
    (E) 350.74
    (S) 360.568
Safari seems slightly faster in all benchmarks. I did not run motionmark because it takes forever :-/. The page says JetStream2 is what you want if you want to benchmark wasm.

How this relates to TFA, no idea... it's not really easy to tell which version of SpiderMonkey is running on the installed Firefox.

--

1: https://browserbench.org/

KwanEsq
1 day ago
Spidermonkey just follows Firefox version numbering, so far as I know, and the linked bugs in the article seem to have landed in a mix of the 132 and 133 milestones, so you'll have to wait a couple of release cycles for the full effect.
flohofwoe
21 hours ago
AFAIK Chrome already does the "function by function" compilation approach that's hinted at near the end of the article.
kevingadd
1 day ago
https://arewefastyet.com/ has various comparisons between Firefox and Chrome; the script-oriented ones are basically Spidermonkey vs V8.
zyedidia
1 day ago
Is there any AOT WebAssembly compiler that can compile Wasm used by websites? I tried locally compiling the Photoshop Wasm module mentioned in the article but the compilers I tried (Wasmtime, wasm2c, WAMR) all complained about some unsupported Wasm extension/proposal being required (exceptions seems like the blocker on wasmtime, and the others gave cryptic error messages).

Is it really the case that browsers have default-enabled all sorts of extensions that are not yet widely supported by the rest of the ecosystem?

aseipp
1 day ago
No, I think most of the AOT compilers are in practice a bit behind V8 and/or Spidermonkey for newer features. Realistically, most development driving new WASM features is motivated by website-ish use cases. Exception handling in particular is still not standardized, so I guess it's expected that the browser engines would be the ones to have the most evolving support (and the ability to test it thoroughly), because people want that platform as their inevitable target anyway.
Dylan16807
1 day ago
> Is it really the case that browsers have default-enabled all sorts of extensions that are not yet widely supported by the rest of the ecosystem?

I don't know the answer, but it would be hard to blame them for following normal browser development practices on the standard they created for the purpose of being in browsers.

zyedidia
1 day ago
Fair enough. I think it would be unfortunate if the WebAssembly language in browsers were a significantly different language than WebAssembly outside of browsers (just referring to language itself, not the overall runtime system). I don't think that has quite happened, and the outer ecosystem can probably catch up, but it worries me.
titzer
1 day ago
Non-browser environments are a little behind on the Wasm standard, but not by much. E.g. wasmtime has now landed support for Wasm GC. AFAIK they implement all phase 4 proposals. Wizard implements all the Phase 4 proposals as well. The Wasm 3.0 spec will be out soon, which will be a big milestone to motivate Wasm engines outside the Web to catch up.
pjmlp
22 hours ago
We have had plenty of bytecode formats outside the browser ever since UNCOL was proposed in 1958, including as a replacement for Assembly on microcoded CPUs.

Now we get a couple of startups trying to make WebAssembly outside of the browser happen as if it were a novel idea, never done before.

kevingadd
1 day ago
> Is there any AOT WebAssembly compiler that can compile Wasm used by websites?

This doesn't really make sense, given that the wasm used by websites is going to import a bunch of JS functions as dependencies. You're not going to have those available in any native environment.

> Is it really the case that browsers have default-enabled all sorts of extensions that are not yet widely supported by the rest of the ecosystem?

Yes

Photoshop in particular is a good example of a bleeding edge wasm app - browsers had to relax restrictions on things like function pointer table size in order for it to work. So I wouldn't expect it to build and run anywhere outside of v8 or spidermonkey.

zyedidia
1 day ago
I think one of the interesting aspects of WebAssembly compared to JavaScript is that it can be efficiently AOT-compiled. I've been interested in investigating AOT compilation for a browser (perhaps there is a distant/alternative future where web browsing does not require a JIT), but maybe Wasm AOT compilers aren't really there yet.
kevingadd
1 day ago
Functionally, what browsers do under the hood is AOT compilation, but not in the way that e.g. Wasmtime does it. The following is my knowledge and may no longer be accurate, but this is the sort of model we planned for when designing WASM to begin with:

Browsers do an on-demand 'first tier' compilation for fast startup, and in the background, using threads, they do a high-quality AOT compilation of the whole module. That high-quality AOT compilation is cached and used for future page loads.

It is possible to use a JIT model for this but AFAIK it is typically not done. In some configurations (e.g. lockdown mode) WASM is interpreted instead of AOT compiled.

hencq
1 day ago
> It is possible to use a JIT model for this but AFAIK it is typically not done.

Isn't this what the last line of the article is hinting at?

> our WebAssembly team is making great progress rearchitecting the Wasm compiler pipeline. This work will make it possible to Ion-compile individual Wasm functions as they warm up instead of compiling everything immediately.
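
That warm-up-based tier-up can be modeled with a toy sketch (the threshold, names, and string "tiers" here are stand-ins for illustration, not how SpiderMonkey actually implements it):

```python
# Toy model of warm-up-based tiering: every function starts on a cheap
# baseline tier, and call counts trigger a swap to an optimized tier.

HOT_THRESHOLD = 3  # illustrative warm-up count

class TieredFunction:
    def __init__(self, source):
        self.tier = "baseline"  # fast to produce, slower to run
        self.source = source
        self.calls = 0

    def call(self):
        self.calls += 1
        if self.tier == "baseline" and self.calls >= HOT_THRESHOLD:
            self.tier = "optimized"  # recompile once the function is warm
        return self.tier

f = TieredFunction("(func $hot ...)")
assert [f.call() for _ in range(4)] == [
    "baseline", "baseline", "optimized", "optimized"]
```

The payoff is that cold functions never pay for the expensive compile at all.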

tombl
1 day ago
I believe this is still true. Originally you could store a compiled wasm module in IndexedDB and cache it manually, but that was removed in favor of {instantiate,compile}Streaming, which take an HTTP response and hook into the existing HTTP caching layer, which was already storing the compilation output for JS.
icsa
1 day ago
Moral of the story:

First, use an array. If an array doesn't work, try something else.

kevingadd
1 day ago
It's definitely the case that in many scenarios the right data structure is an array, since your work will be dominated by gets and sets.

However, the OP has one scenario where the opposite was true - they were using a dense bitset that needed to be obscenely huge because of degenerate code in the wasm module, so swapping to a sparse container was a win.

In the end you just have to profile and understand your data.
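
A small sketch of the tradeoff being described (the IDs and sizes are invented for illustration):

```python
# Dense-vs-sparse: a dense bitset pays for the whole ID range, while a
# sparse container pays per member actually present.

members = {7, 1_000_000}  # two live IDs, one with a huge value

# Dense: one bit per possible ID up to the max -> ~125 KB for two set bits.
dense = bytearray(max(members) // 8 + 1)
for i in members:
    dense[i // 8] |= 1 << (i % 8)

def dense_contains(i):
    return bool(dense[i // 8] & (1 << (i % 8)))

# Sparse: storage proportional to membership, not to the ID range.
sparse = set(members)

assert dense_contains(7) and dense_contains(1_000_000) and not dense_contains(8)
assert 7 in sparse and 8 not in sparse
assert len(dense) == 125_001  # bytes spent just to cover the range
```

With dense IDs and high occupancy the bitset wins easily; degenerate inputs flip the balance, which is exactly the "profile and understand your data" point.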

icsa
1 day ago
To your point, profile your data as you would your code.

A sorted array of bit locations would represent a sparse bit set well enough to start, with O(N) storage and O(log N) access. Once the sets became large and/or dense, another data structure could be considered.
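
That suggestion can be sketched with the standard library (a toy version; the class name is mine):

```python
# Sparse bit set as a sorted array of set-bit positions: O(N) storage in
# the number of set bits, O(log N) membership via binary search.
import bisect

class SortedBitSet:
    def __init__(self):
        self.bits = []  # sorted positions of set bits

    def add(self, pos):
        i = bisect.bisect_left(self.bits, pos)
        if i == len(self.bits) or self.bits[i] != pos:
            self.bits.insert(i, pos)  # O(N) insert; fine while sets stay small

    def __contains__(self, pos):
        i = bisect.bisect_left(self.bits, pos)
        return i < len(self.bits) and self.bits[i] == pos

s = SortedBitSet()
for p in (70, 3, 1_000_000, 3):
    s.add(p)
assert s.bits == [3, 70, 1_000_000]
assert 70 in s and 4 not in s
```

Once the set grows large or dense, the O(N) inserts start to hurt and a different structure (a real bitset, a hash set, a B-tree) becomes worth considering, as the parent says.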

mgaudet
1 day ago
It's too bad the title prefix "75x Faster" got dropped.