Ohm: A user-friendly parsing toolkit for JavaScript and TypeScript
142 points | 1 month ago | 18 comments | ohmjs.org | HN
Cadwhisker
28 days ago
[-]
I'm particularly impressed with the visual PEG debugger tool, here: https://ohmjs.org/editor

PEGs are extremely difficult to debug when things go wrong because you get a MASSIVE trace (if you're lucky) to sift through. This tool has rewind/playback controls to dig into what the PEG rules are doing.

Previously, the best I'd seen was cpp-peglib's online parser, here: https://yhirose.github.io/cpp-peglib/

Edit: and now having read through comments here, Chevrotain's PEG tool looks really impressive, although it doesn't use a simple PEG syntax for the input: https://chevrotain.io/playground/

reply
pdubroy
27 days ago
[-]
Thanks for the kind words :)

I also wrote a blog post about some of the decisions that went into the visualizer: https://dubroy.com/blog/visualizing-packrat-parsing/

reply
Cadwhisker
27 days ago
[-]
I spent some time looking at debugging with Chevrotain and your solution is so much easier to use. That's a great blog post; thanks again.
reply
vanderZwan
27 days ago
[-]
BTW, if you're like me and have been wanting to target WASM bytecode directly for ages without any of the heavy existing toolchains, but unable to put the separate pieces of the spec together yourself: the core maintainer of this library is co-authoring a book for that[0]. I decided to take the risk of giving the early access version a try and it's really nice and accessible.

I'm mentioning it because it implements a toy language using Ohm to explain how WebAssembly works (gee wonder why). So it actually includes a mini-Ohm tutorial.

Funny enough, since WASM bytecode is designed to be compact, and speccing out parsers tends to be a more verbose business, the book technically ends up spending more words on implementing the parser than on WASM itself (all with the goal of developing a mental model for WASM though, which it succeeded at for me so this is not a critique).

[0] https://wasmgroundup.com/

reply
nhatcher
27 days ago
[-]
I am jealous of the kids these days learning the theory of parsing. There are so many great resources out there! Ohm in particular looks great: attention to detail, care for the theory. It makes me wish I had a project to try it out on.

I am a big fan of PEG parsers. They do come with their own set of issues and difficulties, but I've always found them great to work with. My go-to tool (also a PEG parser, similar to Ohm) used to be pegjs, now https://peggyjs.org/

When I needed speed, or a more classical take, I would use jison. But today I think I would have to find a good reason not to build a hand-written parser.

reply
vanderZwan
27 days ago
[-]
Yeah, agreed.

I just wish someone would write a "how Marpa parsing works for dummies" article [0][1]. I've been curious about how it actually works ever since I was colleagues with Aria Stewart, who ported it to JS [2] once upon a time and mentioned its existence to me (we worked remotely and in opposite time zones, so I never had a chance to sit down with her for a good old in-person explanation).

[0] https://savage.net.au/Marpa.html

[1] https://jeffreykegler.github.io/personal/timeline_v3

[2] https://github.com/aredridel/lotsawa

reply
fovc
28 days ago
[-]
Great to see this is alive and progressing! I believe Ohm started life in Alan Kay’s research group, to build a graphical OS and office suite in 10k lines of code. I found this talk immensely inspiring https://m.youtube.com/watch?v=ubaX1Smg6pY
reply
pdubroy
27 days ago
[-]
Very close! Alex Warth created OMeta (https://en.wikipedia.org/wiki/OMeta) as part of the STEPS project. Ohm was designed as a kind of successor to OMeta, but was created after STEPS.
reply
agumonkey
27 days ago
[-]
seems like it's his github webpage https://alexwarth.github.io/

and it's full of great stuff

reply
fovc
25 days ago
[-]
Ah thanks for the clarification! Do you happen to know if Nile/Gezira went anywhere?
reply
mhitza
28 days ago
[-]
Can anyone familiar with PEG explain how much it deviates ("inspired by") from "standard" PEG?

I've only used PEG once, because of the built-in support in Guile Scheme, but this library might be a bit more useful, since it's in JS.

reply
pdubroy
27 days ago
[-]
Maintainer of Ohm here :)

The main difference in Ohm grammars vs "standard" PEGs is support for left recursion. It's the same approach that was used in OMeta, which is described in the paper "Packrat parsers can support left recursion": https://web.cs.ucla.edu/~todd/research/pepm08.pdf

The other thing Ohm does that's different from many PEG parsing libraries is a strict separation of syntax and semantics. Many PEG tools mix these two concepts, so you write little snippets of code (semantic actions) alongside the rules.
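The "seed growing" trick from that paper can be sketched in a few lines. This is a toy, self-contained illustration of the idea for a single left-recursive rule, not Ohm's actual implementation: the left-recursive call initially fails via a planted memo entry, and the parser then repeatedly re-applies the rule, keeping the longest match.

```javascript
// Minimal sketch of Warth-style left-recursion support in a packrat parser.
// Toy grammar (left-recursive): expr <- expr "-" num / num ; num <- [0-9]
function parse(input) {
  const memo = new Map(); // "rule:pos" -> {val, pos} or null

  function num(pos) {
    if (pos < input.length && input[pos] >= '0' && input[pos] <= '9') {
      return { val: Number(input[pos]), pos: pos + 1 };
    }
    return null;
  }

  function expr(pos) {
    const key = 'expr:' + pos;
    if (memo.has(key)) return memo.get(key);
    memo.set(key, null);            // seed: the left-recursive call fails at first
    let best = null;
    for (;;) {
      const attempt = exprBody(pos);
      if (attempt === null || (best !== null && attempt.pos <= best.pos)) break;
      best = attempt;
      memo.set(key, best);          // grow the seed and try again
    }
    return best;
  }

  function exprBody(pos) {
    const left = expr(pos);         // answered from the memo table
    if (left !== null && input[left.pos] === '-') {
      const right = num(left.pos + 1);
      if (right !== null) return { val: left.val - right.val, pos: right.pos };
    }
    return num(pos);                // fall back to the non-recursive alternative
  }

  const r = expr(0);
  return r !== null && r.pos === input.length ? r.val : null;
}

// Left associativity falls out correctly: 9-3-2 parses as (9-3)-2
console.log(parse('9-3-2')); // 4
```

Note that left associativity comes for free, which is exactly what makes left recursion attractive for expression grammars in the first place.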

reply
jitl
28 days ago
[-]
The top thing I want to know about a parser/parser-generator library is: will it go fast? After a poor experience with Chevrotain startup performance, I'm more open to parser generators than parsers. I'm very happy to pay for a build step if my compiler will boot up faster (important in the browser) and have better throughput.
reply
kragen
28 days ago
[-]
Ohm is a PEG parser generator. There are theoretical algorithms for making PEG parsers run fast, and people have published some excellent papers on them, but I've never seen them deployed in real software. (Search for PEG automatic cut insertion.) I haven't tried Ohm, but nothing on its home page suggests that it's an exception to the general rule that PEG parsers are very slow.

Probably if what you most want is for your parser to go fast, you should write a predictive recursive descent parser by hand.
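For a sense of what that looks like, here is a minimal hand-written predictive parser for a toy grammar (`expr <- num (("+" | "-") num)*`). "Predictive" means one character of lookahead decides every branch, so there is no backtracking and no memoization overhead:

```javascript
// Hand-written predictive recursive-descent parser/evaluator for
//   expr <- number (("+" | "-") number)*
function evalExpr(input) {
  let pos = 0;

  function peek() { return input[pos]; }

  function number() {
    const start = pos;
    while (pos < input.length && peek() >= '0' && peek() <= '9') pos++;
    if (pos === start) throw new SyntaxError('expected digit at ' + pos);
    return Number(input.slice(start, pos));
  }

  function expr() {
    let value = number();
    // Single-character lookahead picks the branch -- no backtracking needed.
    while (peek() === '+' || peek() === '-') {
      const op = input[pos++];
      const rhs = number();
      value = op === '+' ? value + rhs : value - rhs;
    }
    return value;
  }

  const result = expr();
  if (pos !== input.length) throw new SyntaxError('unexpected input at ' + pos);
  return result;
}

console.log(evalExpr('12+34-5')); // 41
```

Because each function corresponds to one grammar rule and failure points are explicit, this style also makes it straightforward to emit precise error messages.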

reply
pdubroy
27 days ago
[-]
It's worth noting that Python 3 uses a PEG-based parser and its performance (when it was introduced at least) was within 10% of the old parser: https://peps.python.org/pep-0617/
reply
kragen
27 days ago
[-]
In absolute terms, the numbers given in the PEP work out to about a thousand machine instructions per byte of input, for both the old and new parsers. From my point of view, that's pretty slow.

Note that this includes the time for tokenization, which is generally the bulk of the time spent in parsing, and which I believe did not change between the old recursive-descent parser and the new PEG parser. Typically the tokenizer must iterate over some 32 characters per line of input code, which it reduces to some 8 tokens for the benefit of the parser proper, so usually the tokenizer is the bottleneck. However, the performance in this case seems too poor to attribute to the tokenizer.

That PEP also discusses some avoidable inefficiencies in the old parser that accounted for a substantial fraction of its runtime.

One of the biggest advantages of PEGs is that you don't need a separate tokenization pass. But that means you are incurring Packrat's inefficiency per character instead of per token. As I said above, I believe the CPython implementation does not do this.

I've written and worked on PEG parser generators, and I think PEG parsing is the best option in a wide variety of circumstances. But it is rarely the fastest option and usually far from it.

reply
pdubroy
27 days ago
[-]
Yes, you're right that PEG parsing is usually not the fastest option. As with many things, the operative question is: is it fast _enough_?

> Note that this includes the time for tokenization, which is generally the bulk of the time spent in parsing

Hmmm, I've never heard this before and I'm curious if you can point me to any benchmarks or other sources that demonstrate this. If it's true it's surprising to me!

reply
jitl
27 days ago
[-]
In my line of work (writing JavaScript that will run on Android devices) there is no such thing as fast enough. If you’re not going as fast as possible the experience is bad. After all this is a platform where getting data from the local disk is frequently slower than the network because there’s so little CPU & disk bandwidth…
reply
kragen
27 days ago
[-]
Also, if you're computing for 3 milliseconds for every touchmove event instead of 0.3 milliseconds, you'll run the battery down a lot faster. Though not actually ten times faster, because the screen keeps sucking power even when the CPU is asleep.
reply
kragen
27 days ago
[-]
I can't! But if you have a conventional lex/yacc parser handy, it should be easy to test by measuring the time for just lexing a large input file and discarding the tokens. If you do the experiment, I'm interested to hear your results!
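A rough sketch of that experiment, using a stand-in regex tokenizer rather than a lex-generated one (the token classes and the synthesized input are illustrative assumptions, just to show the shape of the measurement):

```javascript
// Time a standalone lexing pass over a large input and discard the tokens.
function lex(source) {
  // whitespace | identifier | number | any single punctuation character
  const tokenRe = /\s+|[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]/y;
  let count = 0;
  let m;
  while ((m = tokenRe.exec(source)) !== null) {
    if (!/^\s/.test(m[0])) count++; // skip whitespace; discard the token itself
  }
  return count;
}

// Synthesize a "large input file" (hypothetical workload)
const source = 'let x1 = foo(bar, 42) + baz;\n'.repeat(100000);

const t0 = performance.now();
const tokens = lex(source);
const t1 = performance.now();
console.log(`${tokens} tokens lexed in ${(t1 - t0).toFixed(1)} ms`);
```

Comparing that number against the full parse time of the same input gives the lexing/parsing split directly.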

I don't believe that software is ever fast enough, but trading off speed is often worthwhile, for example to get composability or readability.

reply
pdubroy
27 days ago
[-]
I found some benchmarks in the ANTLR project: https://github.com/antlr/grammars-v4/blob/master/java/java/B...

  Project          Parsing/Lexing Ratio
  ----------------------------------------
  jdk8               5.34x
  Spring Framework   3.81x
  Elasticsearch      5.76x
  RxJava             5.69x
  JUnit4             4.51x
  Guava              5.37x
  Log4j              2.92x
Obviously this is just one toolkit, one language, and one set of examples. But in these benchmarks, the tokenization (lexing) time is a small fraction of the total parse time.
reply
vanderZwan
27 days ago
[-]
Well, "small" is relative. It's certainly not the bottleneck, but it's still between 17.4% and 34.2% of the total time. That's definitely still in a range where optimizing could have a measurable impact on performance (depending on how much room is left for optimization).

100 / 2.92 ≈ 34.2%

100 / 5.76 ≈ 17.4%

reply
kragen
27 days ago
[-]
Right, but it's clearly not the situation I was describing, where cutting the time of the stages after tokenization to zero would only slightly decrease the total time.
reply
kragen
27 days ago
[-]
Thank you!
reply
skybrian
27 days ago
[-]
This looks very interesting, particularly the visualization tool.

One thing I think it sheds light on is the difference between a simple library and a framework. Many frameworks can be thought of as providing an interpreter that you configure using callbacks. Debugging a high-level program by stepping through the interpreter that's executing it isn't easy. Custom debugging tools make it easier.

reply
jbreckmckye
27 days ago
[-]
For an example use case: the Bruno API client uses Ohm to parse its .bru API spec files.

https://github.com/usebruno/bruno

reply
patrickthebold
28 days ago
[-]
In the past, I always had a distaste for 'stringy' data, preferring more structure (e.g. objects). Once I learned about parsers, I realized you can still have structure in a string.
reply
fjfaase
27 days ago
[-]
Looks quite nice. The online editor with color coding looks sleek. I suppose that they are using a PEG-parser to parse the grammar in the online editor.

Can you also specify color coding for your grammar?

This is a parser generator. In the online editor, even with the example grammar, you can see some delay between changing the grammar and the input being re-parsed. I wrote an interactive back-tracking parser with caching that shows less delay, even with a somewhat larger grammar. Search for 'IParse Studio'.

reply
deostroll
27 days ago
[-]
Has anybody out there tried to convert JS to an AST and back and hit OOM?! (I tried @babel/parser.) I pray this one doesn't disappoint.

Oh, and comments!!! My general observation is that you can't capture the comments when code is generated from an AST. Although esprima has something in the standard to this effect, implementations generally get stuck on weird commenting styles and how to generate code from them... e.g.

var /* hello */ world, foo, /* bar */ s;

reply
incrudible
27 days ago
[-]
I have a rule that says: You are not allowed to use these tools until you have written a recursive descent parser yourself. Parsing manually is not that hard, but giving good errors while parsing automatically is not that easy.
reply
lukasb
28 days ago
[-]
Can anyone compare to Chevrotain?
reply
Cadwhisker
28 days ago
[-]
There's a benchmark tool here: https://chevrotain.io/performance/

Ohm appears to be much slower.

reply
derhuerst
27 days ago
[-]
Remotely related: when I used nearley.js [0], it had a very gentle learning curve. The tooling also looks quite nice.

[0] https://nearley.js.org/

reply
RickS
28 days ago
[-]
Oh hell yeah. Stoked to see something like this in TS. I enjoy making a toy DSL here and there, but it's a lot of friction to relearn Lex and Yacc every time. The editor looks great too, really like the visualizations.

For anybody already using this, what's your workflow like for getting from this to IDE syntax highlighting?

reply
adamb
28 days ago
[-]
I experimented with an Ohm/CodeMirror bridge that would map an Ohm grammar to CodeMirror classes for marks and syntax highlighting.

It might be an interesting starting point for you: https://observablehq.com/@ajbouh/editor

reply
wruza
27 days ago
[-]
The last time I needed a parser with a lex/yacc-style experience, jison https://gerhobbelt.github.io/jison/docs/ worked fine for me.

I don't find PEGs more useful or clear, tbh. Mostly because of how precedence works.

reply
e12e
28 days ago
[-]
I could swear I recently saw an article about a PEG tool that came with LSP support - sadly I can't find it. The closest I can find is pest, which has an ide-tools crate and some plugins for LSP/IDE/editor support:

https://github.com/pest-parser

reply
mdaniel
28 days ago
[-]
I have been on the lookout for something that could help me with Jinja files, and this seems like a plausible candidate. Extra bonus points if I could actually embed the Python in the same grammar, but at this point even having the code blocks segregated from the literal parts is a good start.

https://pest.rs/?g=N4Ig5gTghgtjURALhAMwJYBsCmACAvLsLgMoDyAkr...

reply
gr4vityWall
27 days ago
[-]
My first impression is that the tutorial is exceptionally well-written. This project looks very interesting!
reply
phfrohring
28 days ago
[-]
Great! Worth trying! I've built a parser for the robotic arm language (LCD 350) with it in record time.
reply
pdubroy
27 days ago
[-]
Oh, that sounds cool! Are you on the Ohm Discord? I'd love to hear a bit more about this project :)
reply
doug_life
28 days ago
[-]
Side note, I really wish software would stop using EE terminology to name things. Ohm, Capacitor, Electron, etc. It muddies up search results for no good reason. Is there a reason this has become a trend lately?
reply
kragen
28 days ago
[-]
Ohm is presumably called that because it's based on OMeta, which was called that because it was a successor to META-II that had inheritance, like OO languages.

More generally, I suspect software programmers have EE envy because electrical engineers build things that actually do work, and their designs get better over time.

reply