PEGs are extremely difficult to debug when things go wrong because you get a MASSIVE trace (if you're lucky) to sift through. This tool has rewind/playback controls to dig into what the PEG rules are doing.
Previously, the best I'd seen was cpp-peglib's online parser, here: https://yhirose.github.io/cpp-peglib/
Edit: and now, having read through the comments here, Chevrotain's playground looks really impressive, although it doesn't use a simple PEG syntax for the input: https://chevrotain.io/playground/
I also wrote a blog post about some of the decisions that went into the visualizer: https://dubroy.com/blog/visualizing-packrat-parsing/
I'm mentioning it because it implements a toy language using Ohm to explain how WebAssembly works (gee wonder why). So it actually includes a mini-Ohm tutorial.
Funnily enough, since WASM bytecode is designed to be compact, and speccing out parsers tends to be a more verbose business, the book technically ends up spending more words on implementing the parser than on WASM itself. (It's all in service of developing a mental model for WASM, though, which it succeeded at for me, so this is not a critique.)
I am a big fan of PEG parsers. They do come with their own set of issues and difficulties, but I've always found them great to work with. My go-to tool (also a PEG parser, similar to Ohm) used to be PEG.js, now Peggy: https://peggyjs.org/
When I needed speed, or a more classical take, I would use jison. But I think today I would have to find a good reason not to build a hand-written parser.
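For anyone who hasn't used Peggy, here's a minimal sketch (the grammar is just a toy illustration, not from any real project). Note how the semantic actions are little snippets of JS written inline, right next to the rules:

    // npm install peggy
    const peggy = require("peggy");

    // Tiny arithmetic grammar; the { ... } blocks are semantic actions.
    const parser = peggy.generate(`
      Expression
        = head:Term tail:(_ "+" _ Term)* {
            return tail.reduce((sum, t) => sum + t[3], head);
          }

      Term
        = digits:[0-9]+ { return parseInt(digits.join(""), 10); }

      _ = " "*
    `);

    console.log(parser.parse("1 + 2 + 3")); // 6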
I just wish someone would write a "how Marpa parsing works for dummies" article [0]. I've been curious about how it actually works ever since I was colleagues with Aria Stewart, who ported it to JS once upon a time and mentioned its existence to me (we worked remotely and in opposite time zones, so I never had a chance to sit down with her for a good old in-person explanation).
[0] https://savage.net.au/Marpa.html (it's full of great stuff)
I've only used PEG once, because of the built-in support in Guile Scheme, but this library might be a bit more useful, as it's in JS.
The main difference between Ohm grammars and "standard" PEGs is support for left recursion. It's the same approach that was used in OMeta, which is described in the paper "Packrat parsers can support left recursion": https://web.cs.ucla.edu/~todd/research/pepm08.pdf
The other thing Ohm does that's different from many PEG parsing libraries is a strict separation of syntax and semantics. Many PEG tools mix these two concepts, so you write little snippets of code (semantic actions) alongside the rules.
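To make both points concrete, here's a rough ohm-js sketch (the grammar and names are my own toy example): a left-recursive rule, which a classical PEG would reject, with the semantics defined separately from the grammar:

    // npm install ohm-js
    const ohm = require("ohm-js");

    // Left recursion (Exp = Exp "+" ...) is fine in Ohm, unlike in most
    // PEG implementations.
    const g = ohm.grammar(`
      Arith {
        Exp
          = Exp "+" num  -- plus
          | num
        num = digit+
      }
    `);

    // The semantics live in a separate object; the grammar stays pure.
    const semantics = g.createSemantics().addOperation("eval", {
      Exp_plus(left, _op, right) { return left.eval() + right.eval(); },
      num(_digits) { return parseInt(this.sourceString, 10); },
    });

    console.log(semantics(g.match("1+2+3")).eval()); // 6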
If what you most want is for your parser to go fast, you should probably write a predictive recursive-descent parser by hand.
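For concreteness, "predictive" here means the parser decides what to do by peeking at the next token, so there's no backtracking and no memo table. A toy hand-written sketch (my own names, same tiny arithmetic language as above):

    // Toy predictive recursive-descent parser over a token array:
    // one token of lookahead, no backtracking.
    function parseExp(tokens, pos = { i: 0 }) {
      let value = parseNum(tokens, pos);
      while (tokens[pos.i] === "+") { // peek to decide whether to continue
        pos.i++;                      // consume "+"
        value += parseNum(tokens, pos);
      }
      return value;
    }

    function parseNum(tokens, pos) {
      const tok = tokens[pos.i++];
      if (!/^[0-9]+$/.test(tok)) {
        throw new SyntaxError(`expected number, got ${tok}`);
      }
      return parseInt(tok, 10);
    }

    console.log(parseExp(["1", "+", "2", "+", "3"])); // 6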
Note that this includes the time for tokenization, which is generally the bulk of the time spent in parsing, and which I believe did not change between the old recursive-descent parser and the new PEG parser. Typically the tokenizer must iterate over some 32 characters per line of input code, which it reduces to some 8 tokens for the benefit of the parser proper, so usually the tokenizer is the bottleneck. However, the performance in this case seems too poor to attribute to the tokenizer.
That PEP (617, the new CPython PEG parser) also discusses some avoidable inefficiencies in the old parser that accounted for a substantial fraction of its runtime.
One of the biggest advantages of PEGs is that you don't need a separate tokenization pass. But that means you are incurring Packrat's inefficiency per character instead of per token. As I said above, I believe the CPython implementation does not do this.
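To illustrate the cost: packrat memoization is essentially a table keyed by (rule, position). A toy sketch of the general technique (not what CPython does; all names here are mine):

    // Minimal packrat core: cache each rule application by (rule, position).
    function packrat() {
      const memo = new Map();
      return function apply(name, rule, pos) {
        const key = name + "@" + pos;
        if (!memo.has(key)) memo.set(key, rule(pos)); // null = failure
        return memo.get(key);
      };
    }

    // With scannerless parsing, `pos` is a character offset, so there are
    // ~4x as many positions to memoize (32 chars -> 8 tokens, as above)
    // than if `pos` were an index into a token array.
    const input = "aaab";
    const apply = packrat();
    const letterA = (pos) => (input[pos] === "a" ? { next: pos + 1 } : null);

    console.log(apply("letterA", letterA, 0)); // { next: 1 }
    console.log(apply("letterA", letterA, 0)); // answered from the memo table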
I've written and worked on PEG parser generators, and I think PEG parsing is the best option in a wide variety of circumstances. But it is rarely the fastest option and usually far from it.
> Note that this includes the time for tokenization, which is generally the bulk of the time spent in parsing
Hmmm, I've never heard this before, and I'm curious if you can point me to any benchmarks or other sources that demonstrate it. If it's true, it's surprising to me!
I don't believe that software is ever fast enough, but trading off speed is often worthwhile, for example to get composability or readability.
Project           Parsing/Lexing Ratio
--------------------------------------
jdk8              5.34x
Spring Framework  3.81x
Elasticsearch     5.76x
RxJava            5.69x
JUnit4            4.51x
Guava             5.37x
Log4j             2.92x
Obviously this is just one toolkit, one language, and one set of examples. But in these benchmarks, the tokenization (lexing) time is a small fraction of the total parse time:

100/5.76 = 17.4%
100/2.92 = 34.2%

i.e., lexing takes between roughly 17% and 34% as long as parsing proper across these projects.
One thing I think it sheds light on is the difference between a simple library and a framework. Many frameworks can be thought of as providing an interpreter that you configure using callbacks. Debugging a high-level program by stepping through the interpreter that's executing it isn't easy. Custom debugging tools make it easier.
Can you also specify color coding for your grammar?
This is a parser generator. In the online editor, even with the example grammar, you can see some delay between changing the grammar and the input being re-parsed. I wrote an interactive back-tracking parser with caching that does not show this delay, even with a somewhat larger grammar. Search for 'IParse Studio'.
Oh, and comments!!! My general observation is that one cannot capture the comments when code is generated from an AST. Although Esprima has something in the standard to this effect, implementations generally get stuck on weird commenting styles and on how to generate code from them. For example:
var /* hello */ world, foo, /* bar */ s;
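For what it's worth, capturing the comments at parse time is the easy half; the hard part is deciding where they belong when regenerating code from the AST. A sketch with Esprima (output shape abbreviated):

    // npm install esprima
    const esprima = require("esprima");

    // { comment: true } collects the comments alongside the AST.
    const ast = esprima.parseScript("var /* hello */ world, foo, /* bar */ s;", {
      comment: true,
    });
    console.log(ast.comments);
    // e.g. [{ type: 'Block', value: ' hello ' }, { type: 'Block', value: ' bar ' }]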
Ohm appears to be much slower.
For anybody already using this, what's your workflow like for getting from this to IDE syntax highlighting?
It might be an interesting starting point for you: https://observablehq.com/@ajbouh/editor
I don’t find PEGs more useful or clear, tbh, mostly because of how precedence works.
https://pest.rs/?g=N4Ig5gTghgtjURALhAMwJYBsCmACAvLsLgMoDyAkr...
More generally, I suspect software programmers have EE envy because electrical engineers build things that actually do work, and their designs get better over time.