Ohm: A user-friendly parsing toolkit for JavaScript and TypeScript
142 points | 1 month ago | 18 comments | ohmjs.org | HN
Cadwhisker
28 days ago
[-]
I'm particularly impressed with the visual PEG debugger tool, here: https://ohmjs.org/editor

PEGs are extremely difficult to debug when things go wrong because you get a MASSIVE trace (if you're lucky) to sift through. This tool has rewind/playback controls to dig into what the PEG rules are doing.

Previously, the best I'd seen was cpp-peglib's online parser, here: https://yhirose.github.io/cpp-peglib/

Edit: and now having read through comments here, Chevrotain's PEG tool looks really impressive, although it doesn't use a simple PEG syntax for the input: https://chevrotain.io/playground/

reply
pdubroy
27 days ago
[-]
Thanks for the kind words :)

I also wrote a blog post about some of the decisions that went into the visualizer: https://dubroy.com/blog/visualizing-packrat-parsing/

reply
Cadwhisker
27 days ago
[-]
I spent some time looking at debugging with Chevrotain and your solution is so much easier to use. That's a great blog post; thanks again.
reply
vanderZwan
27 days ago
[-]
BTW, if you're like me and have been wanting to target WASM bytecode directly for ages without any of the heavy existing toolchains, but unable to put the separate pieces of the spec together yourself: the core maintainer of this library is co-authoring a book for that[0]. I decided to take the risk of giving the early access version a try and it's really nice and accessible.

I'm mentioning it because it implements a toy language using Ohm to explain how WebAssembly works (gee wonder why). So it actually includes a mini-Ohm tutorial.

Funny enough, since WASM bytecode is designed to be compact, and speccing out parsers tends to be a more verbose business, the book technically ends up spending more words on implementing the parser than on WASM itself (all with the goal of developing a mental model for WASM though, which it succeeded at for me so this is not a critique).

[0] https://wasmgroundup.com/

reply
nhatcher
27 days ago
[-]
I am jealous of the kids these days learning the theory of parsing. There are so many great resources out there! Ohm in particular looks great: attention to detail, care for the theory. It makes me wish I had a project to try it out on.

I am a big fan of PEG parsers. They do come with their own set of issues and difficulties, but I've always found them great to work with. My go-to tool (also a PEG parser, similar to Ohm) used to be pegjs, now https://peggyjs.org/

When I needed speed, or a more classical take, I would use jison. But today I think I would have to find a good reason not to build a hand-written parser.

reply
vanderZwan
27 days ago
[-]
Yeah, agreed.

I just wish someone would write a "how Marpa parsing works for dummies" article [0][1]. I've been curious about how it actually works ever since I was colleagues with Aria Stewart, who ported it to JS [2] once upon a time and mentioned its existence to me (we worked remotely and in opposite time zones, so I never had a chance to sit down with her for a good old in-person explanation).

[0] https://savage.net.au/Marpa.html

[1] https://jeffreykegler.github.io/personal/timeline_v3

[2] https://github.com/aredridel/lotsawa

reply
fovc
28 days ago
[-]
Great to see this is alive and progressing! I believe Ohm started life in Alan Kay’s research group, to build a graphical OS and office suite in 10k lines of code. I found this talk immensely inspiring https://m.youtube.com/watch?v=ubaX1Smg6pY
reply
pdubroy
27 days ago
[-]
Very close! Alex Warth created OMeta (https://en.wikipedia.org/wiki/OMeta) as part of the STEPS project. Ohm was designed as a kind of successor to OMeta, but was created after STEPS.
reply
agumonkey
27 days ago
[-]
seems like it's his github webpage https://alexwarth.github.io/

and it's full of great stuff

reply
fovc
25 days ago
[-]
Ah thanks for the clarification! Do you happen to know if Nile/Gezira went anywhere?
reply
mhitza
28 days ago
[-]
Can anyone familiar with PEG explain how much it deviates ("inspired by") from "standard" PEG?

I've only used PEG once, because of the built-in support in Guile Scheme, but this library might be a bit more useful, since it's in JS.

reply
pdubroy
27 days ago
[-]
Maintainer of Ohm here :)

The main difference in Ohm grammars vs "standard" PEGs is support for left recursion. It's the same approach that was used in OMeta, which is described in the paper "Packrat parsers can support left recursion": https://web.cs.ucla.edu/~todd/research/pepm08.pdf

The other thing Ohm does that's different from many PEG parsing libraries is a strict separation of syntax and semantics. Many PEG tools mix these two concepts, so you write little snippets of code (semantic actions) alongside the rules.
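The "seed growing" trick from that paper can be sketched in a few lines. This is a toy, self-contained illustration of the idea for a single left-recursive rule, not Ohm's actual implementation: the left-recursive call initially fails via a planted memo entry, and the parser then repeatedly re-applies the rule, keeping the longest match.

```javascript
// Minimal sketch of Warth-style left-recursion support in a packrat parser.
// Toy grammar (left-recursive): expr <- expr "-" num / num ; num <- [0-9]
function parse(input) {
  const memo = new Map(); // "rule:pos" -> {val, pos} or null

  function num(pos) {
    if (pos < input.length && input[pos] >= '0' && input[pos] <= '9') {
      return { val: Number(input[pos]), pos: pos + 1 };
    }
    return null;
  }

  function expr(pos) {
    const key = 'expr:' + pos;
    if (memo.has(key)) return memo.get(key);
    memo.set(key, null);            // seed: the left-recursive call fails at first
    let best = null;
    for (;;) {
      const attempt = exprBody(pos);
      if (attempt === null || (best !== null && attempt.pos <= best.pos)) break;
      best = attempt;
      memo.set(key, best);          // grow the seed and try again
    }
    return best;
  }

  function exprBody(pos) {
    const left = expr(pos);         // answered from the memo table
    if (left !== null && input[left.pos] === '-') {
      const right = num(left.pos + 1);
      if (right !== null) return { val: left.val - right.val, pos: right.pos };
    }
    return num(pos);                // fall back to the non-recursive alternative
  }

  const r = expr(0);
  return r !== null && r.pos === input.length ? r.val : null;
}

// Left associativity falls out correctly: 9-3-2 parses as (9-3)-2
console.log(parse('9-3-2')); // 4
```

Note that left associativity comes for free, which is exactly what makes left recursion attractive for expression grammars in the first place.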

reply
jitl
28 days ago
[-]
The top thing I want to know about a parser/parser-generator library is: will it go fast? After a poor experience with Chevrotain startup performance, I'm more open to parser generators than parsers. I'm very happy to pay for a build step if my compiler will boot up faster (important in the browser) and have better throughput.
reply
kragen
28 days ago
[-]
Ohm is a PEG parser generator. There are theoretical algorithms for making PEG parsers run fast, and people have published some excellent papers on them, but I've never seen them deployed in real software. (Search for PEG automatic cut insertion.) I haven't tried Ohm, but nothing on its home page suggests that it's an exception to the general rule that PEG parsers are very slow.

Probably if what you most want is for your parser to go fast, you should write a predictive recursive descent parser by hand.
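For a sense of what that looks like, here is a minimal hand-written predictive parser for a toy grammar (`expr <- num (("+" | "-") num)*`). "Predictive" means one character of lookahead decides every branch, so there is no backtracking and no memoization overhead:

```javascript
// Hand-written predictive recursive-descent parser/evaluator for
//   expr <- number (("+" | "-") number)*
function evalExpr(input) {
  let pos = 0;

  function peek() { return input[pos]; }

  function number() {
    const start = pos;
    while (pos < input.length && peek() >= '0' && peek() <= '9') pos++;
    if (pos === start) throw new SyntaxError('expected digit at ' + pos);
    return Number(input.slice(start, pos));
  }

  function expr() {
    let value = number();
    // Single-character lookahead picks the branch -- no backtracking needed.
    while (peek() === '+' || peek() === '-') {
      const op = input[pos++];
      const rhs = number();
      value = op === '+' ? value + rhs : value - rhs;
    }
    return value;
  }

  const result = expr();
  if (pos !== input.length) throw new SyntaxError('unexpected input at ' + pos);
  return result;
}

console.log(evalExpr('12+34-5')); // 41
```

Because each function corresponds to one grammar rule and failure points are explicit, this style also makes it straightforward to emit precise error messages.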

reply
pdubroy
27 days ago
[-]
It's worth noting that Python 3 uses a PEG-based parser and its performance (when it was introduced at least) was within 10% of the old parser: https://peps.python.org/pep-0617/
reply
kragen
27 days ago
[-]
In absolute terms, the numbers given in the PEP work out to about a thousand machine instructions per byte of input, for both the old and new parsers. From my point of view, that's pretty slow.

Note that this includes the time for tokenization, which is generally the bulk of the time spent in parsing, and which I believe did not change between the old recursive-descent parser and the new PEG parser. Typically the tokenizer must iterate over some 32 characters per line of input code, which it reduces to some 8 tokens for the benefit of the parser proper, so usually the tokenizer is the bottleneck. However, the performance in this case seems too poor to attribute to the tokenizer.

That PEP also discusses some avoidable inefficiencies in the old parser that accounted for a substantial fraction of its runtime.

One of the biggest advantages of PEGs is that you don't need a separate tokenization pass. But that means you are incurring Packrat's inefficiency per character instead of per token. As I said above, I believe the CPython implementation does not do this.

I've written and worked on PEG parser generators, and I think PEG parsing is the best option in a wide variety of circumstances. But it is rarely the fastest option and usually far from it.

reply
pdubroy
27 days ago
[-]
Yes, you're right that PEG parsing is usually not the fastest option. As with many things, the operative question is: is it fast _enough_?

> Note that this includes the time for tokenization, which is generally the bulk of the time spent in parsing

Hmmm, I've never heard this before and I'm curious if you can point me to any benchmarks or other sources that demonstrate this. If it's true it's surprising to me!

reply
jitl
27 days ago
[-]
In my line of work (writing JavaScript that will run on Android devices) there is no such thing as fast enough. If you’re not going as fast as possible the experience is bad. After all this is a platform where getting data from the local disk is frequently slower than the network because there’s so little CPU & disk bandwidth…
reply
kragen
27 days ago
[-]
Also, if you're computing for 3 milliseconds for every touchmove event instead of 0.3 milliseconds, you'll run the battery down a lot faster. Though not actually ten times faster, because the screen keeps sucking power even when the CPU is asleep.
reply
kragen
27 days ago
[-]
I can't! But if you have a conventional lex/yacc parser handy, it should be easy to test by measuring the time for just lexing a large input file and discarding the tokens. If you do the experiment, I'm interested to hear your results!
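A rough sketch of that experiment, using a stand-in regex tokenizer rather than a lex-generated one (the token classes and the synthesized input are illustrative assumptions, just to show the shape of the measurement):

```javascript
// Time a standalone lexing pass over a large input and discard the tokens.
function lex(source) {
  // whitespace | identifier | number | any single punctuation character
  const tokenRe = /\s+|[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]/y;
  let count = 0;
  let m;
  while ((m = tokenRe.exec(source)) !== null) {
    if (!/^\s/.test(m[0])) count++; // skip whitespace; discard the token itself
  }
  return count;
}

// Synthesize a "large input file" (hypothetical workload)
const source = 'let x1 = foo(bar, 42) + baz;\n'.repeat(100000);

const t0 = performance.now();
const tokens = lex(source);
const t1 = performance.now();
console.log(`${tokens} tokens lexed in ${(t1 - t0).toFixed(1)} ms`);
```

Comparing that number against the full parse time of the same input gives the lexing/parsing split directly.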

I don't believe that software is ever fast enough, but trading off speed is often worthwhile, for example to get composability or readability.

reply
pdubroy
27 days ago
[-]
I found some benchmarks in the ANTLR project: https://github.com/antlr/grammars-v4/blob/master/java/java/B...

  Project          Parsing/Lexing Ratio
  ----------------------------------------
  jdk8               5.34x
  Spring Framework   3.81x
  Elasticsearch      5.76x
  RxJava             5.69x
  JUnit4             4.51x
  Guava              5.37x
  Log4j              2.92x
Obviously this is just one toolkit, one language, and one set of examples. But in these benchmarks, the tokenization (lexing) time is a small fraction of the total parse time.
reply
vanderZwan
27 days ago
[-]
Well, "small" is relative. It's certainly not the bottleneck, but it's still between 17.4% and 34.2% of the total time. That's definitely still in a range where optimizing could have a measurable impact on performance (depending on how much room is left for optimization).

100 / 2.92 ≈ 34.2%

100 / 5.76 ≈ 17.4%

reply
kragen
27 days ago
[-]
Right, but it's clearly not the situation I was describing, where cutting the time of the stages after tokenization to zero would only slightly decrease the total time.
reply
kragen
27 days ago
[-]
Thank you!
reply
skybrian
27 days ago
[-]
This looks very interesting, particularly the visualization tool.

One thing I think it sheds light on is the difference between a simple library and a framework. Many frameworks can be thought of as providing an interpreter that you configure using callbacks. Debugging a high-level program by stepping through the interpreter that's executing it isn't easy. Custom debugging tools make it easier.

reply
jbreckmckye
27 days ago
[-]
For an example use case: the Bruno API client uses Ohm to parse its .bru API spec files.

https://github.com/usebruno/bruno

reply
patrickthebold
28 days ago
[-]
In the past, I always had a distaste for 'stringy' data, preferring more structure (e.g. objects). Once I learned about parsers, I realized you can still have structure in a string.
reply
fjfaase
27 days ago
[-]
Looks quite nice. The online editor with color coding looks sleek. I suppose that they are using a PEG-parser to parse the grammar in the online editor.

Can you also specify color coding for your grammar?

This is a parser generator. In the online editor, even with the example grammar, you can see some delay between changing the grammar and the input being re-parsed. I wrote an interactive back-tracking parser with caching that shows less delay, even with a somewhat larger grammar. Search for 'IParse Studio'.

reply
deostroll
27 days ago
[-]
Has anybody out there tried to convert JS to an AST and back and hit OOM?! (I tried @babel/parser.) I pray this one doesn't disappoint.

Oh, and comments!!! My general observation is that you can't capture the comments when code is generated from an AST. Although esprima has something in the standard to this effect, implementations generally get stuck on weird commenting styles and how to generate code from them... e.g.

var /* hello */ world, foo, /* bar */ s;

reply
incrudible
27 days ago
[-]
I have a rule that says: You are not allowed to use these tools until you have written a recursive descent parser yourself. Parsing manually is not that hard, but giving good errors while parsing automatically is not that easy.
reply
lukasb
28 days ago
[-]
Can anyone compare to Chevrotain?
reply
Cadwhisker
28 days ago
[-]
There's a benchmark tool here: https://chevrotain.io/performance/

Ohm appears to be much slower.

reply
derhuerst
27 days ago
[-]
Remotely related: when I used nearley.js [0], it had a very gentle learning curve. The tooling also looks quite nice.

[0] https://nearley.js.org/

reply
RickS
28 days ago
[-]
Oh hell yeah. Stoked to see something like this in TS. I enjoy making a toy DSL here and there, but it's a lot of friction to relearn Lex and Yacc every time. The editor looks great too, really like the visualizations.

For anybody already using this, what's your workflow like for getting from this to IDE syntax highlighting?

reply
adamb
28 days ago
[-]
I experimented with an Ohm/CodeMirror bridge that would map an Ohm grammar to CodeMirror classes for marks and syntax highlighting.

It might be an interesting starting point for you: https://observablehq.com/@ajbouh/editor

reply
wruza
27 days ago
[-]
The last time I needed a parser with a lex/yacc-style experience, jison https://gerhobbelt.github.io/jison/docs/ worked fine for me.

I don't find PEGs more useful or clear, tbh. Mostly because of how precedence works.

reply
e12e
28 days ago
[-]
I could swear I recently saw an article about a PEG tool that came with LSP support - sadly I can't find it. The closest I can find is pest, which has an ide-tools crate and some plugins for LSP/IDE/editor support:

https://github.com/pest-parser

reply
mdaniel
28 days ago
[-]
I have been on the lookout for something that could help me with Jinja files, and this seems like a plausible candidate. Extra bonus points if I could actually embed the Python in the same grammar, but at this point even having the code blocks segregated from the literal parts is a good start.

https://pest.rs/?g=N4Ig5gTghgtjURALhAMwJYBsCmACAvLsLgMoDyAkr...

reply
gr4vityWall
27 days ago
[-]
My first impression is that the tutorial is exceptionally well-written. This project looks very interesting!
reply
phfrohring
28 days ago
[-]
Great! Worth trying! I've built a parser for the robotic arm language (LCD 350) with it in record time.
reply
pdubroy
27 days ago
[-]
Oh, that sounds cool! Are you on the Ohm Discord? I'd love to hear a bit more about this project :)
reply
doug_life
28 days ago
[-]
Side note, I really wish software would stop using EE terminology to name things. Ohm, Capacitor, Electron, etc. It muddies up search results for no good reason. Is there a reason this has become a trend lately?
reply
kragen
28 days ago
[-]
Ohm is presumably called that because it's based on OMeta, which was called that because it was a successor to META-II that had inheritance, like OO languages.

More generally, I suspect software programmers have EE envy because electrical engineers build things that actually do work, and their designs get better over time.

reply