Not to step on anyone's toes, but I just don't feel that formal grammar theory is that important in practice. :^)
It was probably decent when all you had was something like Pascal and you wanted to write a C compiler.
Parsing, compiling, interpreting, etc. are all much more at home in functional languages, and much easier to understand there. Once you understand them, you can translate back into imperative style.
For parsing: by default you should be using parser combinators.
Btw, since I mentioned parser combinators: those are basically just a front-end, similar to regular expressions. The implementation can be all kinds of things, e.g. recursive descent, a table, or backtracking. (Even finite automata, if your combinators are suitably restricted.)
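To make the "front-end" point concrete, here's a minimal sketch of what a combinator library might look like, assuming Python. All the names (`token`, `seq`, `alt`, `many`) are made up for illustration; this particular implementation happens to be backtracking recursive descent, but the same interface could sit on top of a table-driven or automaton-based engine.

```python
# Minimal parser-combinator sketch (illustrative names, not a real library).
# A parser is a function: (string, position) -> (value, new position), or None on failure.

def token(t):
    """Match the literal string t."""
    def parse(s, i):
        return (t, i + len(t)) if s.startswith(t, i) else None
    return parse

def seq(*parsers):
    """Match all parsers in order, collecting their values."""
    def parse(s, i):
        values = []
        for p in parsers:
            r = p(s, i)
            if r is None:
                return None
            v, i = r
            values.append(v)
        return (values, i)
    return parse

def alt(*parsers):
    """Try each parser in turn; first success wins (backtracking)."""
    def parse(s, i):
        for p in parsers:
            r = p(s, i)
            if r is not None:
                return r
        return None
    return parse

def many(p):
    """Match p zero or more times."""
    def parse(s, i):
        values = []
        while True:
            r = p(s, i)
            if r is None:
                return (values, i)
            v, i = r
            values.append(v)
    return parse

# "a", then "b", then zero or more of "c" or "d"
grammar = seq(token("a"), token("b"), many(alt(token("c"), token("d"))))
print(grammar("abcdc", 0))  # (['a', 'b', ['c', 'd', 'c']], 5)
```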
In the end, all the hard work in a compiler is in the back-end optimization phases. Put your mental energy there.
It does make me wonder, though, why grammars have to be so complicated that such high-powered tools are needed. Isn't the gist of LR/LALR that the states of an automaton that can parse CFGs can be serialised to strings, and the set of those strings forms a regular language? Once you have that, many desirable "infinitary" properties of a parsing automaton can be automatically checked in finite time. LR and LALR fall out of this, in some way.
Exactly this! A thousand times this!
NFA/DFA/derivative models are mostly unnecessary because ultimately, REG is just DSPACE(O(1)). That's it. Thinking in any other way is confusing the map with the territory. Furthermore, REG is extremely robust, because we also have REG = DSPACE(o(log log n)) = NSPACE(o(log log n)) = 1-DSPACE(o(log n)). For help with the notation, see here: https://en.wikipedia.org/wiki/DSPACE
The symbols "·" and "|" don't mean anything - I've chosen them to be visually intuitive: The "|" is supposed to look like a physical divider. Also, bracketed expressions "(...)" or "{...}" should be parsed first.
Wikipedia mentions that a variant of this got used in FORTRAN I. You could also speed up my naive O(n^2) approach by using Cartesian trees, which you can build using something suspiciously resembling precedence climbing.
Out of curiosity, what do you mean by this? Do you mean you like the prose, or the typesetting, or...?
The video is 3 hours long though, and I'm not sure the text he shows is available.
At this point he's talking about left leaning vs right leaning trees, after having already talked about one of them: https://youtu.be/fIPO4G42wYE?t=2256&si=aanthLGe-q8ntZez
Can confirm; yes it was helpful! I've never thought seriously about parsing and I've read occasionally (casually) about Pratt parsing, but this is the first time it seemed like an intuitive idea I'll remember.
(Then I confused myself by following some references and remembering the term "precedence climbing" and reading e.g. https://www.engr.mun.ca/~theo/Misc/pratt_parsing.htm by the person who coined that term, but nevermind — the original post here has still given me an idea I think I'll remember.)
Need to parse * before +? Begin at parse_add, have it call parse_mul for its left and right operands, and so on.
parse_mul() {
    left = parse_literal()
    while (is_mul_token()) { // left associative
        consume_token()
        right = parse_literal()
        left = make_mul_node(left, right)
    }
    return left
}

parse_add() {
    left = parse_mul()
    while (is_add_token()) { // left associative
        consume_token()
        right = parse_mul()
        left = make_add_node(left, right)
    }
    return left
}
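The sketch above can be made runnable with very little extra machinery. Here's one way to do it, assuming Python; the token list and the tuple AST nodes (standing in for make_mul_node/make_add_node) are my own choices for illustration.

```python
# Runnable version of the parse_add/parse_mul sketch above.
# Tokens are a list like ["1", "+", "2", "*", "3"]; tuples stand in for AST nodes.

def parse_expr(tokens):
    pos = [0]  # mutable cursor shared by the nested helpers

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def consume():
        t = peek()
        pos[0] += 1
        return t

    def parse_literal():
        return consume()  # assume any non-operator token is a literal

    def parse_mul():
        left = parse_literal()
        while peek() == "*":        # left associative
            consume()
            right = parse_literal()
            left = ("*", left, right)
        return left

    def parse_add():
        left = parse_mul()
        while peek() == "+":        # left associative
            consume()
            right = parse_mul()
            left = ("+", left, right)
        return left

    return parse_add()

print(parse_expr(["1", "+", "2", "*", "3"]))  # ('+', '1', ('*', '2', '3'))
print(parse_expr(["1", "+", "2", "+", "3"]))  # ('+', ('+', '1', '2'), '3')
```

Note how parse_add calling parse_mul for its operands is exactly what makes * bind tighter than +, and how the while loop is what makes each level left associative.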
Then just add more functions as you climb up the precedence levels.

Consider if you had functions called parse_user_ops_precedence_1, parse_user_ops_precedence_2, etc. These would simply take a table of user-defined operators as an argument (or reference some shared/global state), and participate in the same recursive callstack as all your other parsing functions.
parse_left_to_right(with, is_token) {
    left = with()
    while (is_token()) {
        operator = consume_token()
        right = with()
        left = operate(left, right, operator)
    }
    ret left;
}
p0() { ret lex digit or ident; };
p1() { ret parse_left_to_right(p0, is_mul); };
p2() { ret parse_left_to_right(p1, is_add); };
... and so on for all operators.

There's also the "shunting yard" algorithm, which is basically the iterative version of these algorithms (instead of recursive). It is usually presented with insufficient error checking, so it accepts invalid input, but there's actually no reason you have to do it like that.
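Here's a sketch of shunting yard for the same two operators, with the error checking that's usually left out. This is my own formulation, assuming Python; it produces the same tuple AST as the recursive versions above, and a simple expect-operand/expect-operator state machine is what rejects inputs like "1 +" or "+ 1".

```python
# Shunting-yard sketch for literals and left-associative + and *,
# with basic error checking so invalid input is rejected, not silently accepted.

PRECEDENCE = {"+": 1, "*": 2}

def shunting_yard(tokens):
    operands = []   # output stack of AST nodes (tuples)
    operators = []  # pending operator stack

    def reduce_top():
        # Pop one operator and two operands, push the combined node.
        if len(operands) < 2:
            raise ValueError("operator is missing an operand")
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    expect_operand = True  # catches "1 + + 2", "+ 1", etc.
    for t in tokens:
        if t in PRECEDENCE:
            if expect_operand:
                raise ValueError(f"expected operand, got {t!r}")
            # Pop while the stack top binds at least as tightly (left associativity).
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[t]:
                reduce_top()
            operators.append(t)
            expect_operand = True
        else:
            if not expect_operand:
                raise ValueError(f"expected operator, got {t!r}")
            operands.append(t)
            expect_operand = False

    if expect_operand:
        raise ValueError("input ends with a dangling operator (or is empty)")
    while operators:
        reduce_top()
    return operands[0]

print(shunting_yard(["1", "+", "2", "*", "3"]))  # ('+', '1', ('*', '2', '3'))
```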