Not to step on anyone's toes, but I just don't feel that formal grammar theory is that important in practice. :^)
It was probably decent when all you had was something like Pascal and you wanted to write a C compiler.
Parsing, compiling, interpreting, etc. are all much more at home in functional languages, and much easier to understand there. Once you understand them, you can translate back into imperative style.
For parsing: by default you should be using parser combinators.
Btw, since I mentioned parser combinators: those are basically just a front-end, similar to regular expressions. The implementation can be all kinds of things, e.g. recursive descent, a table, or backtracking. (Even finite automata, if your combinators are suitably restricted.)
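To make the "front-end" point concrete, here's a minimal sketch of what a combinator library might look like, assuming Python. All the names (`token`, `seq`, `alt`, `many`) are made up for illustration; this particular implementation happens to be backtracking recursive descent, but the same interface could sit on top of a table-driven or automaton-based engine.

```python
# Minimal parser-combinator sketch (illustrative names, not a real library).
# A parser is a function: (string, position) -> (value, new position), or None on failure.

def token(t):
    """Match the literal string t."""
    def parse(s, i):
        return (t, i + len(t)) if s.startswith(t, i) else None
    return parse

def seq(*parsers):
    """Match all parsers in order, collecting their values."""
    def parse(s, i):
        values = []
        for p in parsers:
            r = p(s, i)
            if r is None:
                return None
            v, i = r
            values.append(v)
        return (values, i)
    return parse

def alt(*parsers):
    """Try each parser in turn; first success wins (backtracking)."""
    def parse(s, i):
        for p in parsers:
            r = p(s, i)
            if r is not None:
                return r
        return None
    return parse

def many(p):
    """Match p zero or more times."""
    def parse(s, i):
        values = []
        while True:
            r = p(s, i)
            if r is None:
                return (values, i)
            v, i = r
            values.append(v)
    return parse

# "a", then "b", then zero or more of "c" or "d"
grammar = seq(token("a"), token("b"), many(alt(token("c"), token("d"))))
print(grammar("abcdc", 0))  # (['a', 'b', ['c', 'd', 'c']], 5)
```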
In the end, all the hard work in a compiler is in the back-end optimization phases. Put your mental energy there.
It does make me wonder, though, why grammars have to be so complicated that such high-powered tools are needed. Isn't the gist of LR/LALR that the states of an automaton that can parse CFGs can be serialised to strings, and the set of those strings forms a regular language? Once you have that, many desirable "infinitary" properties of a parsing automaton can be automatically checked in finite time. LR and LALR fall out of this, in some way.
Exactly this! A thousand times this!
NFA/DFA/derivative models are mostly unnecessary because ultimately, REG is just DSPACE(O(1)). That's it. Thinking in any other way is confusing the map with the territory. Furthermore, REG is extremely robust, because we also have REG = DSPACE(o(log log n)) = NSPACE(o(log log n)) = 1-DSPACE(o(log n)). For help with the notation, see here: https://en.wikipedia.org/wiki/DSPACE
The symbols "·" and "|" don't mean anything - I've chosen them to be visually intuitive: The "|" is supposed to look like a physical divider. Also, bracketed expressions "(...)" or "{...}" should be parsed first.
Wikipedia mentions that a variant of this got used in FORTRAN I. You could also speed up my naive O(n^2) approach by using Cartesian trees, which you can build using something suspiciously resembling precedence climbing.
Out of curiosity, what do you mean by this? Do you mean you like the prose, or the typesetting, or...?
The video is 3 hours long though, and I'm not sure the text he shows is available.
At this point he's talking about left leaning vs right leaning trees, after having already talked about one of them: https://youtu.be/fIPO4G42wYE?t=2256&si=aanthLGe-q8ntZez
Can confirm; yes it was helpful! I've never thought seriously about parsing and I've read occasionally (casually) about Pratt parsing, but this is the first time it seemed like an intuitive idea I'll remember.
(Then I confused myself by following some references and remembering the term "precedence climbing" and reading e.g. https://www.engr.mun.ca/~theo/Misc/pratt_parsing.htm by the person who coined that term, but nevermind — the original post here has still given me an idea I think I'll remember.)
Need to parse * before +? Begin at parse_add, have it call parse_mul for its left and right operands, and so on.
parse_mul() {
    left = parse_literal()
    while (is_mul_token()) { // left associative
        consume_token()
        right = parse_literal()
        left = make_mul_node(left, right)
    }
    return left
}

parse_add() {
    left = parse_mul()
    while (is_add_token()) { // left associative
        consume_token()
        right = parse_mul()
        left = make_add_node(left, right)
    }
    return left
}
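The sketch above can be made runnable with very little extra machinery. Here's one way to do it, assuming Python; the token list and the tuple AST nodes (standing in for make_mul_node/make_add_node) are my own choices for illustration.

```python
# Runnable version of the parse_add/parse_mul sketch above.
# Tokens are a list like ["1", "+", "2", "*", "3"]; tuples stand in for AST nodes.

def parse_expr(tokens):
    pos = [0]  # mutable cursor shared by the nested helpers

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def consume():
        t = peek()
        pos[0] += 1
        return t

    def parse_literal():
        return consume()  # assume any non-operator token is a literal

    def parse_mul():
        left = parse_literal()
        while peek() == "*":        # left associative
            consume()
            right = parse_literal()
            left = ("*", left, right)
        return left

    def parse_add():
        left = parse_mul()
        while peek() == "+":        # left associative
            consume()
            right = parse_mul()
            left = ("+", left, right)
        return left

    return parse_add()

print(parse_expr(["1", "+", "2", "*", "3"]))  # ('+', '1', ('*', '2', '3'))
print(parse_expr(["1", "+", "2", "+", "3"]))  # ('+', ('+', '1', '2'), '3')
```

Note how parse_add calling parse_mul for its operands is exactly what makes * bind tighter than +, and how the while loop is what makes each level left associative.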
Then just add more functions as you climb up the precedence levels.

Consider if you had functions called parse_user_ops_precedence_1, parse_user_ops_precedence_2, etc. These would simply take a table of user-defined operators as an argument (or reference some shared/global state), and participate in the same recursive callstack as all your other parsing functions.
parse_left_to_right(with, is_token) {
    left = with()
    while (is_token()) {
        operator = consume_token()
        right = with()
        left = operate(left, right, operator)
    }
    ret left;
}
p0() { ret lex digit or ident; };
p1() { ret parse_left_to_right(p0, is_mul); };
p2() { ret parse_left_to_right(p1, is_add); };
... and so on for all operators.

There's also the "shunting yard" algorithm, which is basically the iterative version of these algorithms (instead of recursive). It is usually presented with insufficient error checking, so it accepts invalid input, but there's actually no reason you have to do it like that.
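Here's a sketch of shunting yard for the same two operators, with the error checking that's usually left out. This is my own formulation, assuming Python; it produces the same tuple AST as the recursive versions above, and a simple expect-operand/expect-operator state machine is what rejects inputs like "1 +" or "+ 1".

```python
# Shunting-yard sketch for literals and left-associative + and *,
# with basic error checking so invalid input is rejected, not silently accepted.

PRECEDENCE = {"+": 1, "*": 2}

def shunting_yard(tokens):
    operands = []   # output stack of AST nodes (tuples)
    operators = []  # pending operator stack

    def reduce_top():
        # Pop one operator and two operands, push the combined node.
        if len(operands) < 2:
            raise ValueError("operator is missing an operand")
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    expect_operand = True  # catches "1 + + 2", "+ 1", etc.
    for t in tokens:
        if t in PRECEDENCE:
            if expect_operand:
                raise ValueError(f"expected operand, got {t!r}")
            # Pop while the stack top binds at least as tightly (left associativity).
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[t]:
                reduce_top()
            operators.append(t)
            expect_operand = True
        else:
            if not expect_operand:
                raise ValueError(f"expected operator, got {t!r}")
            operands.append(t)
            expect_operand = False

    if expect_operand:
        raise ValueError("input ends with a dangling operator (or is empty)")
    while operators:
        reduce_top()
    return operands[0]

print(shunting_yard(["1", "+", "2", "*", "3"]))  # ('+', '1', ('*', '2', '3'))
```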