Parsing Advances (matklad.github.io) | HN | 93 points | 15 hours ago | 4 comments
tgv
6 hours ago
So ... someone calls their parsing strategy "resilient LL parsing" without actually implementing LL parsing, a technique known since the 1970s, and then has an infinite recursion bug? Probably skipped Parsing 101.
smj-edison
13 hours ago
Huh, that's a really interesting approach. I just wrote my first Pratt parser a month ago, and one of the most annoying things was debugging infinite loops in various places (I had both tokenizer bugs where no characters were consumed, and parser bugs where a token was emitted but the cursor wasn't advanced).

It's doubly annoying in Zig, because the default test runner won't print stdout at all, and won't print stderr unless the program terminates by itself (Ctrl+C doesn't print). I resorted to building the test and running it manually, or jumping into a debugger to figure out recursion issues. It's working now, but if (really, when) I run into issues in the future, I'll definitely add some helper functions to check emitting invariants.
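One cheap guard against such non-advancing loops is to have the token stream itself assert that every loop iteration moved the cursor. This is a hypothetical sketch in Python, not the commenter's Zig code; all names here are illustrative.

```python
# Illustrative sketch: a token stream whose helpers turn a silent
# infinite loop into an immediate, diagnosable failure.
class TokenStream:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        self.pos += 1

    def checkpoint(self):
        return self.pos

    def assert_progress(self, checkpoint):
        # Call at the bottom of any parsing loop: if the cursor has not
        # moved since the checkpoint, the loop can never terminate.
        if self.pos == checkpoint:
            raise RuntimeError(f"parser made no progress at token {self.pos}")

def parse_list(stream):
    items = []
    while stream.peek() is not None:
        before = stream.checkpoint()
        items.append(stream.peek())
        stream.advance()  # deleting this line now raises instead of hanging
        stream.assert_progress(before)
    return items
```

The same check also catches tokenizer bugs where no characters are consumed, since the cursor comparison is oblivious to *why* progress stalled.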
someone_jain_
9 hours ago
It's also very annoying that one can't have two test names where one is a substring of the other.
eru
7 hours ago
Writing parsers by hand this way can be fun (and might be required for the highest performance ones, maybe?), but for robustness and ease of development you are generally better off using a parser combinator library.
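For context, the combinator style the commenter means can be sketched in a few lines. This is a minimal illustration in Python, not any specific library: a parser is just a function from `(text, pos)` to `(value, new_pos)` or `None` on failure.

```python
# Minimal parser-combinator sketch (illustrative names, no real library).
def char(c):
    def parse(s, i):
        return (c, i + 1) if i < len(s) and s[i] == c else None
    return parse

def seq(*parsers):
    # Run parsers one after another, threading the position through.
    def parse(s, i):
        values = []
        for p in parsers:
            r = p(s, i)
            if r is None:
                return None
            v, i = r
            values.append(v)
        return (values, i)
    return parse

def alt(*parsers):
    # Try alternatives in order; first success wins.
    def parse(s, i):
        for p in parsers:
            r = p(s, i)
            if r is not None:
                return r
        return None
    return parse

# Grammar: "ab" | "ac"
parser = alt(seq(char("a"), char("b")), seq(char("a"), char("c")))
```

Because each combinator owns its position bookkeeping, the "consumed nothing but looped anyway" class of bug is confined to the combinator library rather than scattered across the grammar.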
tubs
3 hours ago
Are you?

The majority of production compilers use hand-rolled parsers, ostensibly for better error reporting and panic-mode synchronization.

cipherself
2 hours ago
One anecdote in the same vein: a couple of months ago I wanted to parse systemd-networkd INI files in Python, and neither the built-in ConfigParser [0] nor pytest's iniconfig parser [1] could handle multiple sections with the same name. So I wrote two parsers, one using a parser-combinator library and one by hand, and ended up using the latter: it was much simpler to understand, and I didn't have to introduce an extra dependency.

Admittedly, INI is quite a simple format, hence I mention this as an anecdote.

[0] https://docs.python.org/3/library/configparser.html

[1] https://github.com/pytest-dev/iniconfig
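A hand-rolled parser in that spirit really is short. This is a hypothetical Python sketch, not the commenter's actual code: it keeps duplicate section names as separate entries (as systemd-networkd's `[Match]`-style files require) instead of merging them the way stdlib ConfigParser does.

```python
# Illustrative hand-rolled INI parser: duplicate section names are
# preserved as separate (name, dict) entries rather than merged.
def parse_ini(text):
    sections = []   # list of (name, {key: value}), duplicates preserved
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith((";", "#")):
            continue                        # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            current = (line[1:-1], {})      # new section, even if name repeats
            sections.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[1][key.strip()] = value.strip()
    return sections
```

The sketch ignores INI corner cases (line continuations, quoting) that a real systemd parser would need, which is exactly the "INI is quite a simple format" caveat above.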

thechao
1 hour ago
As a project gets larger, the cost of the impedance mismatch between third-party software and software customized to your project begins to outweigh the cost of owning the code directly.

I've got 10 full-time senior engineers on a project heading into its 15th year. We rewrite even extremely low-level code like std::vector or malloc to make sure it matches our requirements.

UNIX was written by a couple of dudes.

kccqzy
13 hours ago
How about another way, which is memoization: at each position in the source code we never attempt to parse the same production more than once. This solves infinite looping as discussed by the author because the “loop” will be downgraded by the memoization to execute once. Of course I wouldn't literally use a while loop in code to represent the production. I would use a higher-level abstraction to indicate one-or-more or zero-or-more in the production; indeed I would represent productions as data not code.

This also has another benefit of work sharing. A production like `A B | C B` will ensure that in case parsing A or C consumes the same number of characters, the work to parse B will be shared, despite not literally factoring the production into `(A | C) B`.
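The idea above can be sketched concretely. This is an illustrative packrat-style recognizer in Python (grammar representation and names are my own, not the commenter's): each `(production, position)` pair is parsed at most once, and seeding the cache with a failure before descending makes a left-recursive re-entry fail fast instead of looping forever. A full left-recursion handler would additionally "grow the seed", as in van Rossum's treatment.

```python
# Packrat-style sketch: productions are data (dict of alternatives),
# and a (rule, pos) cache guarantees each is tried at most once.
def parse(rules, start, text):
    cache = {}

    def apply(name, pos):
        key = (name, pos)
        if key in cache:
            return cache[key]       # shared work, and a recursion guard
        cache[key] = None           # seed: a reentrant call fails fast
        for alternative in rules[name]:   # a rule is a list of alternatives
            i, ok = pos, True
            for sym in alternative:       # symbols: rule names or literal chars
                if sym in rules:
                    r = apply(sym, i)
                    if r is None:
                        ok = False
                        break
                    i = r
                elif i < len(text) and text[i] == sym:
                    i += 1
                else:
                    ok = False
                    break
            if ok:
                cache[key] = i      # memoize the end position
                return i
        return cache[key]

    return apply(start, 0)

# expr -> term "+" expr | term ; term -> "1"
rules = {"expr": [["term", "+", "expr"], ["term"]], "term": [["1"]]}
```

The work sharing falls out of the same cache: two alternatives that reach the same rule at the same position hit the memo instead of reparsing.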

Porygon
6 hours ago
Memoization to limit left-recursive recursion is nicely described in Guido van Rossum's article here: https://medium.com/@gvanrossum_83706/left-recursive-peg-gram...

I recently tried that approach while simultaneously building an abstract syntax tree, but I dropped it in favor of a right-recursive grammar for now, since restoring the AST when backtracking got a bit complex.

luizfelberti
10 hours ago
I also find this an elegant way of doing it, and it's also how the Thompson VM style of regex engines works [0].

It's a bit harder to adapt the technique to parsers, because the Thompson NFA advances the input pointer by the same amount for every thread, while a parser production usually consumes a variable number of tokens, making it harder to run several parsing heads in lockstep.

[0] https://swtch.com/~rsc/regexp/regexp2.html

smj-edison
13 hours ago
That's a slick approach! Would you essentially keep a second counter that you set to the current cursor whenever you call `.currentToken()`, or something like that?