The PDF 2.0 spec says in section 7.5.3, "The body of a PDF file shall consist of a sequence of indirect objects representing the contents of a document." I'd read that as establishing the entire contents of the file body. Of course, real-world PDFs might have all sorts of garbage that a practical parser should be prepared for, but I don't think that it's condoned by the standard.
> Moreover, it is possible to place objects inside other objects. It's not advised but not prohibited.
I think the standard tokenization would prevent any string "obj" inside of an indirect object from actually being a keyword obj that starts a new indirect object. (And if the file body as a whole weren't tokenized from start to end, then "a sequence of indirect objects" would be nonsensical.)
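To make the tokenization point concrete, here is a rough sketch of a scanner that only reports `obj` when it appears as a standalone keyword. It is not a PDF lexer: it handles literal strings and comments but deliberately ignores stream data and hex strings. The function name `find_obj_keywords` and the helper `_is_boundary` are made up for illustration.

```python
# PDF whitespace and delimiter bytes; a keyword must be bounded by these.
_BOUNDARY = b"\x00\t\n\x0c\r ()<>[]{}/%"

def _is_boundary(data: bytes, pos: int) -> bool:
    # Start or end of input also counts as a token boundary.
    return pos < 0 or pos >= len(data) or data[pos] in _BOUNDARY

def find_obj_keywords(data: bytes) -> list:
    """Offsets of the keyword `obj` outside literal strings and comments."""
    hits = []
    i, depth, n = 0, 0, len(data)
    while i < n:
        c = data[i:i + 1]
        if depth > 0:
            # Inside a literal string: track nesting and skip escaped bytes.
            if c == b"\\":
                i += 2
                continue
            if c == b"(":
                depth += 1
            elif c == b")":
                depth -= 1
            i += 1
        elif c == b"(":
            depth = 1
            i += 1
        elif c == b"%":
            # Comment: skip to end of line.
            while i < n and data[i:i + 1] not in (b"\r", b"\n"):
                i += 1
        elif (data.startswith(b"obj", i)
              and _is_boundary(data, i - 1)
              and _is_boundary(data, i + 3)):
            hits.append(i)
            i += 3
        else:
            i += 1
    return hits
```

With this, an `obj` inside `(...)` or after `%` is never reported, and `endobj` is not mistaken for `obj` because the preceding byte is not a boundary.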
I think they wanted to demonstrate that their work can slice a stream by an offset table, in a declarative fashion. It is a useful property. I think OTF/TTF would have been a better choice for demonstrating this particular feature.
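For comparison, this is roughly what slicing by an offset table looks like when written imperatively against the sfnt (OTF/TTF) table directory. It is a simplified sketch, not a font parser: it skips checksum validation and the search hint fields, and the function name `slice_tables` is my own.

```python
import struct

def slice_tables(data: bytes) -> dict:
    """Return {tag: table_bytes} using the sfnt table directory."""
    # Header is 12 bytes: sfntVersion (u32), numTables (u16),
    # then three u16 search hints that this sketch ignores.
    _version, num_tables = struct.unpack_from(">IH", data, 0)
    tables = {}
    for i in range(num_tables):
        # Each 16-byte record: tag, checksum, offset, length (big-endian).
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", data, 12 + 16 * i
        )
        if offset + length > len(data):
            raise ValueError(f"table {tag!r} extends past end of file")
        # The offset and length define the interval that a sub-parser would see.
        tables[tag.decode("latin-1")] = data[offset:offset + length]
    return tables
```

Each `(offset, length)` pair is exactly the kind of interval that the grammar would express declaratively instead of through manual slicing.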
In fact, the authors state "PDF is picked because it is the most complicated format to our knowledge, which requires some unusual parser behaviors. We did not implement a full PDF parser due to its complexity, but a functional subset to show how IPGs can support some interesting features (...) PDF is a more complicated format. Our IPG grammar for PDF does not support full PDF parsing but focuses on how some interesting features in PDF are supported. As a result, the parser generated from our IPG PDF grammar can parse simple PDF files"
> With attributes and intervals, IPGs allow the specification of data dependence as well as the dependence between control and data.
> Moreover, parser termination checking becomes possible.
> To further utilize the idea of intervals, an interval-based, monadic parser combinator library is proposed.
This sounds like a well-behaved variant. Adding local attribute references simplifies the grammar and is tractably implemented.
This might support classifying and implementing formats by severability + composability, i.e., whether you can parse one part at the same time as another, or at least find/prioritize precursor structures like indexes.
The yet-unaddressed streaming case is most interesting:
> We can first have an analysis that determines if it is possible to generate a stream parser from an IPG: within each production rule, it checks if the attribute dependency is only from left to right. After this analysis, a stream parser can be generated to parse in a bottom-up way
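A toy version of that analysis, under my own assumptions about representation: a rule is a list of `(term_name, names_it_reads_attributes_from)` pairs, which is not how the paper encodes rules. A rule qualifies for streaming if every term depends only on terms to its left.

```python
def is_left_to_right(rule) -> bool:
    """rule: list of (term_name, set of term names whose attributes it reads)."""
    seen = set()
    for name, deps in rule:
        # A dependency on a term not yet parsed means we would have to look
        # ahead (or jump), so a single forward pass is not enough.
        if not deps <= seen:
            return False
        seen.add(name)
    return True

# Length-prefixed payload: the payload's interval depends on an earlier field.
length_prefixed = [("len", set()), ("payload", {"len"})]

# Trailer-indexed layout (PDF/ZIP style): the body depends on a later term.
trailer_indexed = [("body", {"trailer"}), ("trailer", set())]
```

Here `is_left_to_right(length_prefixed)` is true and `is_left_to_right(trailer_indexed)` is false, which matches the intuition that formats with a trailing index cannot be parsed in one forward pass.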
For parallel composition, you'd want to distinguish the attributes required by the consuming/combining (whole-assembly) operation from those only used in the part-parsing operation to plan the interfaces.
Aside from their mid-level parser combinators, you might want some binary-specific lowering operations (as they did with Int) tailored to your target architecture and binary encodings.
For the overall architecture, it seems wise for FlatBuffers et al. to expressly avoid unbounded hierarchy. Perhaps three phases (prelude+split, process, merge+finish) would be more manageable than the fully general dependency stages that arbitrary attribute dependencies make possible.
I would hate to see a parser technology discounted because it doesn't handle the crap of PDF or even MS XML. I'd be very interested in a language that could constrain/direct us to more performant data formats, particularly for data archives like genomics or semantics where an archive-resident index can avoid full-archive scans in most use cases.
For PDF, that's fair. Video "Types of PDF - Computerphile" covers this: https://www.youtube.com/watch?v=K7oxZCgO1dY
To be fair, the ability to stick a ZIP file at the end of any other kind of file enables all sorts of neat tricks (like the old self-extracting zips).
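A quick way to see the trick in action: Python's `zipfile` locates the archive from the end of the file, so it will read a ZIP that has arbitrary bytes in front of it. The helper names `make_polyglot` and `read_appended_zip` are made up for this sketch, and the "PDF" prefix is just placeholder bytes, not a valid document.

```python
import io
import zipfile

def make_polyglot(prefix: bytes, files: dict) -> bytes:
    """Build a ZIP in memory and glue it onto the end of `prefix`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return prefix + buf.getvalue()

def read_appended_zip(blob: bytes) -> dict:
    """Read the ZIP members from a blob that may have leading non-ZIP data."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}

blob = make_polyglot(b"%PDF-1.7 placeholder bytes, not a real PDF\n",
                     {"hello.txt": b"hi"})
members = read_appended_zip(blob)
```

The same end-anchored lookup is what lets a self-extracting archive be an executable and a ZIP at once.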
[0] https://github.com/sealmove/binarylang
[1] https://github.com/khaledh/elfdump/blob/master/elfparse.nim
I guess that's good for preventing off-by-one-based parsing errors, but surely there's prior art from long ago.
I once asked a question related to this on Computer Science Stack Exchange:
https://cs.stackexchange.com/q/60718
Would someone like to add this as an answer?
But indexing PDFs, now there's a fun one.