Interval parsing grammars for file format parsing (2023)
108 points
29 days ago
| 8 comments
| dl.acm.org
pointlessone
28 days ago
They definitely did not implement PDF parsing, not even a subset of it. They make assumptions that will result in incorrect parsing. For instance, they assume objects are tightly packed. They're not required to be; they should be, to save space, but it's not required. Moreover, it is possible to place objects inside other objects. It's not advised, but it's not prohibited either. As far as I can tell, this is where their PDF parsing ends. They don't parse the objects themselves (neither regular objects nor stream objects). So they chose PDF "because it is the most complicated format to our knowledge" but ended up just (incorrectly) chunking the stream by the offset table.
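To make it concrete, here's roughly what "chunking by offset table" amounts to; a toy Python sketch of my own, not the paper's code, and it bakes in exactly the tight-packing assumption I'm complaining about:

    import re

    def chunk_by_xref(pdf: bytes) -> list[bytes]:
        # Toy sketch: slice a PDF body into "objects" using only the
        # classic cross-reference table. Assumes a single xref section
        # and tightly packed objects -- the assumption at issue.
        start = int(re.search(rb"startxref\s+(\d+)", pdf).group(1))
        # Each in-use xref entry is a 10-digit offset, 5-digit gen, 'n'.
        offsets = sorted(
            int(m.group(1))
            for m in re.finditer(rb"(\d{10}) \d{5} n", pdf[start:])
        )
        # Pretend every object runs exactly up to the next offset.
        bounds = offsets + [start]
        return [pdf[a:b] for a, b in zip(bounds, bounds[1:])]

The moment objects aren't contiguous, or an object sits inside a stream, those slices are wrong.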
LegionMammal978
28 days ago
> For instance, they assume, objects are tightly packed. They're not required to. They should be to save space but are not required to.

The PDF 2.0 spec says in section 7.5.3, "The body of a PDF file shall consist of a sequence of indirect objects representing the contents of a document." I'd read that as establishing the entire contents of the file body. Of course, real-world PDFs might have all sorts of garbage that a practical parser should be prepared for, but I don't think that it's condoned by the standard.

> Moreover, it is possible to place objects inside other objects. It's not advised but not prohibited.

I think the standard tokenization would prevent any string "obj" inside an indirect object from actually being a keyword obj that starts a new indirect object. (And if the file body as a whole weren't tokenized from start to end, then "a sequence of indirect objects" would be nonsensical.)

pointlessone
28 days ago
An object can be placed into a stream without breaking any syntactic or semantic rules.
jahewson
28 days ago
Yeah, this work is far from what a real PDF parser requires. It's not uncommon for the lengths at the beginning of streams to be wrong, or for the data to be encoded in a format different from the one claimed. The offset table can also be wrong or missing.
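For illustration, the fallback a practical parser ends up with looks something like this (a hypothetical sketch of mine, not any particular library's code): ignore the xref table and rebuild it by scanning for object headers.

    import re

    def recover_objects(pdf: bytes) -> dict[tuple[int, int], int]:
        # Recovery pass: scan for "N G obj" headers directly and rebuild
        # the offset table from what is actually in the file.
        table = {}
        for m in re.finditer(rb"(\d+)\s+(\d+)\s+obj\b", pdf):
            num, gen = int(m.group(1)), int(m.group(2))
            # Later definitions shadow earlier ones, which conveniently
            # matches the semantics of incremental updates.
            table[(num, gen)] = m.start()
        return table

And of course that scan can itself misfire on stray "obj" bytes inside stream data, which is the point made elsewhere in this thread: recovery is heuristic all the way down.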
pointlessone
28 days ago
Malformed files are a whole other can of worms that a good parser should know how to deal with, but here it isn't even format-compliant.

I think they wanted to demonstrate that their work can slice a stream by an offset table, in a declarative fashion. It is a useful property. But I think they would've been better off picking OTF/TTF to demonstrate this particular feature.

airstrike
28 days ago
Sounds to me like that's more of an issue with the PDF specification than with the work presented in the paper, in which case that's hardly the metric by which we should measure its merit.
pointlessone
28 days ago
I'm not saying PDF is a good format. I'm pointing out that they made a poor choice in going with PDF. There are other formats they could've used to demonstrate this specific technique, like OTF/TTF, which is a more traditional binary format with a whole range of layout approaches, including offset tables.
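The sfnt container is basically nothing but an offset table: a fixed header, then numTables records of (tag, checksum, offset, length) pointing at intervals of the file. A rough Python sketch of that layout (my own, hypothetical, not from the paper):

    import struct

    def ttf_table_directory(font: bytes) -> dict[bytes, tuple[int, int]]:
        # Fixed 12-byte header: sfntVersion (u32), numTables (u16),
        # then searchRange/entrySelector/rangeShift, which we skip.
        _sfnt_version, num_tables = struct.unpack_from(">IH", font, 0)
        tables = {}
        for i in range(num_tables):
            # Table records start at byte 12, 16 bytes each.
            tag, _checksum, offset, length = struct.unpack_from(
                ">4sIII", font, 12 + 16 * i)
            tables[tag] = (offset, length)
        return tables  # e.g. {b'glyf': (1024, 51200), ...}

Each (offset, length) pair is literally an interval, which is why it'd be such a natural demo for this technique.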
airstrike
28 days ago
The abstract says "We have used IPGs to specify a number of file formats including ZIP, ELF, GIF, PE, and part of PDF". Sounds to me like they threw PDF in there to test the limits of the technique _after_ using it on a bunch of other unrelated formats.

In fact, the authors state "PDF is picked because it is the most complicated format to our knowledge, which requires some unusual parser behaviors. We did not implement a full PDF parser due to its complexity, but a functional subset to show how IPGs can support some interesting features (...) PDF is a more complicated format. Our IPG grammar for PDF does not support full PDF parsing but focuses on how some interesting features in PDF are supported. As a result, the parser generated from our IPG PDF grammar can parse simple PDF files"

cqqxo4zV46cp
28 days ago
Kinda sounds like you didn't read what they actually wrote in its entirety, but have still taken the first possible chance to jump in and quite aggressively tell them that what they're doing is wrong.
w10-1
28 days ago
Snippets from the summary about the most promising aspects:

> With attributes and intervals, IPGs allow the specification of data dependence as well as the dependence between control and data.

> Moreover, parser termination checking becomes possible.

> To further utilize the idea of intervals, an interval-based, monadic parser combinator library is proposed.

This sounds like a well-behaved variant. Adding local attribute references simplifies the grammar and is tractably implemented.
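Concretely, my naive reading of "interval-based, monadic" in Python terms (purely my own sketch of the idea, not the paper's actual library):

    from typing import Any, Callable

    # A parser runs on buf restricted to the interval [lo, hi) and
    # returns (value, new_lo); failure raises.
    Parser = Callable[[bytes, int, int], tuple[Any, int]]

    def u16be() -> Parser:
        def p(buf, lo, hi):
            if lo + 2 > hi:
                raise ValueError("u16 past interval end")
            return int.from_bytes(buf[lo:lo + 2], "big"), lo + 2
        return p

    def rest() -> Parser:
        # Consume the whole remaining interval as raw bytes.
        def p(buf, lo, hi):
            return buf[lo:hi], hi
        return p

    def bind(pa: Parser, f: Callable[[Any], Parser]) -> Parser:
        # Monadic bind: a later parser may depend on an earlier value,
        # e.g. a length field controlling where the payload ends --
        # the control/data dependence the quote mentions.
        def p(buf, lo, hi):
            a, mid = pa(buf, lo, hi)
            return f(a)(buf, mid, hi)
        return p

    def within(length: int, inner: Parser) -> Parser:
        # The interval primitive: run `inner` confined to the next
        # `length` bytes, then resume after that sub-interval.
        def p(buf, lo, hi):
            end = lo + length
            if end > hi:
                raise ValueError("interval exceeds parent interval")
            v, _ = inner(buf, lo, end)
            return v, end
        return p

    # A length-prefixed record: u16 length, then the payload parsed
    # strictly inside that many bytes.
    record = bind(u16be(), lambda n: within(n, rest()))
    assert record(b"\x00\x03abcXY", 0, 7) == (b"abc", 5)

The nice property is that `within` makes overruns a hard error instead of a silent read into a neighboring structure.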

This might support classifying and implementing formats by severability + composability, i.e., whether you can parse one part at the same time as another, or at least find/prioritize precursor structures like indexes.

The yet-unaddressed streaming case is most interesting:

> We can first have an analysis that determines if it is possible to generate a stream parser from an IPG: within each production rule, it checks if the attribute dependency is only from left to right. After this analysis, a stream parser can be generated to parse in a bottom-up way

For parallel composition, you'd want to distinguish the attributes required by the consuming/combining (whole-assembly) operation from those only used in the part-parsing operation to plan the interfaces.

Aside from their mid-level parser-combinators, you might want some binary-specific lowering operations (as they did with Int) specific to your target architecture and binary encodings.

For the overall architecture it seems wise for flatbuffers et al to expressly avoid unbounded hierarchy. Perhaps three phases (prelude+split, process, merge+finish) would be more manageable than fully-general dependency stages possible with arbitrary attribute dependencies.

I would hate to see a parser technology discounted because it doesn't handle the crap of PDF or even MS XML. I'd be very interested in a language that could constrain/direct us to more performant data formats, particularly for data archives like genomics or semantics where an archive-resident index can avoid full-archive scans in most use cases.

andrybak
28 days ago
> We have used IPGs to specify a number of file formats including ZIP, ELF, GIF, PE, and part of PDF

For PDF, that's fair. The video "Types of PDF - Computerphile" covers this: https://www.youtube.com/watch?v=K7oxZCgO1dY

jgalt212
28 days ago
I"ll watch any and all Professor Brailsford videos.
quotemstr
28 days ago
> ZIP files that are prefixed by random garbage can still be extracted by unzip but fail to be recognized by a parser that conforms to the format specification

To be fair, the ability to stick a ZIP file at the end of any other kind of file enables all sorts of neat tricks (like the old self-extracting zips).
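Python's zipfile shows this off nicely; it tolerates a prefix precisely because it hunts for the end-of-central-directory record from the back (a quick demo of mine, nothing to do with the paper):

    import io
    import zipfile

    # Build a zip, then glue arbitrary bytes in front of it -- the
    # self-extracting-archive layout.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("hello.txt", "hi")

    prefixed = b"#!/bin/sh\necho fake extractor stub\n" + buf.getvalue()
    with zipfile.ZipFile(io.BytesIO(prefixed)) as z:
        print(z.namelist())  # ['hello.txt']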

userbinator
28 days ago
That's because zip files are read from the end.
FreakLegion
28 days ago
And this is in fact what the spec lays out, contrary to the quote from the paper. The PK header is a convention. Conforming parsers don't require it, but lazy implementations often do. This has led to more than one security incident over the years.
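For reference, the end-first approach the spec describes is small enough to sketch (my own toy version, assuming no multi-disk archives):

    import struct

    EOCD_SIG = b"PK\x05\x06"  # end-of-central-directory signature
    EOCD_MIN = 22             # fixed-size part of the EOCD record

    def find_central_directory(data: bytes) -> tuple[int, int]:
        # The EOCD may be followed by a comment of up to 65535 bytes,
        # so scan backwards over that window for its signature.
        hunt = data[-(EOCD_MIN + 0xFFFF):]
        pos = hunt.rfind(EOCD_SIG)
        if pos < 0:
            raise ValueError("no EOCD record: not a zip")
        # Size and offset of the central directory sit at EOCD+12.
        size_cd, offset_cd = struct.unpack_from("<II", hunt, pos + 12)
        # offset_cd is relative to the start of the archive, so data
        # prefixed with garbage needs an extra fix-up here.
        return offset_cd, size_cd

Nothing in this path ever looks at a local "PK\x03\x04" header.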
bloatfish
28 days ago
Yeah, and PK is the per-record signature - it's not a file header. Did these guys read the format specification at all?
khaledh
28 days ago
Reminds me of binarylang[0] (a library for binary parsing written in Nim). I used it in a small hobby project to parse ELF binaries in a declarative manner (well, just the headers + string table)[1]; a quick Python equivalent of the header part is sketched below.

[0] https://github.com/sealmove/binarylang

[1] https://github.com/khaledh/elfdump/blob/master/elfparse.nim
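A hypothetical sketch of mine, nothing to do with binarylang's API:

    import struct

    def elf_header(data: bytes) -> dict:
        # Assumes a little-endian ELF64 file.
        if data[:4] != b"\x7fELF":
            raise ValueError("not an ELF file")
        # e_ident is 16 bytes; the rest of the ELF64 header follows it.
        (e_type, e_machine, _e_version, e_entry, _e_phoff, e_shoff,
         _e_flags, _e_ehsize, _e_phentsize, _e_phnum, _e_shentsize,
         e_shnum, e_shstrndx) = struct.unpack_from("<HHIQQQIHHHHHH", data, 16)
        return {"type": e_type, "machine": e_machine, "entry": e_entry,
                "shoff": e_shoff, "shnum": e_shnum, "shstrndx": e_shstrndx}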

aappleby
28 days ago
Is this really a new thing? It feels like they've just crammed a sliver of the same bog-standard parsing we've been doing for decades back into the CFG.

I guess that's good for preventing off-by-one-based parsing errors, but surely there's prior art from long ago.

matheusmoreira
28 days ago
So cool to see that binary format parsing is finally being formalized.

I once asked a question related to this on the Computer Science Stack Exchange:

https://cs.stackexchange.com/q/60718

Would someone like to add this as an answer?

revskill
28 days ago
How about MS Office documents?
tithe
28 days ago
DOCX, PPTX, and XLSX Microsoft Office files are actually ZIP archives (which the paper addresses). You can append a ".zip" extension to them and explore, e.g. with the snippet below.

https://en.wikipedia.org/wiki/Office_Open_XML
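A minimal example with Python's zipfile (hypothetical file name):

    import zipfile

    # An OOXML document is an ordinary zip; the XML parts are members.
    with zipfile.ZipFile("report.docx") as z:
        print(z.namelist())              # e.g. ['word/document.xml', ...]
        xml = z.read("word/document.xml")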

emj
28 days ago
Last time I tried to parse .docx, it was full of opaque binary blobs; it might be a zip, but parsing the data is like summoning arcane magic. It might have changed in the last decade, but considering that Microsoft has no incentive to make the situation better, parsing it is always going to be a "fun" exercise.
tithe
28 days ago
I was writing an indexer (ca. 2018), and I don't recall encountering opaque blobs, but parsing the ZIP file and XML (with a small C XPath scanner) was straightforward.

But indexing PDFs, now there's a fun one.
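The OOXML leg of such a pass fits in a few lines of Python, for the record (my own toy version with a hypothetical file name, not the indexer described above):

    import zipfile
    import xml.etree.ElementTree as ET

    W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

    # Unzip, then walk the w:t text runs inside word/document.xml.
    with zipfile.ZipFile("report.docx") as z:
        root = ET.fromstring(z.read("word/document.xml"))
    text = "".join(t.text or "" for t in root.iter(f"{W}t"))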

jahewson
28 days ago
The old office binary formats are basically a FAT file system containing streams of unremarkable records. Knowing what those records do is the hard part!
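The container side at least is well documented now (MS-CFB). The very first step looks like this; a sketch of mine, just the magic and sector size, with the actually hard part (the records in the streams) left out:

    import struct

    CFB_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

    def cfb_sector_size(data: bytes) -> int:
        if data[:8] != CFB_MAGIC:
            raise ValueError("not a compound file")
        # Sector shift lives at byte 30 of the fixed header (MS-CFB).
        (sector_shift,) = struct.unpack_from("<H", data, 30)
        return 1 << sector_shift  # 512 for v3 files, 4096 for v4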