Recreating Epstein PDFs from raw encoded attachments
168 points
1 day ago
| 14 comments
| neosmart.net
| HN
bawolff
45 minutes ago
[-]
Tesseract supports being trained for specific fonts; that would probably be a good starting point

https://pretius.com/blog/ocr-tesseract-training-data
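A minimal sketch of driving such a custom model from Python via pytesseract; the model name "base64mono", the tessdata path, and the input filename are all assumptions (they would come out of the training workflow in the linked post):

    # Run OCR with a Tesseract model trained on the document's monospace font.
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(
        Image.open("page_001.png"),                   # hypothetical page scan
        lang="base64mono",                            # custom traineddata file
        config="--tessdata-dir ./tessdata --psm 6",   # PSM 6: one uniform text block
    )
    print(text)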

reply
chrisjj
1 hour ago
[-]
> it’s safe to say that Pam Bondi’s DoJ did not put its best and brightest on this

Or worse. She did.

reply
eek2121
55 minutes ago
[-]
I mean, the internet is finding all her mistakes for her. She is actually doing alright with this. Crowdsource everything, fix the mistakes. lol.
reply
TSiege
22 minutes ago
[-]
This would be funnier if it wasn’t child porn being unredacted by our government
reply
chrisjj
33 minutes ago
[-]
Let's see her sued for leaking PII. Here in Europe, she'd be mincemeat.
reply
pyrolistical
1 hour ago
[-]
It decodes to a binary PDF, and there are only so many valid decodings. So this is how I would solve it:

1. Get an open source pdf decoder

2. Decode bytes up to first ambiguous char

3. See if the next bits are valid with a 1; if not, it's an l

4. Might need to backtrack if both 1 and l were valid

By being able to quickly try each character in the middle of the decoding process, you cut out the start-up cost. This makes it feasible to test the permutations automatically and linearly; a rough sketch is below.
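A minimal version of that backtracking search in Python. prefix_ok is a placeholder for whatever validity check gets plugged in (PDF header, object structure, flate sanity), and it assumes whitespace is already stripped and that only 1/l get confused:

    import base64

    AMBIGUOUS = {"1": "1l", "l": "l1"}   # each ambiguous character and its possible readings

    def resolve(text, prefix_ok, resolved=""):
        """Try each reading of an ambiguous character, decode every complete
        4-char Base64 quantum as soon as it's available, and prune any branch
        that the prefix_ok(bytes) validator rejects."""
        if not text:
            yield resolved
            return
        ch, rest = text[0], text[1:]
        for choice in AMBIGUOUS.get(ch, ch):
            candidate = resolved + choice
            whole = candidate[: len(candidate) - len(candidate) % 4]
            try:
                prefix = base64.b64decode(whole, validate=True)
            except ValueError:
                continue                  # not even valid Base64
            if not prefix_ok(prefix):
                continue                  # validator killed this branch early
            yield from resolve(rest, prefix_ok, candidate)

(For page-length input you'd want an explicit stack rather than recursion, and incremental decoding rather than re-decoding the prefix each time, but the pruning idea is the same. A cheap prefix_ok could just check that the bytes start with "%PDF-".)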

reply
bawolff
53 minutes ago
[-]
Sounds like a job for afl
reply
legitster
7 minutes ago
[-]
Given how much of a hot mess PDFs are in general, it seems like it would behoove the government to just develop a new, actually safe format to standardize around for government releases and make it open source.

Unlike everyone else who has attempted a PDF replacement, the federal government doesn't have to worry about adoption.

reply
kevin_thibedeau
44 minutes ago
[-]
pdftoppm and Ghostscript (invoked via Imagemagick) re-rasterize full pages to generate their output. That's why it was slow. Even worse with a Q16 build of Imagemagick. Better to extract the scanned page images directly with pdfimages or mutool.
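For example, something along these lines with PyMuPDF (the Python binding over the same MuPDF engine as mutool; the filename is a placeholder):

    # Pull the embedded scan images out by xref instead of re-rasterizing pages.
    import fitz  # PyMuPDF

    doc = fitz.open("release.pdf")
    for page_index in range(doc.page_count):
        for img in doc[page_index].get_images(full=True):
            xref = img[0]                           # image object's xref number
            info = doc.extract_image(xref)          # raw embedded bytes + metadata
            with open(f"p{page_index:03d}_{xref}.{info['ext']}", "wb") as out:
                out.write(info["image"])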
reply
velaia
24 minutes ago
[-]
Bummer that it's not December - the https://www.reddit.com/r/adventofcode/ crowd would love this puzzle
reply
pimlottc
1 hour ago
[-]
Why not just try every permutation of (1,l)? Let’s see, 76 pages, approx 69 lines per page, say there’s one instance of [1l] per line, that’s only… uh… 2^5244 possibilities…

Hmm. Anyone got some spare CPU time?

reply
wahern
1 hour ago
[-]
It should be much easier than that. You should be able to serially test whether each edit decodes to a sane PDF structure, reducing the cost similarly to how you can crack passwords when the server doesn't use a constant-time memcmp. Are PDFs typically compressed by default? If so, that makes it even easier given the built-in checksums. But it's just not something you can do by throwing data at existing tools. You'll need to build a testing harness with instrumentation deep in the bowels of the decoders. This kind of work is the polar opposite of what AI code generators or naive scripting can accomplish.
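For what it's worth, PDF streams are very commonly /FlateDecode, i.e. zlib-wrapped DEFLATE with an Adler-32 trailer, so a cheap early-exit check is possible. A rough sketch (the function name is mine):

    import zlib

    def flate_prefix_ok(data: bytes) -> bool:
        """True if `data` could be (the start of) a valid zlib/flate stream.
        Malformed bytes tend to raise zlib.error quickly, and a complete
        stream that decompresses cleanly has also passed its Adler-32 check,
        so a single wrong 1/l substitution is very likely to be caught."""
        d = zlib.decompressobj()
        try:
            d.decompress(data)
            return True
        except zlib.error:
            return False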
reply
cluckindan
40 minutes ago
[-]
On the contrary, that kind of one-off tooling seems a great fit for AI. Just specify the desired inputs, outputs and behavior as accurately as possible.
reply
percentcer
1 hour ago
[-]
This is one of those things that seems like a nerd snipe but would be more easily accomplished through brute force. Just get 76 people to manually type out one page each; you'd be done before the blog post was written.
reply
WolfeReader
1 hour ago
[-]
You think compelling 76 people to honestly and accurately transcribe files is something that's easy and quick to accomplish.
reply
fragmede
1 hour ago
[-]
> Just get 76 people

I consider myself fairly normal in this regard, but I don't have 76 friends to ask to do this, so I don't know how I'd go about doing this. Post an ad on craigslist? Fiverr? Seems like a lot to manage.

reply
Krutonium
42 minutes ago
[-]
Amazon Mechanical Turk?
reply
FarmerPotato
1 hour ago
[-]
If only Base64 had used a checksum.
reply
zahlman
1 hour ago
[-]
"had used"? Base64 is still in very common use, specifically embedded within JSON and in "data URLs" on the Web.
reply
bahmboo
27 minutes ago
[-]
"had" in the sense of when it was designed and introduced as a standard
reply
blindriver
22 minutes ago
[-]
On one hand, the DOJ gets shit because it was taking too long to produce the documents; on the other, they get shit because there are mistakes in redacting 3 million pages of documents.
reply
eek2121
56 minutes ago
[-]
Honestly, this is something that should've been kept private until every single one of the files is out in the open. Sure, mistakes are being made, but if you blast them onto the internet, they WILL eventually get fixed.

Cool article, however.

reply
linuxguy2
1 hour ago
[-]
Love this, absolutely looking forward to some results.
reply
iwontberude
1 hour ago
[-]
This one is irresistible to play with. Indeed a nerd snipe.
reply
netsharc
1 hour ago
[-]
I doubt the PDF would be very interesting. There are enough clues in the human-readable parts: it's an invite to a benefit event in New York (filename calls it DBC12) that's scheduled on December 10, 2012, 8pm... Good old-fashioned searching could probably uncover what DBC12 was, although maybe not; it probably wasn't a public event.

The recipient is also named in there...

reply
RajT88
56 minutes ago
[-]
There's potentially a lot of files attached and printed out in this fashion.

Searching the DOJ website (which we shouldn't trust) for the query "Content-Type: application/pdf; name=" yields maybe a half dozen or so similarly printed Base64 attachments.

There's probably lots of images as well attached in the same way (probably mostly junk). I deleted all my archived copies recently once I learned about how not-quite-redacted they were. I will leave that exercise to someone else.

reply
zahlman
1 hour ago
[-]
> …but good luck getting that to work once you get to the flate-compressed sections of the PDF.

A dynamic-programming-type approach might still be helpful. One version or the other of the character might produce invalid flate data while the other is valid, or might give an implausible result.

reply