It’s easy to extract the earlier versions, for example with a plain text editor. Just search for lines starting with “%%EOF”, and truncate the file after that line. Voila, the resulting file is the respective earlier PDF version.
(One exception is the first %%EOF in a so-called linearized PDF, which marks a pseudo-revision that is only there for technical reasons and isn’t a valid PDF file by itself.)
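If you want to script that, here is a minimal Python sketch of the truncation trick (the file name is made up; and per the caveat above, the first slice of a linearized PDF won't be a usable document on its own):

    # Split a PDF into its incremental revisions by truncating after each %%EOF.
    data = open("report.pdf", "rb").read()  # hypothetical input file

    offset = 0
    revision = 0
    while True:
        idx = data.find(b"%%EOF", offset)
        if idx == -1:
            break
        end = idx + len(b"%%EOF")
        # include the end-of-line character(s) right after the marker, if any
        while end < len(data) and data[end:end + 1] in (b"\r", b"\n"):
            end += 1
        revision += 1
        with open(f"report.rev{revision}.pdf", "wb") as out:
            out.write(data[:end])  # everything up to and including this %%EOF
        offset = end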
I recently learned that some people improve or brush up on their OSINT skills by trying to find missing people!
Incremental updates are also essential for PDF signatures: when you add a subsequent signature to a PDF, you can't rewrite the file without breaking the previous signatures. Hence signatures are appended as incremental updates.
PDFs don't change. PDFs are what they look like.
Except they aren't, because Adobe wanted to be able to (ahem) "annotate" them, or "save changes" to them. And Adobe wanted this because they wanted to sell Acrobat to people who would otherwise be using MS Word for these purposes.
And in so doing, Adobe broke the fundamental design paradigm of the format. And that has had (and continues to have, to hilarious effect) continuing security impact for the data that gets stored in this terrible format.
None of that could be accomplished with Word alone. I think you are underestimating the qualities of PDF for distribution of complex documents.
But they can! That's the bug: PDF is a mutable file format owing to Adobe's muckery. And you made the same mistake that every government redactor and censor (up to and including the ?!@$! NSA per the linked article) has made in the intervening decades.
The file format you thought you were using was a great fit for your problem, and better than MS Word. The software Adobe shipped was, in fact, something else.
However, arbitrary non-trivial PostScript files were of little use to people without a hardware or software rasteriser (and sometimes fonts matching the ones the author had, and sometimes the specific brand of RIP matching the quirks of the authoring software, etc.), so it was generally used by people in or near publishing. PDF was an attempt to make a document distribution format more suitable for ordinary people and ordinary hardware (remember the non-workstation screen resolutions at the time). I doubt that anyone imagined typical home users writing letters and bulletins in Acrobat, of all things (though it does happen). It would be like buying Photoshop to resize images (and waiting for it to load each time). So a competitor to Word it was not. Vice versa, a Word file was never considered a format suitable for printing. The more complex the layout and embedded objects, the less likely it would render properly on the publisher's system (if Microsoft Office existed for its architecture at all). Moreover, it lacked some features which were essential for even small-scale book publishing.
Append-only or versioned, index-plus-chunk file formats for things we consider trivial plain data today were common at the time. Files could be too big to rewrite completely on every save even without edits, just because of disk throughput and size limits. The system might not be able to load all of the data into memory because of addressing or size limitations (especially when we talk about illustrations at resolutions suitable for printing). Just like modern games only load the objects in the player's vicinity instead of copying all of the dozens or hundreds of gigabytes into memory, document viewers had to load only the objects in the area visible on screen. Change the page or zoom level, and wait until everything reloads from disk once again. Web browsers, for example, handle web pages of any length in the same fashion. I should also remind you that the default editing mode in Word itself in the '90s was not WYSIWYG, for similar performance reasons. If you look at the PDF object tree, you can see that some properties are set on the level above the data object, and that allows overwriting a small part of the index in the next version to change, say, position, without ever touching the chunk in which the big data itself lives (because appending a new version of that chunk, while possible, would increase the file size much more).
Document redraw speed can be seen in this random video. But that's 1999, and they probably got a really well performing system to record the promotional content. https://www.youtube.com/watch?v=Pv6fZnQ_ExU
PDF is a terrible format not because of that, but because its “standard” retroactively defined everything from the point of view of an Acrobat developer, and skipped all the corner cases and ramifications (because if you are an Acrobat developer, you define what is a corner case and what is not). As a consequence, unless you are in a closed environment you control, the only practical validator for arbitrary PDFs is Acrobat (I don't think that happened by chance). The external client is always going to say “But it looks just fine on my screen”.
PDF is designed to not require holding the complete file in memory. (PDF viewers can display PDFs larger than available memory, as long as the currently displayed page and associated metadata fits in memory. Similar for editing.)
ABCDE, to insert 1 after C: store D, overwrite D with 1, store E, overwrite E's old spot with the saved D, then write the saved E at the end.
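A toy Python sketch of that shuffle, on an in-memory buffer rather than a real file (the function name is made up):

    def insert_byte(buf: bytearray, pos: int, value: int) -> None:
        buf.append(buf[-1])            # grow by one: re-write the old last byte ("write E")
        for i in range(len(buf) - 2, pos, -1):
            buf[i] = buf[i - 1]        # shift everything after pos one to the right ("overwrite E with D")
        buf[pos] = value               # drop the new byte in ("overwrite D with 1")

    buf = bytearray(b"ABCDE")
    insert_byte(buf, 3, ord("1"))
    print(buf)                         # bytearray(b'ABC1DE')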
1) Rewrite the file to disk
2) Append the new data/metadata to the end of the existing file
I suppose you could pre-pad documents with empty blocks and then go modify those in situ by binary editing the file, but that sounds like a nightmare.
Ext4 support dates back as early as Linux 3.15, released in 2014. It is ancient at this point!
In addition, it’s generally nontrivial for a program to map changes to an in-memory object structure back to surgical edits of a flat file. It’s much easier to always just serialize the whole thing or, if the file format allows it, append the serialized changes to the file.
Apart from that, file systems manage storage in larger fixed-size blocks (commonly 4 KB). One block typically links to the next block (if any) of the same file, but that’s about the extent of it.
This is why “table of contents at the end” is such an exceedingly common design choice.
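PDF is a good example: the last few bytes of the file hold a startxref keyword followed by the byte offset of the cross-reference table, so a reader can seek straight to the index without scanning the body. A rough sketch of how a reader might locate it (hypothetical file name; this ignores cross-reference streams and other wrinkles):

    import re

    with open("doc.pdf", "rb") as f:
        f.seek(0, 2)                   # jump to the end of the file
        size = f.tell()
        f.seek(max(0, size - 1024))    # the trailer lives in the last few bytes
        tail = f.read()

    offsets = re.findall(rb"startxref\s+(\d+)", tail)
    if offsets:
        # the last startxref wins: it points at the most recent cross-reference table
        print("cross-reference table starts at byte", int(offsets[-1]))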
Note that all (edit: color/ink) printers print "invisible to the human eye" yellow dot codes, which contain their serial number and, in some cases, even the public IP address once they've connected to the internet (looking at you, HP and Canon).
So I'd be careful to use a printer of any kind if you're not in control of the printer's firmware.
There are lots of tools that have started to decode the information hidden in dot codes, in case you're interested [1] [2] [3]
[1] https://github.com/Natounet/YellowDotDecode
[2] https://github.com/mcandre/dotsecrets
[3] (when I first found out about it in 2007) https://fahrplan.events.ccc.de/camp/2007/Fahrplan/events/197...
It's mindboggling how much open-source 3d printing stuff is out there (and I'm grateful for it) but this is completely lacking in the 2d printing world
The MIC and yellow dots have been studied and decoded by many and all I've ever seen, including at your links, are essentially date + time + serial#.
Don't get me wrong ... stamping our documents with a fingerprint back to our printers and adding date and time is nasty enough. I don't see a need to overstate the scope of what is shared though.
I've got a black-and-white Brother printer which uses toner. Is there something similar for this printer?
A tiny yellow dot on white paper is basically invisible to the human eye. Yellow ink absorbs blue light and no other light, and human vision is crap at resolving blue details.
A tiny black dot on white paper sticks out like a sore thumb.
Excellent choice, that's what I am using. It's also Linux/CUPS compatible and works without a broken proprietary rasterizer.
But I've only seen research showing that it's possible. As far as I know nobody has demonstrated whether actual laser printers use that technique or not.
And of course we have to include the Wikipedia entry:
> (Added 2015) Some of the documents that we previously received through FOIA suggested that all major manufacturers of color laser printers entered a secret agreement with governments to ensure that the output of those printers is forensically traceable.
> This list is no longer being updated.
https://www.eff.org/pages/list-printers-which-do-or-do-not-d...
A more modern approach for text documents would be to have an LLM read, rephrase, and restructure everything without preserving punctuation and spacing, using a simple encoding like UTF-8, and then use the technique above or just take analog pictures of the monitor. The analog (film) part protects against deepfakes and serves as proof if you need it (for the source and final product alike).
There are various solutions out there, after the leaks that keep happening, where documents and confidential information are served/staged in a way that will reveal the person with whom they are shared. Even if you copy-paste the text into Notepad and save it in ASCII format, it will reveal you. Off-the-shelf printers are of course a big no-no.
If all else fails, that analog picture technique works best for exfil, but the final thing you share will still track back to you. I bet spies are back to using microfilms these days.
I only say all of that out of fascination with the subject and for the sake of discussion (think like a thief if you want to catch one, and all that). Ultimately, you shouldn't share private information with unauthorized parties, period. Personal or otherwise. If you, like Snowden, feel that all lawful means are exhausted and this is your only option to address some grievance, then don't assume any technique or planning will protect you; if it isn't worth the risk of imprisonment, then you shouldn't be doing it anyway. Assume you will be imprisoned or worse.
If really paranoid, I suppose one could run a filter on the image files to make them a bit fuzzy/noisy.
Very tempting to fool around with the ideas especially after the Epstein pdf debacle.
Recently someone else revisited the Snowden documents and also found more info, but I can't recall the exact details.
Snowden and the archives were absolute gifts to us all. It's a shame he didn't release everything in full though.
[1]: https://www.electrospaces.net/2023/09/some-new-snippets-from...
[2]: Part 2: https://libroot.org/posts/going-through-snowden-documents-pa...
and part 3: https://libroot.org/posts/going-through-snowden-documents-pa...
Hopefully we'll hear something now that the Christmas holidays are over.
Is there something in here so damaging that they refuse to publish it?
Did the government tell them they'd be in trouble if they published it?
Are the journalists the only ones with access to the raw files?
Of course, these concerns only apply when these "others" are Americans and American institutions.
Everybody else can just fend for themselves.
What's good for the goose should be good for the gander. If American journalists feel there is no problem with disclosing the secrets of, say, Maduro, then they should not be protecting people like Trump (just as an example).
mutool clean -d in.pdf out.pdf
If you look below you can see a Pages list (1 0 obj) that references (2 0 R) a Page (2 0 obj).

1 0 obj
<<
/Type /Pages
/Count 1
/Kids [ 2 0 R ]
>>
endobj
2 0 obj
<<
/Type /Page
/Contents 5 0 R
...
>>
endobj
Rather than editing the PDF in place, it's possible to overwrite these objects by appending a new "generation" of the object. Notice the 0 has been incremented to a 1 here. This allows leaving the original PDF intact while making edits.

1 1 obj
<<
/Type /Pages
/Count 2
/Kids [ 2 0 R 200 0 R ]
>>
endobj
You can have anything inside a PDF that you want really and it could be orphaned so a PDF reader never picks up on it. There's nothing to say an object needs to be referenced (oh, there's a "trailer" at the end of the PDF that says where the Root node is, so they know where to start). So it works kind of like a soft delete — dereference instead of scrubbing the bits.
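That soft-delete behaviour is also why the leaks happen: anything that scans the raw bytes will still find superseded or orphaned objects that the current cross-reference table no longer points at. A naive sketch (hypothetical file name; the regex will also trip over binary stream data, so treat it as an illustration only):

    import re

    data = open("doc.pdf", "rb").read()
    # list every "N G obj ... endobj" body, referenced by the xref table or not
    for m in re.finditer(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", data, re.DOTALL):
        num, gen = int(m.group(1)), int(m.group(2))
        print(f"object {num} gen {gen}: {len(m.group(3))} bytes at offset {m.start()}")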
Is this behavior generally explicitly defined in PDF editors (i.e. an intended feature)? Is it defined in some standard or set of best practices? Or is it a hack (or half baked feature) someone implemented years ago that has just kind of stuck around and propagated?
But yeah. It's all just objects pointing at each other. It's mostly tree-structured, but not entirely. You have a Catalog of Pages that have Resources, like Fonts (which are likely to be shared by multiple pages, hence not a tree). Each Page has Contents that are a stream of drawing instructions.
This gives you a sense of what it all looks like. The contents of a page are a stack-based vector drawing system. Squint a little (or stick it through an LLM) and you'll see that Tf switches to font F4 from the resources at size 14.66, Tj places a character at a position, etc.
2 0 obj
<<
/Type /Page
/Resources <<
/Font <<
/F4 4 0 R
>>
>>
/Contents 5 0 R
>>
endobj
5 0 obj
<<
/Length 340
>>
stream
q
BT
/F4 14.66 Tf
1 0 0 -1 0 .47981739 Tm
0 -13.2773438 Td <002B> Tj
10.5842743 0 Td <004C> Tj
ET
Q...
endstream
endobj
I'm going to hand wave away the 100+ different types of objects. But at its core it's a simple model.

"The PDF format allows for previous changes to be retained in a revised version of the document, thereby keeping a running history of revisions to the document.
This tool extracts all previous revisions while also producing a summary of changes between revisions."
It is disappointing they didn't mark those sections "redacted", with an explanation of why.
It is also disappointing they didn't have enough technical knowhow to at least take a screenshot and publish that rather than the original PDF which presumably still contains all kinds of info in the metadata.
And to be honest, the journalists have generally done great work on pretty much all of the other published PDFs. We've gone through hundreds and hundreds of the published documents, and these two documents were pretty much the only ones where a metadata leak accidentally revealed something significant (there are other documents with metadata leaks/failed redactions as well, but nothing huge). Our next part will be a technical deep dive on the PDF forensic/metadata analysis we've done.
Thank you.
I suspect you're inquiring about the use of LLMs, and about that I wonder: Why does it matter? Why are you asking?
By "hands-on" I'm asking whether the provided insight is the product of human intellection. Experienced, capable and qualified. Or at least an earnest attempt at thinking about something and explaining the discoveries in the ways that thinking was done before ChatGPT. For some reason I find myself using phrases involving the hands (etc. hands-on, handmade, hand-spun) as a metaphor for work done without the use of LLMs.
I emphasize insight because I feel like the series of work on the Snowden documents by libroot is wanting in that. I expressed as much the last time their writing hit the front page: <https://news.ycombinator.com/item?id=46236672>.
These are summaries. I don't think they yield information that couldn't otherwise be pointed out and mentioned by others, presumably known and reputable. For an event as high-profile as this, I'd expect someone covering it almost 16 years later to tell us more than what, judged on the merit of its import, amounts to a motivated section of the ‘Snowden disclosures’ Wikipedia entry.
The discussion that this series invites is typically centered on people's thoughts about the story of the Snowden documents in general, and in this case on exchanges about technical aspects like how PDF documents work and can be manipulated. The one comment that I feel addresses the actual tension embedded in the article—"Who edited the documents?"—leads to accusations that the documents were tampered with by the media: <https://news.ycombinator.com/item?id=46566372>. I don't think that's an implausible claim, but I take issue with it being made with such confidence by the anonymous source behind the investigations (I'm withholding ironically putting "investigations" in...nevermind).
If the author actually conveyed to the reader why this information is significant, what to do with or think about it, how they came to discover the answers to the aforementioned 'why' and 'what', and why their word ought to matter to us at all, I'd be less inclined to speculate that this is just someone vibe-sleuthing their way through documents that, on the surface, are only as significant to the public as the claim "the government is spying on you" is.
This particular post uncovers some nice information. It's a great find. I'm in no position to investigate whether it was already known. But what are we supposed to learn from it aside from "one of the documents was changed before it was made public"? What's significant about the redaction? Is Ryan Gallagher responsible? Or does he know who is? Is he at all obliged to explain this to a presumably anonymous inquirer? Or is it now the duty of the public to expect an explanation, as prompted by said anonymous inquirer?
Remember when believing that the government was rife with pedophiles automatically associated you with horn-helmet-wearing insurrectionists?