The war against PDFs is heating up (economist.com)
19 points | 1 hour ago | 11 comments
barrister
1 hour ago
Seems to be a weak pitch for an Israeli startup called Factify. Their new document type is also closed-source, which seems like an obvious showstopper for a ubiquitous global document replacement, especially in today's extremely heated and untrustworthy environment.

No strong argument imo for replacing the pdf.

sghaz
2 minutes ago
This looks like a sponsored article. Very poor quality.
maxloh
20 minutes ago
For context, here is the startup's website: https://www.factify.com/. The site consists of only two main pages: the landing page and a "careers" section.

Based on the site, the service appears to be little more than a document hosting platform with tracking features, such as monitoring who copied the document and the specific paragraphs they selected. They’ve intentionally omitted a download feature to prevent access to outdated versions, but otherwise, the experience seems no different from an ordinary PDF reader.

There is no mention of a "new standard" on their front page. I suspect they don't actually convert the documents. They likely just convert pages to encrypted images and use client-side rendering for text elements to allow for selection and copying.

pavel_lishin
1 hour ago
> Yet Duff Johnson, head of the PDF Association, protector of the format, argues that the fault lies not in the file type but in ourselves. He contends that there is no reason developers cannot build bots that are able to use PDFs. The AI assistant embedded in Acrobat, Adobe’s PDF reader, is designed to do precisely that, notes Leonard Rosenthol, the software firm’s PDF guru.

Designed to, but does it do it well without the problems noted earlier in the article?

ssl-3
30 minutes ago
Strictly anecdotally, I've had no trouble feeding PDFs to OpenAI's bot.

The searchable PDFs get searched, and the just-pictures-of-words ones get fed through their (quite good, IMHO) OCR.

I use it all the time. It's remarkably good for locating the details I need in the poorly-organized ~1,200 page factory manual for my Honda.

(Well, it's not necessarily organized poorly. It's just designed with the clear intent of mostly serving as a set of repair instructions, and sometimes I don't want repair instructions. Sometimes I want to know how a thing works for my own cognitive benefit, not how to diagnose and R&R it as a series of steps.)

cyberax
14 minutes ago
I'm using paperless-ngx for personal document management, and Claude Desktop was able to read and OCR all the PDFs there just fine (through an MCP connector).

It was also able to parse my tax forms in 3 languages.

Gualdrapo
35 minutes ago
Reminds me of this, which was posted a few days ago here on HN:

https://scottlocklin.wordpress.com/2023/05/31/djvu-and-its-c...

dhosek
1 hour ago
Well, that was a nonsense article. Badly written software has trouble with PDFs, accessibility is an afterthought (which, sadly, is true of most things), and some small group thinks they can invent a better wheel, ignoring the fact that they’d have to do a lot of work to overcome the first-mover advantages of HTML and PDF, and this comment now has more information than the original article thanks to that clause beginning with “ignoring”.
lsbehe
26 minutes ago
I'll miss getting documentation as a pile of pictures in a PDF.
cratermoon
53 minutes ago
There are PDF files and there are PDF files. Many (most?) PDFs I run into are generated from Microsoft Word or some other MS product with no structure at all. The majority of people who use MS products don't understand or care about structure. The WYSIWYG imperative means lots of markup describing font size, color, and decoration, making every section heading look the same without ever designating the text as a section head. The same happens with paragraphs, page breaks, and column flow. The resulting document looks correct enough to its creator. Other people, with a different version of Word, different fonts, and a thousand other little differences, won't see it correctly. That leads our author to generate a PDF, probably with embedded fonts, to ensure uniform appearance across those thousand little exceptions.

The result is a document with the content mixed up so incomprehensibly with appearance controls as to be both unreadable and without any residue of the underlying intended structure of the document's sections, headers, figures, paragraphs, captions, footnotes, or anything.

And then there's PDF files which are nothing more than a series of images of pages of text. If you're lucky and the scans are clean a good OCR might be able to recover most of the content.

What I'm saying is, it doesn't matter the tool, if authors don't encode structure and formatting in semantically meaningful ways.
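To make that concrete, here's a hypothetical pair of markup fragments (the snippet and names are mine, not from the article): both render as a big bold line in a browser, but only the semantic one survives naive extraction as a heading.

```python
import re

# Hypothetical fragments: visually identical once styled,
# but only one encodes "this is a heading" for a machine.
presentational = '<span style="font-size:18pt;font-weight:bold">Results</span>'
semantic = '<h2>Results</h2>'

def extract_headings(html: str) -> list[str]:
    """Pull out text inside h1-h6 tags; styled spans are invisible to this."""
    return re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html)

print(extract_headings(presentational))  # []
print(extract_headings(semantic))        # ['Results']
```

Any bot, screen reader, or indexer downstream faces the same asymmetry.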

tpm
47 minutes ago
So what you are actually saying is that there is a market for a tool that will recreate the PDF with a structure based on how the original PDF looks?
cratermoon
33 minutes ago
The market has been needing a tool like that for 30 years. A PDF document of the type I describe is like a broken egg. Information is lost between the authoring and rendering, to the extent that it's not clear recreating the original is even possible.
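For what it's worth, a minimal sketch of the heuristic such a tool might start from (the Span type and sizes are my invention, not any real product's API): treat the most common font size on a page as body text and promote anything larger to a heading level, bigger size meaning higher level.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Span:
    text: str
    font_size: float  # in points, as extracted from the rendered page

def infer_structure(spans: list[Span]) -> list[tuple[str, str]]:
    """Heuristically tag spans as headings or body text.

    Assumes the most common font size is body text; larger sizes
    become h1, h2, ... in descending order of size.
    """
    body_size = Counter(s.font_size for s in spans).most_common(1)[0][0]
    heading_sizes = sorted(
        {s.font_size for s in spans if s.font_size > body_size}, reverse=True
    )
    tagged = []
    for s in spans:
        if s.font_size > body_size:
            level = heading_sizes.index(s.font_size) + 1
            tagged.append((f"h{level}", s.text))
        else:
            tagged.append(("p", s.text))
    return tagged
```

Real documents would of course also need boldness, indentation, numbering, and whitespace cues, which is part of why nobody has nailed this in 30 years.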
pessimizer
23 minutes ago
A typesetter could recreate the document by looking at it, doing some font research, and playing with the kerning for a while. Saying it's impossible to recreate a readable typeset document is absurd, no matter how twisted and insane the actual PostScript is.
pessimizer
26 minutes ago
The war against pdfs is based on AI being too stupid to read them? That's a condemnation of AI, not pdfs. I, a natural intelligence, can easily read pdfs.