FilterHN

DjVu and its connection to Deep Learning (2023)

85 points

by tosh

1 month ago

| 7 comments

| scottlocklin.wordpress.com

▲

stared

1 month ago

Oh, my favourite format during my undergraduate time! Most books in mathematics and physics (some old and niche) were available in the "Russian library".

At the same time, I haven't yet seen DjVu used in a legit way.

▲

cxr

1 month ago

Licensing concerns resulted in DjVu being originally preferred over PDF by archive.org and WMF projects like Wikipedia. With baseline PDF now being unencumbered and the widespread existence of FOSS readers, PDF is both the de jure and de facto standard across even those sites.

▲

gwern

1 month ago

Also, PDF caught up on size with JBIG2, and tooling/support keeps getting worse.

(Not so fun fact: if you punch "filetype:djvu" into Google right now, you can easily page through what supposedly is every DjVu file on the Internet as far as Google knows, which is not many: "In order to show you the most relevant results, we have omitted some entries very similar to the 300 already displayed." I learned this the hard way when I began wondering why a bunch of DjVu fulltexts I hosted never seemed to show up in Google or Google Scholar...)

▲

Rochus

1 month ago

I'm using DjVu files every day. It's just a great format. I have a lot of archived documents which are much faster to use and require much less space than equivalent PDF documents.

▲

jesuslop

1 month ago

As I understand it, the technology was protected by a patent held by the guys at Leptonica, and it has expired. There is a crude project for encoding images to JBIG2 at https://github.com/agl/jbig2enc. I am sharing my personal scripts here [1] (Windows) that wrap it for end-to-end DjVu-to-PDF conversion of scanned texts, using JBIG2-compressed images in the PDF instead of JPEG. This combines decent compression with PDF handiness. DjVu still compresses better, but PDFs can be got under twice the size. That may not sound impressive, but many commonly available pipelines produce sizes x3, x4 and worse, a particular offender being those that use Ghostscript's pdfwrite. The scripts have worked for months locally but are given "as is", without testing and with zero support; you deal with the Python dependencies and with having the jbig2 and DjVuLibre tools in the path. Beyond image compression tech, they support migrating the OCR layer (cut/pasteability), bookmarks and page labels from DjVu to PDF.

[1] https://github.com/jesuslop/djvu2pdf-test
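For readers who want a feel for what such a wrapper has to do, here is a minimal sketch of the first two stages only: rendering pages with DjVuLibre's `ddjvu` and encoding them with jbig2enc's `jbig2`. It is not the script linked above. The function name, the working-directory layout and the `page_count` parameter are made up for illustration, and the flags are the ones I believe those tools accept, so check them against your installed versions. Nothing is executed; the commands are only assembled so they can be inspected.

```python
import shlex


def build_pipeline(djvu_path, workdir, page_count, basename="output"):
    """Assemble (but do not run) commands for DjVu -> bitonal TIFF -> JBIG2.

    Hypothetical helper: assumes `ddjvu` (DjVuLibre) and `jbig2` (jbig2enc)
    are on PATH and that `workdir` already exists.
    """
    commands = []
    tiffs = []
    for n in range(1, page_count + 1):
        tiff = f"{workdir}/page-{n:04d}.tiff"
        # Render one page in black and white to its own TIFF file.
        commands.append(
            ["ddjvu", "-format=tiff", "-mode=black", f"-page={n}", djvu_path, tiff]
        )
        tiffs.append(tiff)
    # Symbol-mode JBIG2 encoding (-s) that writes PDF-ready segments (-p)
    # into files whose names start with <workdir>/<basename> (-b).
    commands.append(["jbig2", "-s", "-p", "-b", f"{workdir}/{basename}", *tiffs])
    return commands


if __name__ == "__main__":
    for cmd in build_pipeline("book.djvu", "work", page_count=3):
        print(shlex.join(cmd))
```

Wrapping the resulting segments into an actual PDF, and carrying over the OCR layer, bookmarks and page labels, is the harder part and is deliberately left out of this sketch.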

▲

jesuslop

1 month ago

I closed public access due to lack of interest

▲

qdotme

1 month ago

Another reason why I think it failed (TIL Yann LeCun was a coauthor) is its association with the pirate books/articles community.

When I came across this format in my college days, while handling lots of scanned material, it always triggered the mental “don’t install suspicious software” block. Which is a shame since, as the article points out, it was the superior format.

▲

joecool1029

1 month ago

Really hate that archive abandoned it. djvu files are much smaller, faster, and higher quality than pdf. The real reason for abandoning it was probably to allow for the DRM needed for controlled access lending, because it’s a garbage choice otherwise.

▲

nico_h

1 month ago

I don’t know how relevant the samples are, but while the details are lost, the essence seems well preserved. It seems it would be really useful for performing OCR on.

▲

qingcharles

1 month ago

Ironically, because of poor software support and lack of knowledge about the format, most DjVus are slowly being converted to PDFs.

▲

EvanAnderson

1 month ago

A court in my local government has been using a document imaging system since the early 2000s. It stored documents as DjVu files until a couple of years ago, when the vendor re-encoded all the documents as PDF to comply with file storage format mandates from my state Supreme Court. It made me really sad.

▲

anthk

1 month ago

DjVu/DjView is libre software with open standards. The "lack of knowledge" issue is a bit bullshit.

▲

MrDrMcCoy

1 month ago

So, I love DjVu and think it's a superior format to PDF. _Consuming_ DjVu is easy, but when was the last time you interacted with the tools to _create_ them? I can say from direct experience that they are awful.

▲

anthk

1 month ago

Ghostscript should have shipped DjVu drivers by default long ago.

▲

qingcharles

1 month ago

I mean "lack of knowledge" insofar as most people have never come across a DjVu file in their lives, and when they get one they find that their system won't open it, so they go looking for a PDF version of the same document instead.

▲

vee-kay

1 month ago

DjVu is an excellent format for e-Comics and e-Magazines.

Check out Amazing Science Fiction Stories, Amazing Stories, Planet Stories, Weird Tales and more, all in DjVu format: https://commons.wikimedia.org/wiki/Category:Scanned_English_...

▲

aidenn0

1 month ago

Note that PDF:

1. Supports JPEG2000 compression, which is very similar to what DjVu uses for images

2. Supports JPEGs compressed with jpegli which is competitive with DjVu at higher quality settings

3. Supports JBIG2 for bi-level images, which is very similar to what DjVu uses for bi-level layers.
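To see which of these codecs a given PDF actually uses, a crude check is to count the corresponding stream filter names in the raw file. The sketch below is only a heuristic under stated assumptions: `/JPXDecode`, `/DCTDecode` and `/JBIG2Decode` are the filter names PDF uses for JPEG 2000, JPEG and JBIG2, while the function name and the byte-scanning approach are mine, not part of any tool mentioned in this thread. It cannot tell a jpegli-encoded JPEG from any other JPEG, and it will miss names written with unusual escaping.

```python
# PDF stream filter names for the three codecs listed above.
FILTERS = {
    b"/JPXDecode": "JPEG 2000",
    b"/DCTDecode": "JPEG",
    b"/JBIG2Decode": "JBIG2",
}


def count_image_filters(pdf_bytes: bytes) -> dict:
    """Count occurrences of each filter name in the raw bytes of a PDF.

    Heuristic only: it scans for the literal names rather than parsing
    the file, so treat the numbers as a rough hint.
    """
    return {label: pdf_bytes.count(name) for name, label in FILTERS.items()}


if __name__ == "__main__":
    # Synthetic fragment standing in for a real file's contents.
    sample = (
        b"<< /Type /XObject /Subtype /Image /Filter /JBIG2Decode >>\n"
        b"<< /Type /XObject /Subtype /Image /Filter /DCTDecode >>\n"
        b"<< /Type /XObject /Subtype /Image /Filter /JBIG2Decode >>\n"
    )
    print(count_image_filters(sample))
```

For a real file, pass `open(path, "rb").read()` instead of the synthetic sample.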

▲

jbaber

1 month ago

Any combination of ghostscript flags or something to turn a random pdf into one that uses these things to make a pdf as fast and small as a djvu?

▲

aidenn0

1 month ago

https://github.com/internetarchive/archive-pdf-tools

Though note that this uses j2k by default and jpegoptim for JPEGs. For pages that are mostly just images (e.g. color comics) I prefer to use cjpegli on each page and img2pdf to combine them to a PDF.

Modifying archive-pdf-tools to allow use of cjpegli is something I keep meaning to look into[1], but not at the top of my list.

[1]: In my tests, cjpegli is more consistent than j2k compressors; that is, for each image there is a setting at which j2k does as well as, or better than, JPEG, but there is no setting at which j2k averages better than cjpegli, because cjpegli just does such a good job of compressing aggressively while always looking good.
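The cjpegli-per-page plus img2pdf route described above could be scripted roughly as follows. This is a sketch, not a tested tool: it assumes `cjpegli` is on PATH and accepts `-q` for quality (which is my understanding of its command line), that the third-party `img2pdf` Python package is installed, and that pages are PNG files whose sorted names give the page order. The function names and the default quality of 85 are made up for illustration.

```python
import subprocess
from pathlib import Path


def cjpegli_command(src: Path, dst: Path, quality: int = 85) -> list:
    """Command line that re-encodes one page image with cjpegli."""
    return ["cjpegli", str(src), str(dst), "-q", str(quality)]


def pages_to_pdf(page_dir: str, out_pdf: str, quality: int = 85) -> None:
    """Re-encode every PNG in page_dir with cjpegli, then bundle into one PDF."""
    # Imported lazily so cjpegli_command() stays usable without img2pdf.
    import img2pdf

    jpegs = []
    for src in sorted(Path(page_dir).glob("*.png")):
        dst = src.with_suffix(".jpg")
        subprocess.run(cjpegli_command(src, dst, quality), check=True)
        jpegs.append(str(dst))
    if not jpegs:
        raise ValueError(f"no PNG pages found in {page_dir}")
    # img2pdf embeds the JPEG data as-is rather than re-encoding it.
    with open(out_pdf, "wb") as fh:
        fh.write(img2pdf.convert(jpegs))
```

A call such as `pages_to_pdf("scans/", "comic.pdf", quality=80)` would then produce a single PDF; the right quality value depends on the material and is worth testing on a few pages first.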

▲

ValdikSS

1 month ago

Ghostscript does not support JBIG2 encoding, only decoding.

▲

rahimnathwani

1 month ago

Right, if you look at PDF files from Internet Archive, they're usually compressed with MRC (Mixed Raster Content).

IIRC each page has three layers:

- background (jpeg, color)

- foreground (jbig2, monochrome maybe?)

- mask (indicating whether foreground or background should be shown at this point)

https://github.com/internetarchive/archive-pdf-tools
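As a toy illustration of the three-layer idea described above (and not of how archive-pdf-tools is implemented), compositing a page amounts to picking, per pixel, the foreground where the mask is set and the background elsewhere. The function name and the tiny 2x3 "images" below are made up; real layers are usually stored at different resolutions and scaled before compositing.

```python
def mrc_composite(background, foreground, mask):
    """Combine MRC layers: take foreground where mask is truthy, else background.

    All three arguments are equally sized 2D lists (rows of pixels).
    """
    return [
        [fg if m else bg for bg, fg, m in zip(bg_row, fg_row, m_row)]
        for bg_row, fg_row, m_row in zip(background, foreground, mask)
    ]


if __name__ == "__main__":
    background = [["b", "b", "b"], ["b", "b", "b"]]  # e.g. paper texture
    foreground = [["F", "F", "F"], ["F", "F", "F"]]  # e.g. ink colour
    mask = [[0, 1, 0], [1, 1, 0]]                    # 1 = show foreground
    for row in mrc_composite(background, foreground, mask):
        print("".join(row))
```

The split pays off because the sharp, high-contrast mask compresses well with a bi-level codec, while the smooth background tolerates heavy lossy compression.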