In a recent "Ask HN: What are you working on?" thread, I mentioned I was working on OCRing a large book:
https://news.ycombinator.com/item?id=41971614
The post generated some interest so I thought I would keep HN posted.
The book is Saint-Simon’s Memoirs -- an invaluable historical account of the French court under Louis XIV, full of wit, sharp observations, and of incredible literary value. I'm OCRing the reference edition, published between 1879 and 1930, which contains a lot of comments and footnotes: 45 volumes, ~27,000 pages.
Here's a link to a blog post that describes the techniques used so far (the project is still ongoing):
https://blog.medusis.com/38_Adventures+in+OCR.html
But you may also directly access the result here:
https://divers.medusis.net/boislisle/pub
This web app (not optimized for mobile, sorry) solves a tricky problem of preloading images efficiently. In short: preloading the next image isn't enough, since browsers will repaint if an image is moved, or scaled. Or browsers won't paint at all if visibility is hidden or opacity is zero, and will paint only when those values change. On an average, slow machine, this takes visible time. But if an image is simply behind another element, it will be painted, and the removal of the covering element or changing the z-index will not trigger a repaint.
(Preloading is important because it lets one review results fast; if one has to wait 150-200 ms between images it's simply discouraging).
Would love to hear feedback; happy to answer any question!
Getting higher quality scans could save you some headaches. Check the Internet Archive. Or, get library copies, and the right camera setup.
ScanTailor might help; it lets you semi-automate a chunk of things, with interactive adjustments. I don't know how its deskewing would compare to ImageMagick. The signature marks might be filtered out here.
I wrote out some of my process for handling scans here - https://github.com/norvig/paip-lisp/releases/tag/v1.2 . I maybe should blog about it.
If you get to the point of collaborative proofreading, I highly recommend Semantic Linefeeds - each sentence gets its own line. https://rhodesmill.org/brandon/2012/one-sentence-per-line/ I got there by:
* giving each paragraph its own line
* then, linefeed at punctuation, maybe with quotation marks and parentheses? It's been a while
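The two steps above can be sketched roughly as follows. This is only an illustration, not the script I actually used: the function name `semantic_linefeeds` and the punctuation rule are assumptions, and abbreviations such as "M. Chéruel" or French-style spacing before closing guillemets would need extra handling.

```python
import re

# Sentence-ending punctuation, optionally followed by closing quotes or
# brackets, then whitespace. This rule is an assumption for illustration.
_SENTENCE_END = re.compile(r"""([.!?…][»"”')\]]*)\s+""")


def semantic_linefeeds(text):
    """Return text with paragraphs separated by blank lines and one sentence per line."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    out = []
    for para in paragraphs:
        # Step 1: give the paragraph its own single line.
        one_line = " ".join(para.split())
        # Step 2: insert a linefeed after each sentence-ending punctuation mark.
        out.append(_SENTENCE_END.sub(r"\1\n", one_line))
    return "\n\n".join(out)
```

With one sentence per line, a diff in git points at the sentence that changed rather than at a whole reflowed paragraph.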
For now I reconstruct paragraphs in html but I could do markdown just as well (where paragraph breaks are marked by double line breaks, and single line breaks don't count).
Collaborative proofreading would be cool but it would require some way of properly tracking who wrote what, and I'm not sure what to use or if I should build a simple system from scratch. Do you have recommendations?
ScanTailor was tricky to start with; dunno if there's a manual. I remember belatedly realizing that there's automation at each step, that one can then quickly skim and manually adjust.
For collaborative editing, git via GitHub worked for us. Tracking who did what, and when, is easy. It allowed for sweeping edits covering multiple chapters. Building some porcelain on top of that, for less technical folks, could be good.
roughly:
> To obtain a Gallica document in high definition, contact utilisation.commerciale@bnf.fr.
My expectations would be very low, but I'd reach out to them anyway.
I got very good at this but page turning was a bore.
The PDF turned out in a mechanical fashion without much effort.
I made a few scripts to do TIFF to PDF, then stitching them and doing OCR.
https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...
I think it's not a bad result, and any minor imperfections could be revised easily in the markdown. My feature to turn the document into presentation slides got a bit confused because of the French language, so some slides ended up getting translated into English. But again, it wouldn't be hard to revise the slide contents using ChatGPT or Claude to make them all either French or English:
https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...
Footnotes are an important part of this project, if not the main point. If they are mangled with the main text then it's pointless. In your rendering there doesn't seem to be footnotes at all? just text with random titles here and there, and even more random tables that (to me) make no sense.
I wouldn't call that "minor imperfections". As it is, it really isn't usable.
Funny how vision models would almost be able to one-shot it, modulo some hallucination issues. Some of the research back then ~2020 was starting to use vision models for layout generations.
> Trying to get LLMs with vision to properly identify zones also were found to be slow and unreliable, and the risk of hallucinated results is unacceptable, especially as a first step. Non-deterministic systems may be fine for creative projects, but not here. (Once we have a reliable reference we can then play with LLMs and if necessary, control the results by measuring the distance to the source.)
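One way to read "measuring the distance to the source" is as a similarity score between the trusted OCR text and whatever an LLM returns, rejecting the LLM version when it drifts too far. A minimal sketch with Python's standard `difflib` follows; the function names and the 0.98 threshold are assumptions for illustration, not something the post specifies.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so layout differences don't count."""
    return " ".join(text.lower().split())


def similarity(reference, candidate):
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    matcher = SequenceMatcher(None, normalize(reference), normalize(candidate), autojunk=False)
    return matcher.ratio()


def accept(reference, candidate, threshold=0.98):
    """Accept the candidate only if it stays close to the reference text.

    The threshold is an illustrative value and would need tuning on real pages.
    """
    return similarity(reference, candidate) >= threshold
```

A corrected spelling barely moves the ratio, while an invented sentence lowers it noticeably, which is the property needed to keep a non-deterministic step under control.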
He tried it for fixing footnotes and the result went "classic LLM":
> It was a complete flop. Using OpenRouter, I tested over 200 models. More than 70% couldn't even count the footnotes right, but that wasn't the worst part.
The "best" models just made stuff up to meet the requirements. They lied in three ways:
- Basic (stupid) lies: wrong counts but claiming they matched ('foonotes: 5, references: 3, match: true')
- Better lies: claiming they placed references when they hadn't
- Premium lies: making up new text to attach footnotes to when they weren't sure where they went (against explicit instructions in the prompt never to do that)
I wonder what eventually happened with Ocropus which was supposed to help with page segmentation. I was a bit disappointed to see that this article used Google Vision as its OCR engine. I was hoping for something self hosted.
I uploaded a new item - https://archive.org/details/memoires-de-saint-simon-nouvelle... - though I made a mess of the metadata. It's still processing.
The edition I'm working on here, the "Boislisle", is completely different thanks to the richness and coverage of its footnotes (but the main text should be almost identical).
> The main difficulty of the is project lies in correctly identifying page zones; wouldn't it be possible to properly find the zones during the OCR phase itself instead of rebuilding them afterwards?
Anyone curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so no hallucination side effects. It also preserves the layout of the input document for more context and clarity.
[1] https://unstract.com/llmwhisperer/
Examples of extracting complex layout:
Google Vision already returns the coordinates of each word (and even of each letter), so it's easy to know where the word was on the page, and even, if necessary, to rebuild the page with the words correctly placed -- that's fundamentally what I do with the mouseover on the interactive demo: https://divers.medusis.net/boislisle/pub (at the paragraph level).
But my problem isn't to know where the words are (Google Vision provides that); it's to know what belongs to what, what is footnotes, what is main text, etc. This is what the post discusses. Just having the text following the same layout as in the original wouldn't help, because I'm not trying to reproduce the layout or the typesetting, I want to rebuild the content semantically, so as to do different "flows".
That said, it got me thinking... there may be an opportunity to do a cheaper version of LLMWhisperer? ;-)
You should give a chance to groundx from eyelevel. Ready to discuss this further with you, and open an account on Ground x for you to test it: https://www.groundx.ai/
Thibaut
But we'll get there eventually with AIs. We'll be able to tell: "Find me the exact font, styles, etc. And re-render it using InDesign (or LaTeX or whatever fancies you), then compare with the source and see what you got wrong. Rinse and repeat".
We'll eventually have the ability to do just that.
Reliably identifying the superscript locations should be enough since they are in the same order as the footnotes.
It's a little early for feature requests... but I would love to see an EPUB edition! It shouldn't be too hard once done with the hard work of getting the data structured.
Now another idea is to manually put placeholders for footnote references in the text, and then number them automatically. Before that, I manually enter the number of footnotes on each page, for verification. I have already done this for the first two volumes; it's pretty fast. Having the number of footnotes on a page lets me:
- check that the number of footnotes is correct
- (and therefore) check that the footnote numbers are correct (from 1 to n, in order)
- check that the number of footnote references is correct (it should exactly match the number of footnotes)
- and finally, properly number the placeholders.
Manually inputting numbers in the main text would be very difficult and error-prone, but simply putting placeholders and checking them automatically should be much faster and safer.
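A rough sketch of what that check-then-number step could look like. The placeholder token `[*]`, the function name, and the assumption that each footnote starts with "N." are all hypothetical; footnotes continued from a previous page would have to be set aside before calling it.

```python
import itertools
import re

PLACEHOLDER = "[*]"  # hypothetical marker typed by hand where a reference belongs


def number_placeholders(main_text, footnotes, expected_count):
    """Verify the counts for one page, then replace placeholders with 1..n."""
    if len(footnotes) != expected_count:
        raise ValueError(f"expected {expected_count} footnotes, found {len(footnotes)}")
    # Footnote numbers must run from 1 to n, in order.
    for position, note in enumerate(footnotes, start=1):
        match = re.match(r"\s*(\d+)\.", note)
        if match is None or int(match.group(1)) != position:
            raise ValueError(f"footnote {position} is missing or misnumbered")
    # The number of references must exactly match the number of footnotes.
    found = main_text.count(PLACEHOLDER)
    if found != expected_count:
        raise ValueError(f"expected {expected_count} references, found {found}")
    counter = itertools.count(1)
    return re.sub(re.escape(PLACEHOLDER), lambda _: f"[{next(counter)}]", main_text)
```

For example, `number_placeholders("Le roi[*] partit[*].", ["1. a", "2. b"], 2)` yields `"Le roi[1] partit[2]."`, while any count mismatch raises instead of silently producing wrong numbers.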
I'm curious: did you also check whether the signature mark was indeed found every 16 pages? Were there any scans missing?
Great project btw!
And in fact, there is a hiatus, because the introduction at the beginning is from a different "sub-book", where the pages are numbered using roman numerals. Typically the introduction would be written and typeset after the main book had been typeset, so its number of pages would not be known in advance and that's why it uses a different numbering system.
So one finds a signature mark on pages 9, 25, 41, 57, 73, 89, and then it starts again at page 93, 109, 125, 141, 157, 173, 189, etc. (those numbers come from the filenames of the scans, not the numbers printed on the pages).
=> Another reason for not starting with the first signature mark and simply adding 16 is that doing so would miss the change of sub-book (or any irregular number of pages, for any reason).
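For anyone wanting to run the same check, here is a small sketch that lists every place where two consecutive detected signature marks are not exactly 16 scans apart. The function name is made up for this example; the page numbers are the ones quoted above.

```python
def signature_gaps(pages, step=16):
    """Return (previous, current, gap) for consecutive marks not `step` pages apart."""
    ordered = sorted(pages)
    return [
        (prev, cur, cur - prev)
        for prev, cur in zip(ordered, ordered[1:])
        if cur - prev != step
    ]


marks = [9, 25, 41, 57, 73, 89, 93, 109, 125, 141, 157, 173, 189]
print(signature_gaps(marks))  # [(89, 93, 4)]
```

The single irregular gap, between scans 89 and 93, is the switch from the introduction to the main book; a gap of 32 would instead point to a missed mark or a missing scan.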
Pros:
- no setup
- WMF foots the bill
Cons:
- you have to deal with the fragile egos of wikijerks
.. But there's no harm in trying. At the very least it could be done in conjunction with traditional OCR to check for whole sentences of pure invention.
For your use-case it would be exponentially easier, as all you'd need is to provide Gemini with your "zones" as the JSON schema for output, and it will quite reliably identify them.
OCR seems good, on par with Google Vision.
But the footnotes are not properly identified on most pages; they are properly identified when there is a large gap and the first line of the footnotes starts with a number; but when the footnotes block starts with text (continuing a footnote from a previous page) and/or the gap is small or almost non-existent, it fails (all text on the page is considered belonging to main text).
But the main problem isn't even that; it's that it takes between 10 and 20 seconds per page. At the upper end, that would mean over three hours per volume of 600 pages (600 × 20 s = 12,000 s, about 3.3 hours). Google Vision takes less than one second per page.
It's possible there is a setup cost and that doing batches or even full PDFs would be better, though. Do you have experience with this? And can you maybe share "prompt secrets" that would improve the results...?
- Please message me for more info if you like as this should be something LLMs can handle with ease.
This was my prompt (except for the URLs I provided the actual JPG, one as the example and the other the page to process). It took 2 seconds:
"
Extract the text from this page, correctly identifying the different zones the text appears in.
Pages contain the following possible zones (colors refer to the image)
-header (blue)
-comments in the margins (green)
-main text (pink)
-footnotes (yellow)
-signature mark (purple)
{https://blog.medusis.com/assets/38/ss_example_zones_colors.j...}
Extracting the text means correctly parsing the words in those different zones and reconstructing them properly, so as to produce readable text.
- Many pages have comments or subtitles in the margins that don't belong in the main text
- There are extensive footnotes that need to be properly tagged since they're not part of the main text
- Each page has a header that you might need to remove for smooth reading across pages
- On some pages, but not all, there's a 'signature mark' at the bottom. These need to be captured.
So you need to process the output to properly identify all these elements and mark paragraphs correctly.
{https://divers.medusis.net/boislisle/pub/tomes/t01/out/cropp...}
"
This is the JSON schema I told it to output in:
{ "type": "OBJECT", "properties": { "header": { "type": "string", "nullable": false }, "main_text": { "type": "STRING", "nullable": false }, "footnotes": { "type": "array", "nullable": true, "items": { "type": "string" } }, "comments": { "type": "STRING", "nullable": true }, "signature_mark": { "type": "STRING", "nullable": true } }, "required": [ "header", "main_text", "footnotes", "comments", "signature_mark" ] }
This was the output:
{ "comments": "Guimaraes, son caractère et ses mœurs.", "footnotes": [ "1. Voyez une lettre du général Marquis de Saint-Simon, dans le Moniteur, du 18 août 1838. — Cet ouvrage, cessionnaire de Boisange, eut vingt et un volumes in-8° par Sautel, publiés par Delloye, et celle de 1883, publiée par les frères Ducharne, quarante volumes in-18.", "2. L'édition de 4820-4830, la publiée de l'édition de 1840.", "3. Mémoires complets et authentiques du duc de Saint-Simon sur le règne de Louis XIV, et la Régence, collationnés sur le manuscrit ori- ginal par M. Chéruel, et précédés d'une notice par Sainte-Beuve, de l'Académie française. — Paris, 1856, in-8° de 1840 pages. — Cette édition est imprimée en deux volumes, sans faute, et avec une exactitude parfaite, en raison des volumes de 1861 ; un autre, dans le format in-42, ac- compagné de dix-un.", "4. En treize volumes. — Un premier tirage, sans le concours ; un troi- sième, dans le format in-18, en 1883, et un quatrième en 1865, dans le format in-16.", "5. Cette maison venait d'inaugurer sa Bibliothèque des chemins de fer, qui contribua beaucoup au succès de cette publication.", "6. Sa propriété est particulièrement confirmée par des arrêts anté- rieurs à l'acquisition ; l'un du tribunal de première instance de Paris en date du 8 juin 1856, un autre de la Cour d'appel en date du 8 fé-" ], "header": "MEMOIRES DE SAINT-SIMON.", "main_text": "ce manuscrit, en y pratiquant toutefois ce qu'il appelait « les corrections et les retranchements indispensables ». Outre cette première édition, datée de 1829-1830, les Mémoires complets et authentiques du duc de Saint-Si- mon sur le siècle de Louis XIV et la Régence furent deux fois réimprimés par les soins du général de Saint-Simon en 1840 et 1856, avant que M. 
Chéruel obtint de faire l'édition de 1856, que depuis lors, on a considéré, non sur l'original une nouvelle revision ou d'ont sorties sans raison, comme édition principale, et plusieurs réim- pressions successives du texte sec, en moindre format, toutes faites par la maison Hachette³, qui devint propriétaire du manuscrit des Mémoires.", "signature_mark": null }
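A response in this shape can be sanity-checked mechanically before anyone reads it. The sketch below is only an assumption about how one might do that (the function name and the rule that references appear as Unicode superscript digits are mine): it checks that the keys required by the schema are present and that the number of superscript references in the main text matches the number of footnotes. In the output above, six footnotes are listed while the main text appears to carry a single superscript reference, which is the kind of mismatch this would flag.

```python
import json
import re

REQUIRED_KEYS = ("header", "main_text", "footnotes", "comments", "signature_mark")
SUPERSCRIPT_RUN = re.compile(r"[⁰¹²³⁴⁵⁶⁷⁸⁹]+")


def check_response(raw):
    """Return a list of human-readable problems found in one JSON response."""
    problems = []
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        problems.append(f"missing keys: {missing}")
    footnotes = data.get("footnotes") or []
    references = SUPERSCRIPT_RUN.findall(data.get("main_text") or "")
    if len(references) != len(footnotes):
        problems.append(
            f"{len(references)} reference(s) in main text but {len(footnotes)} footnote(s)"
        )
    return problems
```

An empty list means the page passed these two checks only; it says nothing about whether the words themselves are right.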
As for segmenting the images (header/footer/table/main text) I’ve been using Abbyy and it’s generally pretty good at it. It unfortunately often fails at footnotes in much the same way as described in the post, so it won’t get you past that hurdle.
I just use ChatGPT for spelling fixes (i.e. when rewriting articles). You just have to instruct it to NOT auto-rephrase the article.
In context: heh.
(I know, typo not OCR-o, but still...)