What's interesting is that I asked it to also read the background colors of the cells and it did much worse on that task.
I believe these models could be useful for a first pass if you are willing to manually review everything they output, but the failure mode is unsettling.
>OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
I ... I nailed it.
In 2025 LLMs can 'fake it' using Trilobites of memory and Petaflops. It's funny actually, like a supercomputer being emulated in real time on a really fast Jacquard loom. By 2027 even simple handheld-calculator addition will be billed in kilowatt-hours.
Also shows a way to do that fast:
“ First, he wrote assembly language routines to isolate the bounding box of each character in the selected range. Then he computed a checksum of the pixels within each bounding box, and compared them to a pre-computed table that was made for each known font, only having to perform the full, detailed comparison if the checksum matched.”
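In rough Python, that checksum-first trick looks something like the sketch below. The bounding boxes and the per-font glyph table are assumed to come from elsewhere; all names here are made up for illustration.

```python
# Sketch of the checksum-first matching idea quoted above. Assumes you already
# have per-character bounding boxes and a precomputed table for each known font.
from PIL import Image
import numpy as np

def glyph_checksum(page: Image.Image, box: tuple) -> int:
    """Cheap fingerprint of the pixels inside one character's bounding box."""
    pixels = np.asarray(page.crop(box).convert("1"), dtype=np.uint64)
    # Weight by position so shuffled pixels rarely produce the same sum.
    weights = np.arange(1, pixels.size + 1, dtype=np.uint64).reshape(pixels.shape)
    return int((pixels * weights).sum() % (1 << 32))

def match_glyph(page, box, font_table):
    """font_table maps checksum -> list of (char, reference_bitmap) candidates."""
    crop = np.asarray(page.crop(box).convert("1"), dtype=np.uint64)
    for char, reference in font_table.get(glyph_checksum(page, box), []):
        # Full, detailed comparison only when the cheap checksum already matched.
        if reference.shape == crop.shape and np.array_equal(reference, crop):
            return char
    return None  # fall back to a slower/fuzzier recognizer
```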
OCR'ing a fixed, monospaced font from a pristine piece of paper really is "solved." It's all the nasties of the real world that are the issue.
As I mockingly demonstrated, kerning, character similarity, grammar, and lexing all present large and hugely time-consuming problems in exactly the processes where OCR is most useful.
On the other hând, LLm5 are sl0wwer, moré resource hangry and l3ss accurale fr their outpu1z.
We shoulD stop gl0rıfying LLMs for 3verylhin9.
I'm not saying this applies to you, but my sense from this thread is that many are comparing the results of tossing an image into a free ChatGPT session with an "OCR this document" prompt to a competent Tesseract-based tool... LLMs certainly don't solve any and every problem, but this should be based on real experiments. In fact, OCR is probably the main area where I've found them to simply be the best solution for a professional system.
But there's also a ton of "I don't want to deal with this" type work items that can't justify a full workflow build-out, but that LLMs get near enough to perfect to be "good enough." The bad part is, the LLMs don't explain to people the kinds of mistakes to expect from them.
Trilobites? Those were truly primitive computers.
- It transcribed all of the text, including speech, labels on objects, onomatopoeias in actions, etc. I did notice a kana was missing a diacritic in a transcription, so the transcriptions were not perfect, but pretty close actually. To my eye all of the kanji looked right. Latin characters already OCR pretty well, but at least in my experience other languages can be a struggle.
- It also, unprompted, correctly translated the fairly simple Japanese to English. I'm not an expert, but the translations looked good to me. Gemini 2.5 did the same, and while it had a slightly different translation, both of them were functionally identical, and similar to Google Translate.
- It also explained the jokes, the onomatopoeias, etc. To my ability to verify these things they seemed to be correct, though notably the Japanese onomatopoeias used for actions in comics are pretty diverse and not necessarily super well-documented. But contextually it seemed right.
To me this is interesting. I don't want to anthropomorphize the models (at least unduly, though I am describing the models as if they chose to do these things, since it's natural to do so) but the fact that even relatively small local models such as Gemma can perform tasks like this on arbitrary images with handwritten Japanese text bodes well. Traditional OCR struggles to find and recognize text that isn't English or is stylized/hand-written, and can't use context clues or its own "understanding" to fill in blanks where things are otherwise unreadable; at best they can take advantage of more basic statistics, which can take you quite far but won't get you to the same level of proficiency at the job as a human. vLLMs however definitely have an advantage in the amount of knowledge embedded within them, and can use that knowledge to cut through ambiguity. I believe this gets them closer.
I've messed around with using vLLMs for OCR tasks a few times primarily because I'm honestly just not very impressed with more traditional options like Tesseract, which sometimes need a lot of help even just to find the text you want to transcribe, depending on how ideal the case is.
On the scale of AI hype bullshit, the use case of image recognition and transcription is damn near zero. It really is actually useful here. Some studies have shown that vLLMs are "blind" in some ways (in that they can be made to fail by tricking them, like Photoshopping a cat to have an extra leg and asking how many legs the animal in the photo has; in this case the priors of the model from its training data work against it) and there are some other limitations (I think generally when you use AI for transcription it's hard to get spatial information about what is being recognized, though I think some techniques have been applied, like recursively cutting an image up and feeding it to try to refine bounding boxes) but the degree to which it works is, in my honest opinion, very impressive and very useful already.
I don't think that this demonstrates that basic PDF transcription, especially of cleanly-scanned documents, really needs large ML models... But on the other hand, large ML models can handle both easy and hard tasks here pretty well if you are working within their limitations.
Personally, I look forward to seeing more work done on this sort of thing. If it becomes reliable enough, it will be absurdly useful for both accessibility and breaking down language barriers; machine translation has traditionally been a bit limited in how well it can work on images, but I've found Gemini, and surprisingly often even Gemma, can make easy work of these tasks.
I agree these models are inefficient; traditional OCR aside, our brains do similar tasks while burning less electricity and ostensibly needing less training data (at least certainly less text) to do it. It certainly must be physically possible to make more efficient machines that can do these tasks with similar fidelity to what we have now.
For each page:
- Extract text as usual.
- Capture the whole page as an image (~200 DPI).
- Optionally extract images/graphs within the page and include them in the same LLM call.
- Optionally add a bit of context from neighboring pages.
Then wrap everything with a clear prompt (structured output + how you want graphs handled), and you’re set.
At this point, models like GPT-5-nano/mini or Gemini 2.5 Flash are cheap and strong enough to make this practical.
Yeah, it’s a bit like using a rocket launcher on a mosquito, but this is actually very easy to implement and quite flexible and powerful. It works across almost any format, Markdown is both AI- and human-friendly, and the result is surprisingly maintainable.
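For the curious, here's a minimal sketch of that per-page loop, using PyMuPDF for rendering and the OpenAI SDK for the model call. The model name and prompt wording are placeholders, not a tested recipe:

```python
# Per-page loop: extract the text layer, render the page at ~200 DPI,
# and send both to a cheap vision model for Markdown output.
import base64
import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()

PROMPT = ("Convert this page to clean Markdown. Use the extracted text below as a hint. "
          "Describe any charts or figures in a short note.\n\nExtracted text:\n{raw_text}")

def pdf_to_markdown(pdf_path: str) -> list[str]:
    pages_md = []
    for page in fitz.open(pdf_path):
        raw_text = page.get_text()                      # text layer, if any
        png = page.get_pixmap(dpi=200).tobytes("png")   # whole page as an image
        b64 = base64.b64encode(png).decode()
        resp = client.chat.completions.create(
            model="gpt-5-mini",  # or a Gemini Flash model via its own SDK
            messages=[{"role": "user", "content": [
                {"type": "text", "text": PROMPT.format(raw_text=raw_text)},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]}],
        )
        pages_md.append(resp.choices[0].message.content)
    return pages_md
```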
It all depends on the scale at which you need them; with the API it's easy to generate millions of tokens without thinking.
I can recommend the Mistral OCR API [1] if you have large jobs and don't want to think about it too much.
Here is the corresponding GitHub issue for your default model (Qwen2.5-VL):
https://github.com/QwenLM/Qwen2.5-VL/issues/241
You can mitigate the fallout of this repetition issue to some degree by chopping up each page into smaller pieces (paragraphs, tables, images, etc.) with a page layout model. Then at least only part of the text is broken instead of the entire page.
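Roughly, assuming you already have region boxes from whatever layout model you use (both callables below are stand-ins for your own layout detector and single-image VLM call):

```python
# Transcribe a page region by region so a repetition loop only corrupts
# one region instead of the whole page.
from PIL import Image

def ocr_page_in_pieces(page_png: str, detect_regions, transcribe) -> str:
    page = Image.open(page_png)
    parts = []
    # detect_regions(page) is assumed to yield (left, top, right, bottom)
    # boxes in reading order, one per paragraph/table/figure.
    for box in detect_regions(page):
        parts.append(transcribe(page.crop(box)))  # one VLM call per region
    return "\n\n".join(parts)
```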
A better solution might be to train a model to estimate a heat map of character density for a page of text. Then, condition the vision-language model on character density by feeding the density to the vision encoder. Also output character coordinates, which can be used with the heat map to adjust token probabilities.
I'm personally on the lookout for the absolute best possible multilingual OCR performance, local or not, whatever it costs (almost).
I’d just prefer that any images and diagrams are copied over, and rendered into a popular format like markdown.
Seems to weigh about 6 GB, which feels reasonable to manage locally.
https://github.com/ocrmypdf/OCRmyPDF
No LLMs required.
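For anyone who hasn't tried it, basic usage really is a one-liner, via either the CLI or the Python API; filenames below are placeholders:

```python
# Equivalent to the CLI: ocrmypdf scanned.pdf searchable.pdf
# Rasterizes each page, runs Tesseract, and writes the recognized text back
# into the PDF as a hidden, searchable text layer.
import ocrmypdf

ocrmypdf.ocr("scanned.pdf", "searchable.pdf")
```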
Python: PyPDF2, pdfminer.six, GROBID, PyMuPDF; pytesseract (a wrapper around the C++ Tesseract engine)
paperetl is built on grobid: https://github.com/neuml/paperetl
annotateai: https://github.com/neuml/annotateai :
> annotateai automatically annotates papers using Large Language Models (LLMs). While LLMs can summarize papers, search papers and build generative text about papers, this project focuses on providing human readers with context as they read.
pdf.js-hypothes.is: https://github.com/hypothesis/pdf.js-hypothes.is :
> This is a copy of Mozilla's PDF.js viewer with Hypothesis annotation tools added
Hypothesis is built on the W3C Web Annotations spec.
dokieli implements W3C Web Annotations and many other Linked Data Specs: https://github.com/dokieli/dokieli :
> Implements versioning and has the notion of immutable resources.
> Embedding data blocks, e.g., Turtle, N-Triples, JSON-LD, TriG (Nanopublications).
A dokieli document interface to LLMs would be basically the anti-PDF.
Rust crates: rayon handles parallel processing, pdf-rs, tesseract (C++)
pdf-rs examples/src/bin/extract_page.rs: https://github.com/pdf-rs/pdf/blob/master/examples/src/bin/e...
It's basically an SQL wrapper around poppler.
That said, over the last two years I've come across many use cases for parsing PDFs, and each has its own requirements (e.g., figuring out titles, removing page numbers, extracting specific sections). And each requires a different approach.
My point is, this is awesome, but I wonder if there needs to be a broader push or initiative to stop leaning on PDFs so much when things like HTML, XML, JSON and a million other formats exist. It's a hard undertaking, no doubt, but it's not unheard of to drop a technology (e.g., fax) for a better one.
I have a very similar Go script that does this. My prompt: Create a CSV of the handwritten text in the table. Include the package number on each line. Only output a CSV.
But that is something I will use for sure. Thank you.
No, it isn’t.
Does anyone have a suggestion for locally converting PDFs of handwriting into text, say on a recent Mac? Use case would be converting handwritten journals and daily note-taking.
1. https://github.com/pnshiralkar/text-to-handwriting/blob/mast...
FYI, your GitHub link tells me it's unable to render because the pdf is invalid.
My iPhone 8 Refuses to Die: Now It's a Solar-Powered Vision OCR Server
UPDATE: I just tried this with the default model on handwriting, and IT WORKED. Took about 5-10 minutes on my laptop, but it worked. I am so thrilled not to have to send my personal jottings into the cloud!
Also, watch out, it seems the weights do not carry a libre license https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main...
LLMWhisperer (from Unstract), Docling (IBM), Marker (Surya OCR), Nougat (Facebook Research), LlamaParse.
TRANSCRIPTION_PROMPT = """Task: Transcribe the page from the provided book image.
- Reproduce the text exactly as it appears, without adding or omitting anything.
- Use Markdown syntax to preserve the original formatting (e.g., headings, bold, italics, lists).
- Do not include triple backticks (```) or any other code block markers in your response, unless the page contains code.
- Do not include any headers or footers (for example, page numbers).
- If the page contains an image, or a diagram, describe it in detail. Enclose the description in an <image> tag. For example:
<image> This is an image of a cat. </image>
"""
The URL to connect to Ollama seems to just be hard-coded, so I don't see why you couldn't point this at a different machine on your network rather than running Ollama locally on every machine you need this for, as the readme implies.
It is hype-compatible so it is good.
It is AI so it is good.
It is blockchain so it is good.
It is cloud so it is good.
It is virtual so it is good.
It is UML so it is good.
It is RPN so it is good.
It is a steam engine so it is good.
Yawn...
It's not.