- Qwen 2.5 VL (72b and 32b)
- Gemma-3 (27b)
- DeepSeek-v3-0324
And a couple of weeks ago we got the new mistral-ocr model. We updated our OCR benchmark to include the new models.
We evaluated 1,000 documents for JSON extraction accuracy. Major takeaways:
- Qwen 2.5 VL (72b and 32b) are by far the most impressive. Both landed right around 75% accuracy (equivalent to GPT-4o’s performance). Qwen 72b was only 0.4% above 32b, which is within the margin of error.
- Both Qwen models outperformed mistral-ocr (72.2%), which is specifically trained for OCR.
- Gemma-3 (27B) only scored 42.9%. This is particularly surprising given that its architecture is based on Gemini 2.0, which still tops the accuracy chart.
The data set and benchmark runner are fully open source. You can check out the code and reproduction steps here:
- https://getomni.ai/blog/benchmarking-open-source-models-for-...
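To make "JSON extraction accuracy" concrete, here is a rough, illustrative sketch of field-level scoring in Python. This is not the benchmark's actual metric (the repo linked above defines that); the function and example documents are made up purely to show the idea.

```python
def json_accuracy(expected: dict, predicted: dict) -> float:
    """Fraction of expected leaf values reproduced exactly.
    Illustrative stand-in, not the benchmark's real metric."""
    def leaves(obj, prefix=""):
        # Flatten nested dicts/lists into (dotted-path, value) pairs.
        if isinstance(obj, dict):
            for k, v in obj.items():
                yield from leaves(v, f"{prefix}{k}.")
        elif isinstance(obj, list):
            for i, v in enumerate(obj):
                yield from leaves(v, f"{prefix}{i}.")
        else:
            yield prefix.rstrip("."), obj

    expected_leaves = dict(leaves(expected))
    predicted_leaves = dict(leaves(predicted))
    if not expected_leaves:
        return 1.0
    hits = sum(1 for k, v in expected_leaves.items() if predicted_leaves.get(k) == v)
    return hits / len(expected_leaves)

# Example: two of three fields match -> 0.666...
print(json_accuracy(
    {"invoice": {"number": "123", "date": "2024-01-05", "total": 98.5}},
    {"invoice": {"number": "123", "date": "2024-01-05", "total": 99.0}},
))
```

Counting exact leaf matches is the simplest possible variant; the actual runner presumably handles fuzzy matches and nested arrays more carefully.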
Qwen2.5-VL-72b was released two months ago (to little fanfare in submissions, I think, but with some very enthusiastic comments, such as rabid enthusiasm for its handwriting recognition) and was already very interesting. It's actually one of the releases that kind of turned me on to AI, that broke through some of my skepticism and grumpiness. There are pretty good release notes detailing its capabilities here; well done blog post. https://qwenlm.github.io/blog/qwen2.5-vl/
One thing that really piqued my interest was Qwen's HTML output, where it can provide bounding boxes in HTML format alongside the text. That really closes the loop for me: it makes the output something I can imagine quickly building useful visual feedback around, or easily consuming as structured data. I can't imagine an easier-to-use output format.
For the applications I'm interested in, until we can get to 95+% accuracy, it will require human double-checking / corrections, which seems infeasible without bounding boxes to quickly check for errors.
There's also a paper https://arxiv.org/pdf/2409.12191 where they explicitly say some of their training included bounding boxes and coordinates.
https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/ocr...
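To make the bounding-box idea concrete: assuming the HTML output carries coordinates in something like a data-bbox attribute (the exact tag and attribute convention is an assumption here, so check the cookbook above for the real format), pulling it into structured data is only a few lines.

```python
import re

# Hypothetical Qwen-style HTML output; the tag/attribute layout is an
# assumption, not the model's documented format.
html = ('<p data-bbox="34,50,412,78">INVOICE #123</p>'
        '<p data-bbox="34,92,180,110">Date: 2024-01-05</p>')

boxes = [
    {"bbox": [int(n) for n in m.group(1).split(",")], "text": m.group(2)}
    for m in re.finditer(r'data-bbox="([\d,]+)"[^>]*>([^<]*)<', html)
]
print(boxes)
# -> [{'bbox': [34, 50, 412, 78], 'text': 'INVOICE #123'}, ...]
```

From there it's a short step to drawing the boxes over the page image for the human-review pass described above.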
I haven't done it with OCR tasks, but I have fine-tuned other models to produce bounding boxes instead of merely producing descriptive text. I'm not sure if there are datasets for this already, but creating one shouldn't be very difficult.
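As a sketch of what one record in such a dataset could look like (the field names and coordinate format are assumptions, not any standard schema):

```python
import json

# One hypothetical training record for teaching a VLM to emit boxes with text.
record = {
    "image": "scans/page_0001.png",
    "prompt": "Read the document and return each line with its bounding box.",
    "target": [
        {"text": "INVOICE #123", "bbox": [34, 50, 412, 78]},
        {"text": "Date: 2024-01-05", "bbox": [34, 92, 180, 110]},
    ],
}
with open("bbox_finetune.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```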
Not sure if it matters, but I exported a PDF page as a PNG at 200 dpi and used that.
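For anyone reproducing that step, a minimal sketch with pdf2image (which wraps Poppler's pdftoppm; the file names are placeholders and Poppler must be installed on the system):

```python
from pdf2image import convert_from_path

# Render page 1 of the PDF at 200 dpi and save it as a PNG.
pages = convert_from_path("document.pdf", dpi=200, first_page=1, last_page=1)
pages[0].save("page1.png")
```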
It seems like it's reading the text but getting the details wrong.
I would not be comfortable using this in an official capacity without more accuracy. I could see using this for words that another OCR system is uncertain about, though, as a fallback.
High level results were:
- Qwen 32b => $0.33/1000 pages => 53s/page
- Qwen 72b => $0.71/1000 pages => 51s/page
- Llama 90b => $8.50/1000 pages => 44s/page
- Llama 11b => $0.21/1000 pages => 08s/page
- Gemma 27b => $0.25/1000 pages => 22s/page
- Mistral => $1.00/1000 pages => 03s/page
E.g. if you look at https://openrouter.ai/models?order=pricing-high-to-low, you'll see that there are some 7B and 8B models that are more expensive than Claude 3.7 Sonnet.
Their theory is that they can raise prices once their competitors go out of business. The companies open-sourcing pretrained models are countering that. So we see a mix of huge models underpriced by scheming companies and open-source models priced for inference on free-market principles.
I think that in order to run a proper cost comparison, we would need to run each model on an AWS GPU instance and compare the runtime required.
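As a back-of-envelope sketch of that comparison (the hourly rate below is a made-up placeholder; the per-page latency is taken from the list above):

```python
# Rough self-hosting cost per 1,000 pages on a rented GPU instance.
hourly_rate = 5.00      # $/hour for a hypothetical GPU instance (placeholder)
seconds_per_page = 53   # e.g. the Qwen 32b single-stream latency above
cost_per_1000_pages = hourly_rate / 3600 * seconds_per_page * 1000
print(f"${cost_per_1000_pages:.2f} per 1,000 pages")  # ~$73.61 with these numbers
```

A single-stream number like this overstates the real cost, of course, since a proper run would batch many pages concurrently on the same GPU.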
In my workflows I often have multiple models competing side-by-side, so I get to compare the same task executed on, say, 4o, Gemini, and Qwen. And I deal with a very wide range of vision related tasks. The newest Qwen models are not only overall better than their previous release by a good margin, but also much more stable (less prone to glitching) and easier to finetune. I'm not at all surprised they're topping the OCR benchmark.
What bugs me, though, is OpenAI. Outside of OCR, 4o is still king in terms of overall understanding of images. But 4o is now almost a year old, and in all that time they have neither improved vision performance in any newer release nor improved OCR. OpenAI's OCR has been bad for a long time, and it's both odd and annoying.
Take it with a grain of salt since, again, I've only had it in my workflow for about a week or two, but I'd say Qwen 2.5 VL 72b beats Gemini for general vision. That lands it in second place for me. And it can be run _locally_. That's nuts. I'm going to laugh if Qwen drops another iteration in a couple of months that beats 4o.
Is there an advantage of using an LLM here?
There are some comments I've run across saying Qwen2.5-VL is really good at handwriting recognition.
It'd also be interesting to see how Tesseract compares when trying to OCR more mixed text+graphic media. Some possible examples: high-design magazines with color backgrounds, TikTok posts, maps, cardboard hold-up signs at political gatherings.
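For the Tesseract side of such a comparison, the baseline is only a couple of lines with pytesseract (the image path is a placeholder; the Tesseract binary has to be installed separately):

```python
from PIL import Image
import pytesseract

# Plain Tesseract OCR on a mixed text+graphics page for comparison.
text = pytesseract.image_to_string(Image.open("magazine_page.png"))
print(text)
```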
Overall, it's very impressive, but it makes some mistakes (on easy images, i.e. obviously wrong ones) that require human intervention.
I would like to compare it to these models, but this benchmark goes beyond OCR: it extracts structured JSON.
(I would recommend the latter)
I have a prompt that works for a single file in Copilot, but it's slower than opening the file, looking at it to find the one specific piece of information, re-saving it manually, running a .bat file to rename it with more of the information, and then filling in the last two bits when entering things.
Let me rephrase:
What locally-hosted LLM would be suited to batch processing image files?
All of these models are open source (I think?). They could presumably build their work on any of these options. It behooves them to pick well. And establish some authority along the way.
(It is not a joke.)
If they (all of the mentioned ones) are open source and can be run locally, then most likely, yes.
From what I remember, they are all local and open source, so the answer is yes, if I am correct.
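If it helps, here is a minimal sketch of what local batch processing could look like against an Ollama server (the model tag, prompt, and folder are placeholders; any vision-capable model you have pulled locally would slot in):

```python
import base64, glob, json, urllib.request

# Loop over a folder of scans and send each one to a locally running Ollama
# server. Assumes Ollama is up on the default port with a vision model pulled.
for path in sorted(glob.glob("scans/*.png")):
    with open(path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "qwen2.5vl:7b",  # placeholder tag; substitute your local vision model
        "prompt": "Extract the document title and date as JSON.",
        "images": [img_b64],
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(path, json.loads(resp.read())["response"])
```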
Update: looks like they removed themselves from the graph since I saw it earlier today!
The beauty of version control: https://github.com/getomni-ai/benchmark/commit/0544e2a439423...