Not quite. Serverless means you can run a server permanently, but you need to pay someone else to manage the infrastructure for you.
https://github.com/zai-org/GLM-OCR
(Shameless plug: I also maintain a simplified version of GLM-OCR without a dependency on the transformers library, which makes it much easier to install: https://github.com/99991/Simple-GLM-OCR/)
I do agree with the use of serverless though. I feel like we agreed long ago that serverless just means you're not spinning up a physical or virtual server, but simply asking some cloud infrastructure to run your code, without having to care about how it's run.
A low LoC count is a telltale sign that the project adds little to no value. It signals that the project merely integrates third-party services and/or modules and does a little plumbing to tie things together.
'Serverless' has become a term of art: https://en.wikipedia.org/wiki/Serverless_computing
> Serverless is a misnomer
But this caught me for a bit as well. :-)
I use carless transportation (taxis).
ocrarena.ai maintains a leaderboard, and a number of other open source options like dots [1] or olmOCR [2] rank higher.
My client's use case was specific to scanning medical reports, but since there are thousands of labs in India with slightly different formats, I built an LLM agent that runs only after the PDF/image-to-text step, to double-check the medical terminology. Even then, it runs only if our code cannot already process a text line through simple string/regex matches.
There are probably far more efficient tools for much of the work we currently throw at LLMs.
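A minimal sketch of that fallback pattern: try a cheap regex first and escalate to an LLM only for lines the regex can't handle. The pattern, the function names, and the lab-line format are hypothetical illustrations, not the commenter's actual code.

import re

# Hypothetical pattern for lab-report lines like "Hemoglobin: 13.5 g/dL"
LAB_LINE = re.compile(r"^(?P<test>[A-Za-z ]+):\s*(?P<value>[\d.]+)\s*(?P<unit>\S+)$")

def extract_with_regex(line: str):
    """Cheap path: parse a line with a plain regex; None if it doesn't match."""
    m = LAB_LINE.match(line.strip())
    return m.groupdict() if m else None

def llm_verify(line: str) -> dict:
    """Expensive fallback: ask an LLM to interpret the line.
    Stub here; in practice this calls whatever LLM API you use."""
    raise NotImplementedError("call your LLM of choice")

def parse_line(line: str) -> dict:
    # Only escalate to the LLM when the simple match fails.
    return extract_with_regex(line) or llm_verify(line)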
#!/usr/bin/env bash
# requires: tesseract-ocr imagemagick maim xsel
IMG=$(mktemp)
# clean up the temp file and its .png/.txt siblings on exit
trap 'rm -f "$IMG"*' EXIT
# --nodrag means click twice instead of dragging a selection
maim -s --nodrag --quality=10 "$IMG.png"
# drop saturation (grayscale) and upscale; should increase detection rate
mogrify -modulate 100,0 -resize 400% "$IMG.png"
tesseract "$IMG.png" "$IMG" &>/dev/null
# copy the recognized text to the clipboard
xsel -bi < "$IMG.txt"
notify-send "Text copied" "$(cat "$IMG.txt")"
exit

> In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G).
That... doesn't sound legal
I like to push everything into the image as much as I can. So in the Modal image build, I would run a command to trigger downloading the model, then in the app just point to the locally downloaded model. The image is bigger, but you don't need to re-download the model on every startup.
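A minimal sketch of that pattern with Modal, assuming the weights live on the Hugging Face Hub; the repo id, local path, and function bodies are illustrative, not the commenter's exact setup.

import modal

def download_model() -> None:
    # Bake the weights into the image at build time so containers
    # don't re-download them on every cold start.
    from huggingface_hub import snapshot_download
    snapshot_download("deepseek-ai/DeepSeek-OCR", local_dir="/model")

image = (
    modal.Image.debian_slim()
    .pip_install("huggingface_hub")
    .run_function(download_model)  # runs once; result is cached in the image
)

app = modal.App("ocr-demo", image=image)

@app.function(gpu="A100")
def ocr_page(png_bytes: bytes) -> str:
    # The app just points at the locally baked copy under /model.
    ...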
Where it falls apart is complex pages. Multi-column layouts, tables, equations, handwriting. Tesseract works line-by-line with no understanding of page structure, so a two-column paper gets garbled into interleaved text. VLM-based models like DeepSeek treat the page as an image and infer structure visually, which handles those cases much better.
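For what it's worth, Tesseract does expose page-segmentation modes that help somewhat with layout. A quick way to compare them via pytesseract; the --psm values are Tesseract's own, the input file name is made up:

import pytesseract
from PIL import Image

page = Image.open("two_column_paper.png")  # hypothetical input

# --psm 1: automatic page segmentation with orientation detection;
# --psm 6: assume a single uniform block of text (often garbles columns).
for psm in (1, 6):
    text = pytesseract.image_to_string(page, config=f"--psm {psm}")
    print(f"--- psm {psm} ---\n{text[:200]}")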
For this specific use case (stats textbook with heavy math), Tesseract would really struggle with the equations. LaTeX-rendered math has unusual character spacing and stacked symbols that confuse traditional OCR engines. The author chose DeepSeek specifically because it outputs markdown with math notation intact.
The tradeoff is cost and infrastructure. Tesseract runs on your laptop for free. The author spent $2 on A100 GPU time for 600 pages. For a one-off textbook that's nothing, but at scale the difference between "free on CPU" and "$0.003/page on GPU" matters. Newer alternatives like dots and olmOCR (mentioned upthread by kbyatnal) are also worth comparing if accuracy on complex layouts is the priority.
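The break-even arithmetic is easy to sanity-check; the figures below are the ones from this thread, scaled up purely as an illustration:

gpu_dollars, pages = 2.00, 600          # figures from the comment above
per_page = gpu_dollars / pages          # ~$0.0033/page
print(f"${per_page:.4f}/page")
print(f"1M pages: ${per_page * 1_000_000:,.0f} of GPU time vs ~$0 for CPU Tesseract")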
I have 4 of these now. Some are better than others, but all worked great.
step 1 draw a circle
step 2 import the rest of the owl