ScribeOCR – Web interface for recognizing text, OCR, & creating digitized docs
66 points | 3 days ago | 5 comments | github.com
fodkodrasz
2 hours ago
I really like the idea, but unfortunately it could not cope with my use case.

I have some lecture slides as image-only PDFs (Hungarian, with a sprinkling of English and biological Latin). I tried the tool on them, and this was my experience:

- Proofreading with the overlay seems like a good idea, but it is actually unusable when the original text is colored and you need to recognize diacritic marks. Being able to show the original in grayscale or black and white could help. (B&W worked, but Grayscale left everything colored.)

- For proofreading, the ebook mode was the most useful: I immediately spotted lots of errors that I could not see with the overlay. A quick switch between the modes would be useful.

- Editing text is not efficient when the error rate is high (Hungarian is not supported, which I guess caused most of the errors); the interface has high overhead for mass corrections.

Very good idea. I think that after a little polish it would fit even my use case. For more traditional OCR use cases than mine, it is probably already great.
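
To make the grayscale / black-and-white suggestion concrete, here is a minimal sketch of that kind of preprocessing. It is not how ScribeOCR renders its overlay; the function names, the BT.601 luma weights, and the threshold of 128 are illustrative assumptions.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255


def to_grayscale(pixels: List[List[Pixel]]) -> List[List[int]]:
    """Convert rows of RGB pixels to 8-bit luma (ITU-R BT.601 weights)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


def to_black_and_white(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Binarize a grayscale image: values below the threshold become black (0),
    everything else becomes white (255). The threshold is an assumed default."""
    return [[0 if value < threshold else 255 for value in row] for row in gray]


# Hypothetical one-row "slide": a red glyph pixel next to a white background pixel.
slide = [[(255, 0, 0), (255, 255, 255)]]
gray = to_grayscale(slide)
binary = to_black_and_white(gray)
```

With colored text flattened to dark pixels like this, small diacritic marks should no longer compete with the overlay color, which is the effect the B&W mode seemed to have.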

zihotki
3 hours ago
According to the documentation, it uses Tesseract underneath. I used Tesseract v3 in the past and it was a pain. Tesseract 4 uses an LSTM neural net. How good are the performance and recognition quality in v4 nowadays? Could anyone share their experience?
graynk
1 hour ago
I use paperless-ngx for digitizing all my documents; it also uses Tesseract. The result is not perfect, but more than acceptable if I scan at 600 dpi.
aidenn0
4 hours ago
This is my first encounter with Scribe.js; since I have many book scans, I always try OCRing them whenever I see a tool like this. Compared to Tesseract (which is the best I have so far), it gets the words right slightly more often, but the paragraph segmentation is many times worse. On a book where every paragraph is indented, it reliably decides that two consecutive one-line paragraphs are the same paragraph. That is understandable, but it is a downgrade from Tesseract, which gets the paragraph segmentation as correct as possible. (It doesn't handle paragraphs that span page breaks, since I'm feeding it one page at a time.)
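
For illustration, indentation-based paragraph splitting can be sketched as below. This is not Scribe's or Tesseract's actual segmentation logic; the `(left_x, text)` line representation, the `split_paragraphs` name, and the 5-unit tolerance are all assumptions made for the example.

```python
from typing import List, Tuple

Line = Tuple[float, str]  # (x-coordinate of the line's left edge, recognized text)


def split_paragraphs(lines: List[Line], indent_tolerance: float = 5.0) -> List[List[str]]:
    """Group OCR lines into paragraphs, starting a new paragraph at every
    line that is indented relative to the page's left text margin."""
    if not lines:
        return []
    # Assume the left margin is the smallest left edge seen on the page.
    margin = min(x for x, _ in lines)
    paragraphs: List[List[str]] = []
    for x, text in lines:
        is_indented = (x - margin) > indent_tolerance
        if is_indented or not paragraphs:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return paragraphs


# Hypothetical page: two one-line paragraphs, then a paragraph that wraps.
page = [
    (20.0, "A one-line paragraph."),
    (20.0, "Another one-line paragraph."),
    (20.0, "A longer paragraph that"),
    (0.0, "wraps onto a second line."),
]
paragraphs = split_paragraphs(page)
```

Under this rule the two consecutive one-line paragraphs stay separate because each begins with an indent. The sketch does have an obvious weakness: if every line on a page is indented, the estimated margin equals the indent and no breaks are detected.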
zihotki
3 hours ago
Scribe is Tesseract. It uses tesseract.js, which is a WebAssembly port of Tesseract, so in theory they should give equal results. In practice, custom settings or older versions could make a difference.
Elucalidavah
3 hours ago
> Tesseract (which is the best I have so far)

Have you looked at EasyOCR?

constantinum
2 hours ago
Anyone looking for an OCR tool or text pre-processor that maintains layout (tables, forms) could try LLMWhisperer: https://pg.llmwhisperer.unstract.com/
ranger_danger
6 hours ago
This is awesome. The only issue was that I had to disable my JShelter extension, because it would freeze the page at 100% CPU forever.