What it does:
- Extracts specific data from PDFs based on your custom schema
- Returns clean, structured JSON that's ready to use
- Works with just a PDF link + your schema definition
Just run npm install documind to get started.
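To make the "schema in, JSON out" idea concrete, here is a minimal sketch of checking extracted JSON against a schema. The schema shape, the `checkAgainstSchema` helper, and the sample values are all hypothetical and made up for illustration; this is not Documind's documented format or API.

```javascript
// Hypothetical schema shape for illustration; not Documind's documented format.
const invoiceSchema = [
  { name: "invoiceNumber", type: "string" },
  { name: "total", type: "number" },
  { name: "lineItems", type: "array" },
];

// Return a list of problems found when comparing extracted JSON to the schema.
function checkAgainstSchema(schema, data) {
  const problems = [];
  for (const field of schema) {
    const value = data[field.name];
    if (value === undefined || value === null) {
      problems.push(`missing field: ${field.name}`);
      continue;
    }
    // typeof reports arrays as "object", so handle them separately.
    const actual = Array.isArray(value) ? "array" : typeof value;
    if (actual !== field.type) {
      problems.push(`${field.name}: expected ${field.type}, got ${actual}`);
    }
  }
  return problems;
}

// Made-up example of what an extractor might return (note the string total).
const extracted = { invoiceNumber: "INV-001", total: "120.50", lineItems: [] };
console.log(checkAgainstSchema(invoiceSchema, extracted));
```

A check like this catches type drift (a total returned as a string) but says nothing about whether the values match the source document.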
1) Install tools like Ghostscript, GraphicsMagick, and LibreOffice with a JS script.
2) Convert document pages to Base64 PNGs and send them to OpenAI for data extraction.
3) Use Supabase for unclear reasons.
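Step 2 can be sketched roughly as follows. This is not the project's actual code: the function names are hypothetical, the page is assumed to be already rendered to a PNG buffer by an earlier step, and the model name is just one of the vision-capable models that accept image inputs.

```javascript
// Build a Chat Completions request body for one page image.
// `pngBuffer` is a page already rendered to PNG by an earlier step (not shown).
// buildPageRequest / extractPage are hypothetical names for this sketch.
function buildPageRequest(pngBuffer, systemPrompt, model = "gpt-4o-mini") {
  const base64Png = pngBuffer.toString("base64");
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          // Images are passed inline as a data URL.
          { type: "image_url", image_url: { url: `data:image/png;base64,${base64Png}` } },
        ],
      },
    ],
  };
}

// Send the request. Needs OPENAI_API_KEY in the environment and Node 18+ (global fetch).
async function extractPage(pngBuffer, systemPrompt) {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify(buildPageRequest(pngBuffer, systemPrompt)),
  });
  if (!response.ok) throw new Error(`OpenAI request failed: ${response.status}`);
  const json = await response.json();
  return json.choices[0].message.content;
}
```

Keeping the request builder separate from the network call makes it easy to inspect exactly what leaves your machine before anything is sent.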
Some issues with this approach:
* OpenAI may retain and use your data for training, raising privacy concerns [1].
* Dependencies should be managed with Docker or package managers like Nix or Pixi, which are more robust. Example: a tool like Parsr [2] provides a Dockerized PDF-to-JSON solution, complete with OCR support and an HTTP API.
* GPT-4 vision seems like a costly, error-prone, and unreliable solution that is not really suited to extracting data from sensitive docs like invoices without review.
* Traditional methods (PDF parsers with OCR support) are cheaper, more reliable, and avoid retention risks for this particular use case. Although these tools do require some plumbing... probably LLMs can really help with that!
While there are plenty of tools for structured data extraction, I think there’s still room for a streamlined, all-in-one solution. This gap likely explains the abundance of closed-source commercial options tackling this very challenge.
---
1: https://platform.openai.com/docs/models#how-we-use-your-data
If you inspect the source code, it's a verbatim copy. They literally just renamed ZeroxOutput to DocumindOutput [2][3].
[1] https://github.com/getomni-ai/zerox
[2] https://github.com/DocumindHQ/documind/blob/main/core/src/ty...
[3] https://github.com/getomni-ai/zerox/blob/main/node-zerox/src...
That's pretty unethical behavior if what you describe is the full story. As a user of many open source projects, how can one be aware of this type of behavior?
I think both sides can learn from this. Copyright notices are technically not required, but they are very useful when the license text refers to one. The original author should have added one, and the user of the code could have asked about the copyright. If this were to go to court, an original license that does not make sense could raise more questions than it should.
tl;dr: add a copyright line at the top of the file when you’re using the MIT license.
If there's anything else I can do, please let me know and I will make all amendments immediately.
You're going to have to delete this thing and start over man.
If you're looking for an all-in-one solution, little plug for our new platform that does the above and also allows you to create custom 'patterns' that get picked up via semantic search. Uses open-source models by default, can deploy into your internal network. www.datafog.ai. In beta now and onboarding manually. Shoot me an email if you'd like to learn more!
"Traditional methods (PDF parsers with OCR support) are cheaper, more reliable"
Not sure on the reliability - the ones I'm using all fail at structured data. You want a table extracted from a PDF, LLMs are your friend. (Recommendations welcome)
Documind is using https://api.openai.com/v1/chat/completions, check the docs at the end of the long API table [1]:
> * Chat Completions:
> Image inputs via the gpt-4o, gpt-4o-mini, chatgpt-4o-latest, or gpt-4-turbo models (or previously gpt-4-vision-preview) are not eligible for zero retention.
--
1: https://platform.openai.com/docs/models#how-we-use-your-data
It's still not used for training, though, and the retention period is 30 days. It's... a livable compromise for some (many) use cases.
I kind of get the abuse policy reason for image inputs. It makes sense for multi-turn conversations to require a 1h audio retention, too. I'm just incredibly puzzled why schemas for structured outputs aren't eligible for zero-retention.
https://news.ycombinator.com/item?id=42178413
You may wanna get ahead of this because the evidence is fairly damning. Failing to even give credit to the original project is a pretty gross move.
I made sure to copy and paste the MIT license in Zerox exactly as it was into the folder of the code that uses it. I also included it in the main license file. If there's anything I can do to make corrections, please let me know so I can change that ASAP.
In my experience you're much better off starting with Azure Doc Intelligence or AWS Textract to first get the structure of the document (PDF). These tools are incredibly robust and do a great job with most of the common cases you can throw at them. From there you can use an LLM to interrogate and structure the data to your heart's delight.
Do they work for Bills of Lading yet? When I tested a sample of these bills a few years back (2022 I think), the results were not good at all. But I honestly wouldn't be surprised if they'd massively improved lately.
Otherwise it seems like a prompt building tool, or am I missing something here?
I see someone opened an issue for it so will fix now.
However, if you process, say, 1 million documents, you could sample and review a small percentage of them manually (a power calculation would help here). Assuming your random sample models the "distribution" (which may be tough to define/summarize) of the 1 million documents, you could then extrapolate your accuracy onto the larger set of documents without having to review each and every one.
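As a rough sketch of the arithmetic: the code below uses a precision-based sample-size formula (normal approximation with a finite population correction) as a simpler stand-in for a full power calculation, plus a Wilson score interval for the accuracy you observe in the reviewed sample. The function names and defaults (95% confidence, worst-case rate of 0.5) are assumptions for illustration, and it only holds if the sample is truly random.

```javascript
// Sample size needed to estimate an error rate within +/- `margin`,
// using the normal approximation with a finite population correction.
// Defaults: z = 1.96 (95% confidence), expectedRate = 0.5 (worst case).
function sampleSize(population, margin, z = 1.96, expectedRate = 0.5) {
  const n0 = (z * z * expectedRate * (1 - expectedRate)) / (margin * margin);
  return Math.ceil(n0 / (1 + (n0 - 1) / population));
}

// Wilson score interval for the accuracy observed in the reviewed sample.
function wilsonInterval(correct, reviewed, z = 1.96) {
  const p = correct / reviewed;
  const denom = 1 + (z * z) / reviewed;
  const center = (p + (z * z) / (2 * reviewed)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / reviewed + (z * z) / (4 * reviewed * reviewed))) / denom;
  return [center - half, center + half];
}

// How many of 1 million documents to review for a +/- 5% margin.
console.log(sampleSize(1_000_000, 0.05));
// Plausible accuracy range if 950 of 1000 reviewed documents were correct.
console.log(wilsonInterval(950, 1000));
```

Tightening the margin is what drives the cost: the required sample grows with the inverse square of the margin, while the size of the full document set barely matters once it is large.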
What I've noticed is that on scanned documents, where stamp text and handwriting are just as important as printed text, Gemini was way better than ChatGPT.
Of course, my prompts might have been an issue, but Gemini with very brief and generic queries produced significantly better results.
Alas, I am let down. It is an open-source tool that creates the prompt for the OpenAI API, and I can't go and send customer data to them.
I'm aware of https://github.com/clovaai/donut so I hoped this would be more like that.
https://github.com/DocumindHQ/documind/blob/d91121739df03867...
* Run locally or on premise for security/privacy reasons
* Support multiple LLMs and vector DBs - plug and play
* Support customisable schemas
* Method to check/confirm accuracy with source
* Cron jobs for automation
There is Unstract that solves the above requirements.
However, my main issue is that I need to work with confidential client data that cannot be uploaded to a third party. Setting up the open-source, locally hosted version of Unstructured was quite cumbersome due to the numerous additional packages and installation steps required.
While I’m open to the idea of parsing content with an LLM that has vision capabilities, data safety and confidentiality are critical for many applications. I think your project would go from good to great if it were possible to connect to Ollama and run locally.
That said, this is an excellent application! I can definitely see myself using it in other projects that don’t demand such stringent data confidentiality.
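For what a local option could look like, here is a minimal sketch that points an OpenAI-style chat request at a local Ollama server instead of api.openai.com. It assumes Ollama is running on its default port with its OpenAI-compatible endpoint enabled and that a vision-capable model such as `llava` has been pulled; the function names are hypothetical and nothing here is part of Documind.

```javascript
// Assumption: Ollama on its default port, exposing its OpenAI-compatible API.
const OLLAMA_BASE_URL = "http://localhost:11434/v1";

// Build the request for one page image; "llava" is an assumed local vision model.
function buildLocalRequest(base64Png, prompt, model = "llava") {
  return {
    url: `${OLLAMA_BASE_URL}/chat/completions`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [
          {
            role: "user",
            content: [
              { type: "text", text: prompt },
              { type: "image_url", image_url: { url: `data:image/png;base64,${base64Png}` } },
            ],
          },
        ],
      }),
    },
  };
}

// Send the request to the local server (Node 18+ for global fetch).
async function extractLocally(base64Png, prompt) {
  const { url, options } = buildLocalRequest(base64Png, prompt);
  const response = await fetch(url, options);
  if (!response.ok) throw new Error(`Local request failed: ${response.status}`);
  const json = await response.json();
  return json.choices[0].message.content;
}
```

Since the request shape is the same as for a hosted endpoint, a tool like this would mostly need a configurable base URL and model name to support local runs. Extraction quality with small local vision models is a separate question.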
PDFQuery
PyMuPDF (having more success with older versions, right now)
If you're dealing with unstructured data trapped in PDFs, Documind might be the tool you’ve been waiting for. It’s an open-source solution that simplifies the process of turning documents into clean, structured JSON data with the power of AI.
Key Features:
1. Customizable Data Extraction: Define your own schema to extract exactly the information you need from PDFs, with no unnecessary clutter.
2. Simple Input, Clean Output: Just provide a PDF link and your schema definition, and it returns structured JSON data, ready to integrate into your workflows.
3. Developer-Friendly: With a simple setup (`npm install documind`), you can get started right away and start automating tedious document processing tasks.
Whether you’re automating invoice processing, handling contracts, or working with any document-heavy workflows, Documind offers a lightweight, accessible solution. And since it’s open-source, you can customize it further to suit your specific needs.
Would love to hear if others in the community have tried it—how does it stack up for your use cases?
I haven't had issues with hallucinations. If you're interested, my email is in my bio.
const systemPrompt = `
Convert the following PDF page to markdown.
Return only the markdown with no explanation text. Do not include deliminators like '''markdown.
You must include all information on the page. Do not exclude headers, footers, or subtext.
`;
enthusiastically setting up a lounge chair
> OPENAI_API_KEY=your_openai_api_key
carrying it back apathetically