spRAG is a retrieval system that’s designed to handle complex real-world queries over dense text, like legal documents and financial reports. As far as we know, it produces the most accurate and reliable results of any RAG system for these kinds of tasks. For example, on FinanceBench, which is an especially challenging open-book financial question answering benchmark, spRAG gets 83% of questions correct, compared to 19% for the vanilla RAG baseline (which uses Chroma + OpenAI Ada embeddings + LangChain).
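To make the comparison concrete, here is a toy sketch of what the "vanilla RAG" retrieval step looks like: split documents into fixed-size chunks, embed them, and return the top-k chunks by cosine similarity to the query. This is not spRAG's API; a bag-of-words counter stands in for a real embedding model like Ada, and the chunk size and k are arbitrary.

```python
# Toy vanilla-RAG retrieval: chunk, embed, rank by cosine similarity.
# Bag-of-words embeddings are a stand-in for a real embedding model.
from collections import Counter
from math import sqrt

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(text, size=50):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query, docs, k=2):
    chunks = [c for d in docs for c in chunk(d)]
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

The failure mode on dense documents is visible even here: fixed-size chunks cut across sections, so the retrieved text often lacks the surrounding context needed to answer the question.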
You can find more info about how it works and how to use it in the project’s README. We’re also very open to contributions. We especially need contributions around integrations (i.e. adding support for more vector DBs, embedding models, etc.) and around evaluation.
https://github.com/profintegra/raptor-rag https://github.com/langchain-ai/langchain/blob/master/cookbo...
The repo is only two weeks old, and looks it, so how do you think spRAG distinguishes itself? This is a crowded space with more established players.
The "vanilla RAG" benchmark figure you cite is not convincing because it cannot be verified. Please share your benchmarking code.
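For what it's worth, the shareable part of such a benchmark can be tiny. A hedged sketch, assuming a list of (question, expected answer) pairs and any answer function (spRAG, the vanilla baseline, etc.); real benchmarks like FinanceBench grade answers with an LLM judge rather than a substring check, so this only shows the shape of a verifiable harness:

```python
# Minimal benchmark harness: report accuracy of an answer function over
# question/answer pairs. Substring matching is a crude stand-in for the
# LLM-based grading a real benchmark would use.
def accuracy(qa_pairs, answer_fn):
    correct = sum(
        1 for question, expected in qa_pairs
        if expected.lower() in answer_fn(question).lower()
    )
    return correct / len(qa_pairs)
```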
I want to keep this project tightly scoped to just retrieval over dense unstructured text, rather than trying to build a fully-featured RAG framework.
Maybe it's the same on the Python side, but it feels like nobody has nailed the perfect LLM wrapper library yet. I would focus on dev experience - make it dead simple to load in files and use it from 0-1.
For the larger companies and venture-backed startups we've talked to, they almost universally want to own their RAG stack in-house or build on open-source frameworks, rather than outsource it to an end-to-end solution provider. So open-sourcing our core retrieval tech is our bid to appeal to these developers.
Answering questions like:
Can my employer do X? As an employee in this country, what is the minimum number of days off I can take?
And so on.
If your claims are true, then this will be exactly what I’m looking for.
For the library I have, it's not just the file name: multiple folder names are descriptive too, especially if I build a data dictionary.
Have you looked into a simple tagging/conversion dictionary that preprocesses the context?
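Something like this, say: expand cryptic folder names via a hand-built data dictionary and prepend them to each chunk before embedding, so the retriever sees the descriptive context. The dictionary entries and path layout below are invented examples, not anything from the project.

```python
# Sketch of a path-tagging preprocessor: map folder names through a data
# dictionary and prepend the expanded tags to each chunk's text.
from pathlib import PurePosixPath

# Hypothetical data dictionary mapping folder names to descriptions.
DATA_DICTIONARY = {
    "hr": "human resources policies",
    "benefits": "employee benefits",
    "pto": "paid time off",
}

def tag_chunk(path, chunk_text):
    folders = PurePosixPath(path).parts[:-1]  # drop the file name itself
    tags = [DATA_DICTIONARY.get(f.lower(), f) for f in folders]
    return "Context: " + " > ".join(tags) + "\n" + chunk_text
```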
Are you automating the end-to-end RAG pipeline?