Show HN: spRAG – Open-source RAG implementation for challenging real-world tasks
69 points
16 days ago
| 9 comments
| github.com
Hey HN, I’m Zach from Superpowered AI (YC S22). We’ve been working in the RAG space for a little over a year now, and we’ve recently decided to open-source all of our core retrieval tech.

spRAG is a retrieval system that’s designed to handle complex real-world queries over dense text, like legal documents and financial reports. As far as we know, it produces the most accurate and reliable results of any RAG system for these kinds of tasks. For example, on FinanceBench, which is an especially challenging open-book financial question answering benchmark, spRAG gets 83% of questions correct, compared to 19% for the vanilla RAG baseline (which uses Chroma + OpenAI Ada embeddings + LangChain).
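
For a sense of what that baseline looks like, here's a minimal sketch of a vanilla RAG pipeline along those lines (the benchmark itself used LangChain; this uses the chromadb client directly for brevity, and the corpus, chunk size, and query are all illustrative):

    import chromadb
    from chromadb.utils import embedding_functions

    # Embed with OpenAI Ada, store and search in Chroma: fixed-size chunks,
    # no document context, no reranking.
    ada = embedding_functions.OpenAIEmbeddingFunction(
        api_key="sk-...", model_name="text-embedding-ada-002"
    )
    client = chromadb.Client()
    collection = client.create_collection("financebench", embedding_function=ada)

    documents = ["<full text of a 10-K filing>"]  # placeholder corpus
    chunks = [d[i:i + 1000] for d in documents for i in range(0, len(d), 1000)]
    collection.add(documents=chunks, ids=[str(i) for i in range(len(chunks))])

    # Retrieve the top 5 chunks and stuff them into the LLM prompt as-is.
    results = collection.query(
        query_texts=["What was the FY2022 gross margin?"], n_results=5
    )
    context = "\n\n".join(results["documents"][0])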

You can find more info about how it works and how to use it in the project’s README. We’re also very open to contributions. We especially need contributions around integrations (i.e. adding support for more vector DBs, embedding models, etc.) and around evaluation.

bashtoni
15 days ago
[-]
Interested to see how this performs against RAPTOR, which does summarisation and clustering.

https://github.com/profintegra/raptor-rag

https://github.com/langchain-ai/langchain/blob/master/cookbo...

reply
esafak
15 days ago
[-]
I'd replace the "challenging real-world tasks" in the title with "dense text, like financial reports and legal documents". It sounds less general but that's a good thing.

The repo is only two weeks old, and looks it, so how do you think spRAG distinguishes itself? This is a crowded space with more established players.

The "vanilla RAG" benchmark figure you cite is not convincing because it can not be verified. Please share your benchmarking code.

reply
zmccormick7
15 days ago
[-]
That's great feedback. I actually went back and forth between those two descriptions. I agree that "dense text, like financial reports and legal documents" is more precise. Those are the kinds of use cases this project is built for.

I want to keep this project tightly scoped to just retrieval over dense unstructured text, rather than trying to build a fully-featured RAG framework.

reply
_akhe
15 days ago
[-]
FWIW nobody has created a great JavaScript framework experience yet. The closest we have for RAG are LlamaIndexTS and Langchainjs, but both are full of bugs and support few LLMs. Their whole approach to supporting LLMs is writing bespoke wrappers for each.

Maybe it's the same on the Python side, but it feels like nobody has nailed the perfect LLM wrapper library yet. I would focus on dev experience: make it dead simple to load in files and go from 0 to 1.

reply
skanga
15 days ago
[-]
You mentioned that spRAG uses OpenAI for embeddings, Claude 3 Haiku for AutoContext, and Cohere for reranking. Can you explain why and how you made those choices?
reply
zmccormick7
15 days ago
[-]
Those are just the defaults, and spRAG is designed to be flexible in terms of the models you can use with it. For AutoContext (which is just a summarization task) Haiku offers a great balance of price and performance. Llama 3-8B would also be a great choice there, especially if you want something you can run locally. For reranking, the Cohere v3 reranker is by far the best performer on the market right now. And for embeddings, it's really a toss-up between OpenAI, Cohere, and Voyage.
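
Since AutoContext is just a summarization call under the hood, any capable model can slot in. Roughly what the default Haiku call looks like (the prompt wording here is illustrative, not our actual prompt):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def summarize_for_context(document_text: str) -> str:
        # Produce a short document-level summary to prepend to each chunk.
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": "In 2-3 sentences, describe what this document is "
                           "and what it covers:\n\n" + document_text[:20000],
            }],
        )
        return response.content[0].text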
reply
Cheer2171
15 days ago
[-]
I bet you'll get a lot more adoption if you put info about using it with local self-hosted LLMs there. I'll never trust a cloud service with the documents I want to RAG.
reply
zmccormick7
15 days ago
[-]
Agreed. I've gotten a lot of feedback along those lines today, so that's my top priority now.
reply
serjester
15 days ago
[-]
Does this mean you're winding down your business? Just curious what the motivation to open-source this was, given that this seems like you guys' core value add. Congrats on the launch.
reply
zmccormick7
15 days ago
[-]
That's a great question. I'll start with a little context: most of the users of our existing hosted platform are no-code/low-code developers who choose us because we're the simplest solution for building what they want to build (primarily because we have end-to-end workflows like Chat built-in). The improved retrieval performance is a nice-to-have for this group, but usually not the primary reason they choose us.

For the larger companies and venture-backed startups we've talked to, they almost universally want to own their RAG stack in-house or build on open-source frameworks, rather than outsource it to an end-to-end solution provider. So open-sourcing our core retrieval tech is our bid to appeal to these developers.

reply
uptownfunk
15 days ago
[-]
Uh, isn’t there a huge venture-backed startup that rhymes with spleen that contradicts this? Not saying you’re wrong, but... one of you is.
reply
zmccormick7
15 days ago
[-]
I think the difference is that they're building for end users, not developers.
reply
bitshaker
15 days ago
[-]
Amazing. I’m looking at building an app that looks over the employment sections of the legal code and tells you whether things are allowed or not.

Answering questions like:

Can my employer do X? As an employee in this country, what’s the minimum number of days off I can take?

And so on.

If your claims are true, then this will be exactly what I’m looking for.

reply
zmccormick7
15 days ago
[-]
I think spRAG should be pretty well suited for that use case. I think the biggest challenge will be generating specific search queries off of more general user inputs. You can look at the `auto_query.py` file for a basic implementation of that kind of system, but it'll likely require some experimentation and customization for your use case.
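
As a toy version of the idea (auto_query.py is the real implementation; the model choice and prompt here are just for illustration), you'd ask an LLM to decompose the user input into targeted queries:

    import json
    from openai import OpenAI

    client = OpenAI()

    def generate_search_queries(user_input: str, max_queries: int = 3) -> list[str]:
        # Turn a broad user question into specific queries to run against the KB.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": 'Return JSON like {"queries": ["..."]} with up to '
                           f"{max_queries} specific search queries for answering "
                           f"this question:\n{user_input}",
            }],
        )
        return json.loads(response.choices[0].message.content)["queries"]

    # e.g. "Can my employer make me work on public holidays?" might become
    # queries about holiday pay statutes, exemptions, and employee rights.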
reply
jwuphysics
15 days ago
[-]
How much do you expect AutoContext and clustering + reranking to help in cases where documents already have high-quality summaries? For context, I parse astrophysics research papers from arXiv and simply embed the paper abstracts (which must be of a certain size), then append (parts of) the rest of the paper for RAG.
reply
zmccormick7
15 days ago
[-]
The point of AutoContext is that you don't have to do that two-step process of first finding the right document and then finding the right section of that document. I think it's cleaner to do it this way, but it's not necessarily going to perform any better or worse. spRAG also has the RSE (relevant segment extraction) part, which is what identifies the right section(s) of the document. Whether or not that helps in your case will depend on how good a solution you already have for that.
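
If it helps, here's the gist of RSE as a toy illustration (this isn't our actual algorithm, just the shape of the problem): shift per-chunk relevance scores so weak chunks go negative, then take the contiguous run of chunks with the highest total score:

    def best_segment(scores: list[float], threshold: float = 0.5) -> tuple[int, int]:
        # Maximum-sum subarray over threshold-shifted scores: irrelevant
        # chunks become negative, so the best window stays tight.
        adjusted = [s - threshold for s in scores]
        best_sum, best_range = float("-inf"), (0, 1)
        current_sum, start = 0.0, 0
        for i, value in enumerate(adjusted):
            if current_sum <= 0:
                current_sum, start = value, i
            else:
                current_sum += value
            if current_sum > best_sum:
                best_sum, best_range = current_sum, (start, i + 1)
        return best_range  # half-open range of chunk indices

    # e.g. reranker scores where chunks 2-4 form the best segment:
    print(best_segment([0.1, 0.2, 0.9, 0.8, 0.7, 0.1]))  # -> (2, 5)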
reply
jwuphysics
15 days ago
[-]
That makes sense and I'll run a few evals. Many thanks for open sourcing your work!
reply
cyanydeez
15 days ago
[-]
I'm planning a RAG system, and this seems to implement the "AutoContext" step I was expecting to build myself.

For the library I have, it's not just the file name that's descriptive; multiple folder names are too, especially if I build a data dictionary.

Have you looked into a simple tagging/conversion dictionary that preprocesses the context?

reply
zmccormick7
15 days ago
[-]
In our AutoContext implementation, the document title gets included with the generated summary. So if you have files that are organized into nested folders with descriptive names, you can input that full file path as the `document_title`. I did this with one of our internal benchmarks and it worked really well.
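
In code, that looks something like this (treat the method name as illustrative; `document_title` is the real parameter, but check the README for the exact API):

    # Illustrative call: `kb` is a spRAG knowledge base set up per the README.
    path = "contracts/2023/acme/master_services_agreement.txt"
    kb.add_document(
        text=open(path).read(),
        document_title=path,  # the full path becomes the title AutoContext sees
    )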
reply
syndacks
14 days ago
[-]
Hi Zach, how do you think this architecture would perform on one longer document, i.e. a novel of >50k but <100k words? The queries would be about that one long document, as opposed to multiple documents. Any tips on how to approach my use case? Thanks!
reply
zmccormick7
13 days ago
[-]
That should work well with the default parameters, so you shouldn't have to do anything special.
reply
TheAnkurTyagi
15 days ago
[-]
Nice to know, but when you say "challenging real-world tasks", what are some use cases?

Are you automating the end-to-end RAG pipeline?

reply
zmccormick7
15 days ago
[-]
That description is a little vague, so I need to improve that. The use cases we're focused on are ones with 1) dense unstructured text, like legal documents, financial reports, and academic papers; and 2) challenging queries that go beyond simple factoid question answering. Those kinds of use cases are where we see existing RAG systems struggle the most.
reply