Show HN: Open-Source Colab Notebooks to Implement Advanced RAG Techniques

98 points

by hbamoria

10 months ago

| 5 comments

| github.com

Hey HN fam,

We’ve seen developers spend a lot of time implementing advanced RAG techniques from scratch.

While these techniques are essential for improving performance, their implementation requires a lot of effort and testing!

To help with this process, our team (Athina AI) has released Open-Source Advanced RAG Cookbooks.

This is a collection of ready-to-run Google Colab notebooks featuring the most commonly implemented techniques.

Please show us some love by starring the repo if you find this useful!

Oras

10 months ago

One of the challenges I have with RAG is excluding tables of contents, headers/footers, and appendices from PDFs.

Is there a tool/technique to achieve this? I’m aware that I can use LLMs to do so, or read all pages and find identical text (header/footer), but I want to keep the page number as part of the metadata to ensure better citation on retrieval.

prsdm

10 months ago

This might help you: https://github.com/langchain-ai/langchain/blob/master/cookbo...

Oras

10 months ago

Thank you. This is a mix of OCR and LLM; I was wondering whether there is a library that avoids both.

A better approach would be to use Textract, as it maintains the flow, for example when a table runs across multiple pages.

Btw, Tesseract is not that good at extracting accurate data from tables. Use it with caution, especially in a financial context.

I have made an open source tool that shows the data missed by Tesseract and EasyOCR: https://github.com/orasik/parsevision/

prsdm

10 months ago

Nice, I really liked it!

jonathan-adly

10 months ago

I would check out vision models as a technique to get around OCR errors.

ColPali is the standard implementation & SOTA. Much better than OCR. We maintain a ready-to-go retrieval API that implements this: https://github.com/tjmlabs/ColiVara

throwup238

10 months ago

You’ll need other heuristics for ToC and indices but headers/footers are easy to detect via n-gram deduplication. You’ll want to figure out some rolling logic to handle chapter changes though.
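One way to read the suggestion above, as a minimal sketch rather than a tested tool: treat the first and last few lines of each page as header/footer candidates, normalize digits so that "Page 3" and "Page 14" compare equal, and drop any candidate that recurs on most pages in a rolling window of neighbours. The window is what lets a running header change at a chapter boundary. The function names, the `edge_lines`, `window`, and `min_share` parameters, and the assumption that pages arrive as lists of text lines are all illustrative. The page number is kept with each cleaned page so it can be stored as citation metadata.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Collapse whitespace and replace digit runs so page counters compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_repeated_edges(pages, edge_lines=2, window=5, min_share=0.6):
    """Drop top/bottom lines that recur across neighbouring pages.

    pages: list of pages, each a list of already-extracted text lines.
    Returns a list of {"page": n, "text": ...} dicts with 1-based page numbers.
    """
    cleaned = []
    for i, lines in enumerate(pages):
        # Rolling window of neighbouring pages, so a header that changes
        # at a chapter boundary is still caught within its own chapter.
        lo, hi = max(0, i - window), min(len(pages), i + window + 1)
        counts = Counter()
        for other in pages[lo:hi]:
            edges = other[:edge_lines] + other[-edge_lines:]
            # A set, so each page votes at most once per normalized line.
            counts.update({normalize(l) for l in edges if l.strip()})
        threshold = min_share * (hi - lo)
        keep = []
        for j, line in enumerate(lines):
            at_edge = j < edge_lines or j >= len(lines) - edge_lines
            if at_edge and counts[normalize(line)] >= threshold:
                continue  # repeated edge line: treat as header/footer
            keep.append(line)
        cleaned.append({"page": i + 1, "text": "\n".join(keep)})
    return cleaned
```

Tables of contents and indices would still need separate heuristics, as the comment says; this only addresses repeated page furniture.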

ellisv

10 months ago

Headers/footers are also positional.
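The positional idea can be sketched as a simple margin filter, assuming a layout-aware extractor that reports a vertical position for each text block. The `Block` shape, the field names, and the 8% margins below are hypothetical choices for illustration, not the output format of any particular library.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    y0: float           # top edge, measured from the top of the page
    y1: float           # bottom edge, measured from the top of the page
    page: int           # page number, kept for citation metadata
    page_height: float


def body_blocks(blocks, top_margin=0.08, bottom_margin=0.08):
    """Keep only blocks that lie fully inside the body band of their page."""
    kept = []
    for b in blocks:
        top = b.page_height * top_margin
        bottom = b.page_height * (1 - bottom_margin)
        if b.y0 >= top and b.y1 <= bottom:
            kept.append(b)
    return kept
```

In practice the margins would likely need tuning per document family, and combining this with a repetition check would reduce the risk of cutting real body text that happens to sit near the page edge.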

jonathan-adly

10 months ago

I would strongly advise against people learning based on LangChain.

It is abstraction hell, and will set you back thousands of engineering hours the moment you want to do something differently.

RAG is actually a very simple thing to do; there is just too much VC money in the space & too many complexity merchants.

The best way to learn is outside of notebooks (the hard parts of RAG are all around the actual product), using as few frameworks as possible.

My preferred stack is FastAPI/NumPy/Redis. Simple as pie. You can swap Redis for pgvector/Postgres when ready for the next complexity step.
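To illustrate how small the retrieval core of such a stack can be, here is a minimal sketch of cosine-similarity top-k search. It uses only the standard library so it runs anywhere; in the stack described above, NumPy would replace the loops with a single matrix product, and Redis or Postgres would hold the vectors. The `top_k` name and the in-memory list of `(text, embedding)` pairs are assumptions, and the embeddings are expected to come from whatever embedding model you use.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=3):
    """store: list of (chunk_text, embedding) pairs.

    Returns the k highest-scoring (score, chunk_text) pairs, best first.
    """
    scored = [(cosine(query_vec, emb), text) for text, emb in store]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]
```

A brute-force scan like this is usually adequate for small corpora; a vector index only becomes necessary once the scan is too slow.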

ellisv

10 months ago

I'd like to hear more about this – both your reasoning against LangChain and suggestions for alternatives.

My experience with LangChain has been a mixed bag. On the one hand it has been very easy to get up and running quickly. Following their examples actually works!

Trying to go beyond the examples to mix and match concepts was a real challenge because of the abstractions. As with any young framework in a fast moving field the concepts and abstractions seem to be changing quickly, thus examples within the documentation show multiple ways to do something but it isn't clear which is the "right" way.

jackmpcollins

10 months ago

I'd be really interested to hear what abstractions you would find useful for RAG. I'm building magentic which is focused on structured outputs and streaming, but also enables RAG [0], though currently has no specific abstractions for it.

[0] https://magentic.dev/examples/rag_github/

pchangr

10 months ago

Those were exactly my thoughts. However, I haven’t been able to find much material on how to implement this without relying on LangChain. Do you know of any beginner material I could use to fill my gaps?

dmezzetti

10 months ago

An alternative you can try is txtai (https://github.com/neuml/txtai).

RAG section: https://github.com/neuml/txtai?tab=readme-ov-file#retrieval-...

Disclaimer: I'm the primary developer

jonathan-adly

10 months ago

I will do it. You are right: a lot of the material in this space is basically people selling their complex tools, with learning as a lower priority.

memhole

10 months ago

Start with ignoring 90% of the stuff you read about and realize you’re only manipulating strings to send to an API.
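Taken literally, the "manipulating strings" step might look like the sketch below: retrieved chunks are numbered, tagged with their page, and pasted into a prompt string that is then sent to whichever model API you use. The `build_prompt` name, the chunk dictionary shape, the instruction wording, and the `max_chars` budget are all illustrative assumptions.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Assemble a prompt string from retrieved chunks.

    chunks: list of dicts with 'text' and 'page' keys (a hypothetical shape).
    Chunks are added in order until the character budget is exhausted.
    """
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] (page {chunk['page']}) {chunk['text']}"
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    context = "\n".join(parts)
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
```

Everything after this point is an ordinary HTTP call with the returned string as the message body.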

Jet_Xu

10 months ago

Interesting discussion! While RAG is powerful for document retrieval, applying it to code repositories presents unique challenges that go beyond traditional RAG implementations. I've been working on a universal repository knowledge graph system, and found that the real complexity lies in handling cross-language semantic understanding and maintaining relationship context across different repo structures (mono/poly).

Has anyone successfully implemented a language-agnostic approach that can:

1. Capture implicit code relationships without heavy LLM dependency?
2. Scale efficiently for large monorepos while preserving fine-grained semantic links?
3. Handle cross-module dependencies and version evolution?

Current solutions like AST-based analysis + traditional embeddings seem to miss crucial semantic contexts. Curious about others' experiences with hybrid approaches combining static analysis and lightweight ML models.
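For reference, the AST-based baseline mentioned above can be sketched in a few lines for a single language. This example uses Python's standard `ast` module to collect caller/callee pairs without any LLM; it is Python-only, so it does not address the language-agnostic part of the question, and the `call_edges` name is an illustrative choice.

```python
import ast


def call_edges(source: str):
    """Return sorted (caller, callee) pairs for plain-name calls inside functions.

    Method calls such as obj.method() are skipped, and calls inside a nested
    function are attributed to both the inner and the enclosing function.
    """
    tree = ast.parse(source)
    edges = set()
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                edges.add((func.name, node.func.id))
    return sorted(edges)
```

Edges like these capture explicit calls only; the implicit relationships the comment asks about (dynamic dispatch, configuration-driven wiring, cross-language boundaries) are exactly what this kind of static pass misses.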

krawczstef

10 months ago

+1 for vanilla code without LangChain.

hbamoria

10 months ago

I believe you're looking for notebooks w/o LangChain. We plan to publish them in the next few days :)

imworkingrn

10 months ago

What's wrong with LangChain?

ErikBjare

10 months ago

I haven't used it in a year, but my experience was it frequently broke in all sorts of ways. I have since avoided it like the plague.

imworkingrn

10 months ago

I hear you. Had the same experience. It's matured a lot since then though. Got back to it a few weeks ago and it feels surprisingly stable.

chompychop

10 months ago

Does it still have the "abstraction hell" issue when trying to work with it for custom, non out-of-the-box use cases?

prsdm

10 months ago

It's much more stable now.

sauwan

10 months ago

Does it still put you in dependency hell though, where you can't add new packages without causing tons of version conflicts?

efriis

10 months ago

Howdy! Erick from LangChain here. If anyone is seeing version conflicts on particular packages, please let me know!

These usually stem from overly strict constraints in the underlying SDKs for the integrations, and in general we've been pretty successful asking for those constraints to be loosened. The main "problem" constraint we've seen in the past has been on httpx. Curious if you've seen others!

chompychop

10 months ago

Huh? All of their notebooks use LangChain.

dmezzetti

10 months ago

Thanks for sharing.

If you want notebooks that do some of this with local open models: https://github.com/neuml/txtai/tree/master/examples and here: https://gist.github.com/davidmezzetti

prsdm

10 months ago

Thanks for sharing these resources! We’ll definitely take a look.