FilterHN

RAG Without Vectors – PageIndex: Reasoning-Based Document Indexing

11 points | 1 day ago | 3 comments

We were frustrated by vector-based RAG systems that rely on semantic similarity and often fail on long, domain-specific documents. In these contexts, domain-specific terminology tends to be semantically similar, making it hard to retrieve the exact content users need. It’s also difficult to incorporate expert knowledge or user preferences effectively. So we started exploring a more reasoning-driven approach to RAG. Inspired by the tree search algorithm in AlphaGo, we came up with a reasoning-based RAG system that uses tree search to guide retrieval.

We open-sourced one of the key components: PageIndex, a hierarchical indexing system that transforms large documents (like financial reports, regulatory documents, or textbooks) into semantic trees optimized for reasoning-based RAG.

Some highlights:

- Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees — like a smart table of contents.

- Precise Referencing: Each node includes a summary and exact physical page numbers.

- Natural Segmentation: Nodes align with document sections, preserving context — no arbitrary chunking.
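To make the highlights above concrete, here is a minimal sketch of what such a tree could look like in Python. The `Node` class, its field names, the `outline` helper, and the sample report are all illustrative assumptions, not PageIndex's actual schema or output format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One document section (hypothetical schema, not PageIndex's real format)."""
    title: str
    summary: str
    start_page: int  # first physical page of the section, 1-based
    end_page: int    # last physical page of the section, inclusive
    children: List["Node"] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as an indented table of contents with page ranges."""
    lines = [f"{'  ' * depth}{node.title} (pp. {node.start_page}-{node.end_page})"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up example document; titles, summaries, and page numbers are invented.
report = Node("Annual Report", "Full-year results.", 1, 120, [
    Node("Risk Factors", "Market, credit and liquidity risks.", 10, 34, [
        Node("Liquidity Risk", "Funding sources and covenants.", 25, 34),
    ]),
    Node("Financial Statements", "Audited statements and notes.", 35, 120),
])

if __name__ == "__main__":
    print("\n".join(outline(report)))
```

Each node maps to a real section rather than a fixed-size chunk, so a child's page range always falls inside its parent's.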

We've used PageIndex for financial document analysis with reasoning-based RAG and saw significant improvements in retrieval accuracy compared to vector-based systems.
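For readers wondering what retrieval over such a tree might look like, here is a rough sketch under heavy assumptions. It is a simple greedy (beam) descent, not the AlphaGo-style search mentioned above and not our production retriever. The `Node` class, `tree_search`, and `keyword_score` are hypothetical names, and `keyword_score` is only a stand-in for the LLM relevance judgment a reasoning-based system would make at each step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical tree node: a section with a summary and its page range."""
    title: str
    summary: str
    pages: range
    children: List["Node"] = field(default_factory=list)


def keyword_score(query: str, node: Node) -> float:
    """Stand-in for an LLM judgment: share of query words in title + summary."""
    words = set(query.lower().split())
    text = set(f"{node.title} {node.summary}".lower().split())
    return len(words & text) / len(words) if words else 0.0


def tree_search(root: Node, query: str,
                score: Callable[[str, Node], float] = keyword_score,
                beam: int = 1) -> List[Node]:
    """Descend level by level, keeping the `beam` best children of each node."""
    frontier, hits = [root], []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node.children:  # reached a leaf section: retrieve it
                hits.append(node)
                continue
            ranked = sorted(node.children,
                            key=lambda c: score(query, c), reverse=True)
            next_frontier.extend(ranked[:beam])
        frontier = next_frontier
    return hits


# Made-up example tree; all titles, summaries, and pages are invented.
root = Node("Annual Report", "full year results", range(1, 121), [
    Node("Risk Factors", "market credit and liquidity risk", range(10, 35), [
        Node("Market Risk", "interest rate and fx exposure", range(10, 25)),
        Node("Liquidity Risk", "funding sources and covenants", range(25, 35)),
    ]),
    Node("Financial Statements", "audited statements and notes", range(35, 121)),
])

hits = tree_search(root, "liquidity risk covenants")
```

In this toy example, `hits` contains only the "Liquidity Risk" section, covering pages 25 through 34. Swapping `keyword_score` for a function that asks an LLM to rank the children is where expert knowledge or user preferences could enter the prompt.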

Would love any feedback — especially thoughts on reasoning-based RAG, or ideas for where PageIndex could be applied!


bsenftner

9 hours ago


Very interesting work. What are your opinions of GraphRAG and the variations?

I'm currently evaluating systems that extend RAG as your PageIndex project does, with an eye on adaptability to new information.

A good portion of my work involves legal issues and case law. With the US going through a lot of legal transformations under the new administration, I am seeking a system that can ingest new information that imposes new rules on the handling of information, and those new rules need to take precedence over any similar rules already in the knowledge base.

Ingesting this new information and resolving it logically within the larger knowledge base needs to be efficient too. The original GraphRAG is expensive to begin with, and does not appear to have any optimized handling for ingesting new, conflicting information. The GraphRAG variants that are getting a lot of attention now appear to be addressing the lack of efficiency in the original GraphRAG implementation. Where does PageIndex sit within this group of similar offerings?


Imanari

5 hours ago


Interesting work! How do you construct the relationship between nodes if not all documents fit into context?


vectify_AI

1 day ago


GitHub repo: https://github.com/VectifyAI/PageIndex/ Open to feedback and suggestions.