We open-sourced one of the key components: PageIndex, a hierarchical indexing system that transforms large documents (like financial reports, regulatory documents, or textbooks) into semantic trees optimized for reasoning-based RAG.
Some highlights:
- Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees — like a smart table of contents.
- Precise Referencing: Each node includes a summary and exact physical page numbers.
- Natural Segmentation: Nodes align with document sections, preserving context — no arbitrary chunking.
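To make the three points above concrete, here is a minimal sketch of what such a tree could look like. The `Node` class, its field names, and the `outline` and `locate` helpers are assumptions made up for illustration, not PageIndex's actual schema or API, and the sample report is invented.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical field names for illustration; not PageIndex's real schema.
    title: str
    summary: str
    start_page: int  # first physical page of the section
    end_page: int    # last physical page of the section, inclusive
    children: list[Node] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> list[str]:
    """Render the tree as indented table-of-contents lines."""
    lines = [f"{'  ' * depth}{node.title} (pp. {node.start_page}-{node.end_page})"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


def locate(node: Node, page: int) -> list[str]:
    """Return titles from the root down to the deepest section containing `page`."""
    if not (node.start_page <= page <= node.end_page):
        return []
    for child in node.children:
        path = locate(child, page)
        if path:
            return [node.title] + path
    return [node.title]


# Invented sample document: sections follow the document's own structure.
report = Node("Annual Report", "Full-year results and outlook.", 1, 120, [
    Node("Financial Statements", "Audited statements.", 40, 90, [
        Node("Balance Sheet", "Assets and liabilities.", 42, 47),
        Node("Cash Flow", "Operating, investing, financing.", 48, 55),
    ]),
    Node("Risk Factors", "Material risks.", 91, 110),
])

print("\n".join(outline(report)))
print(locate(report, 50))
```

Because each node carries its own page range, a retrieved node can be cited by exact pages, and the path returned by `locate` keeps the enclosing sections as context.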
We've used PageIndex for financial document analysis with reasoning-based RAG and saw significant improvements in retrieval accuracy compared to vector-based systems.
Would love any feedback — especially thoughts on reasoning-based RAG, or ideas for where PageIndex could be applied!
I'm currently evaluating systems that extend RAG as your PageIndex project does, with an eye on adaptability to new information.
A good portion of my work involves legal issues and case law, and the US is going through a lot of legal change under the new administration. I need a system that can ingest new information that imposes new rules on how information is handled, and those new rules must take precedence over any similar rules already in the knowledge base.
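To make the requirement concrete, here is a rough sketch of the precedence behavior I have in mind. Everything in it is an assumption for illustration: the `Rule` and `RuleBook` names, the idea that "similar" rules share a `topic` key, and the choice to let the latest effective date win. It is not how PageIndex or any GraphRAG variant works.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Rule:
    # Hypothetical schema, for illustration only.
    topic: str       # what the rule governs, e.g. "data-retention"
    text: str
    effective: date  # date the rule takes effect


class RuleBook:
    def __init__(self) -> None:
        self._by_topic: dict[str, list[Rule]] = {}

    def ingest(self, rule: Rule) -> None:
        """Add a rule, touching only the entries for its own topic."""
        rules = self._by_topic.setdefault(rule.topic, [])
        rules.append(rule)
        # Stable sort: among equal dates, the most recently ingested rule stays last.
        rules.sort(key=lambda r: r.effective)

    def current(self, topic: str, as_of: date) -> Rule | None:
        """Return the rule in force for `topic` on `as_of`, or None if there is none."""
        in_force = [r for r in self._by_topic.get(topic, []) if r.effective <= as_of]
        return in_force[-1] if in_force else None


book = RuleBook()
old = Rule("data-retention", "Keep records for 7 years.", date(2020, 1, 1))
new = Rule("data-retention", "Keep records for 3 years.", date(2025, 3, 1))
book.ingest(new)
book.ingest(old)  # ingestion order does not matter

print(book.current("data-retention", date(2025, 6, 1)))  # the 2025 rule
print(book.current("data-retention", date(2024, 6, 1)))  # the 2020 rule
```

The older rule is kept rather than deleted, so a query about an earlier date still resolves to the rule that applied then, and ingesting a new rule only re-sorts the rules for one topic.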
Ingesting this new information and resolving it against the larger knowledge base needs to be efficient too. The original GraphRAG is expensive to begin with, and does not appear to have any optimized handling for ingesting new, conflicting information. The GraphRAG variants that are getting a lot of attention now appear to address the inefficiency of the original implementation. Where does PageIndex sit within this group of similar offerings?