After tons of trial and error (embedding huge datasets, mixing vector and text search, handling concurrency, and dodging hallucinations), I decided to document it all in a book. It'll be live on Manning.com's Early Access soon (March 27th). If you're tackling large-scale RAG or have questions about my approach (the struggles, the successes), feel free to ask. I'm happy to share lessons, config ideas, or gotchas so you can avoid the pitfalls I hit along the way.
I hope all that helps; let me know if you have any other questions!
What tradeoffs? It is fast and accurate, but it does get expensive when you have over 50 million records.
Second, we use a search service, and vectors are treated as supplementary to the text search, so chunking doesn't matter as much. We usually take an entire PDF page and embed it, regardless of how the data on that page is structured, and we keep track of the document name and page number. For SQL records, we just turn each record into a text string and embed that.
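To make that concrete, here's a rough sketch of what page-level ingestion can look like with the OpenAI embeddings API and the Azure AI Search Python SDK. The index name, field names, and embedding model are placeholders I picked for the example, not the exact schema we use.

```python
# Rough ingestion sketch: embed whole PDF pages and SQL rows as-is, then
# upload them to an Azure AI Search index alongside the raw text.
# Index name, field names, and the embedding model are illustrative placeholders.
from pypdf import PdfReader
from openai import OpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="docs",  # hypothetical index with content/content_vector/doc_name/page_number fields
    credential=AzureKeyCredential("<search-admin-key>"),
)

def embed(text: str) -> list[float]:
    """One embedding per page or record; no sub-page chunking."""
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def index_pdf(path: str, doc_name: str) -> None:
    """Embed each page of a PDF, keeping the document name and page number."""
    reader = PdfReader(path)
    docs = []
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        docs.append({
            "id": f"{doc_name}-{page_num}",
            "content": text,
            "content_vector": embed(text),
            "doc_name": doc_name,
            "page_number": page_num,
        })
    search_client.upload_documents(documents=docs)

def index_sql_record(record: dict, record_id: str) -> None:
    """Flatten a SQL row into one text string ('col: value | ...') and embed it."""
    text = " | ".join(f"{col}: {val}" for col, val in record.items())
    search_client.upload_documents(documents=[{
        "id": record_id,
        "content": text,
        "content_vector": embed(text),
        "doc_name": "sql",
        "page_number": 0,
    }])
```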
Our stack was just Python, AutoGen for the agents, and, as I mentioned, Azure AI Search. We use Azure Web Apps for the backend and OpenAI models for the generation. Great questions!
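If you want a feel for how the agent side fits together, here's a minimal sketch using the pyautogen 0.2-style API. The agent names, the stubbed search_documents tool, and the model are placeholders for illustration, not our exact setup.

```python
# Minimal AutoGen sketch: one assistant agent that can call a retrieval tool.
# Agent names, the tool body, and the model are illustrative placeholders.
from autogen import AssistantAgent, UserProxyAgent, register_function

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "<openai-key>"}]}

assistant = AssistantAgent(
    "rag_assistant",
    llm_config=llm_config,
    system_message="Answer questions using the search_documents tool; cite document name and page.",
)
user_proxy = UserProxyAgent(
    "user",
    human_input_mode="NEVER",     # run unattended
    code_execution_config=False,  # tool calls only, no code execution
)

def search_documents(query: str) -> str:
    """Placeholder retrieval tool; in practice this would query the search index."""
    return "doc_name p.N: relevant page text goes here"

register_function(
    search_documents,
    caller=assistant,     # the LLM decides when to call the tool
    executor=user_proxy,  # the proxy actually runs it
    name="search_documents",
    description="Hybrid search over the indexed documents.",
)

user_proxy.initiate_chat(
    assistant,
    message="What does the onboarding guide say about access requests?",
    max_turns=2,
)
```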
The main program is hosted on Azure Web Apps, the search is Azure AI Search, we use AutoGen for the agents, and we use OpenAI for the generation. Azure has a lot of tools that support AI and search, so we use those too.
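For the query side, here is a rough sketch of how a hybrid (text plus vector) search against Azure AI Search can feed an OpenAI completion. Again, the index name, field names, and models are placeholders rather than our production config.

```python
# Rough query-side sketch: hybrid text + vector search, then grounded generation.
# Index name, field names, and models are illustrative placeholders.
from openai import OpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

openai_client = OpenAI()
search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="docs",
    credential=AzureKeyCredential("<search-query-key>"),
)

def answer(question: str) -> str:
    # Embed the question and run text + vector search in one request;
    # the text query does most of the work, the vector query supplements it.
    q_vec = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    results = search_client.search(
        search_text=question,
        vector_queries=[
            VectorizedQuery(vector=q_vec, k_nearest_neighbors=5, fields="content_vector")
        ],
        select=["content", "doc_name", "page_number"],
        top=5,
    )
    context = "\n\n".join(
        f"[{r['doc_name']} p.{r['page_number']}] {r['content']}" for r in results
    )
    # Generate an answer grounded in the retrieved pages.
    chat = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer only from the provided context and cite document and page."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return chat.choices[0].message.content
```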