Full text search or even grep/rg are a lot faster and cheaper to work with - no need to maintain a vector database index - and turn out to work really well if you put them in some kind of agentic tool loop.
The big benefit of semantic search was that it could handle fuzzy searching - returning results that mention dogs if someone searches for canines, for example.
Give a good LLM a search tool and it can come up with searches like "dog OR canine" on its own - and refine those queries over multiple rounds of searches.
Plus it means you don't have to solve the chunking problem!
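The loop described above can be sketched as a toy, with a hand-rolled keyword matcher standing in for grep/FTS and a hard-coded synonym table standing in for the LLM's query rewriting (all names and documents here are made up for illustration):

```python
# Toy agentic search loop: a keyword matcher plus iterative query
# expansion, standing in for grep/full-text search driven by an LLM.

DOCS = {
    1: "my canine refuses to walk in the rain",
    2: "cats sleep most of the day",
    3: "training a puppy takes patience",
}

def keyword_search(query: str) -> set[int]:
    """OR-match any query term against the documents."""
    terms = query.lower().split(" or ")
    return {
        doc_id for doc_id, text in DOCS.items()
        if any(term.strip() in text for term in terms)
    }

def refine(query: str) -> str:
    """Stand-in for the LLM proposing a broader query."""
    synonyms = {"dog": "dog OR canine OR puppy"}
    return synonyms.get(query, query)

def agentic_search(query: str, max_rounds: int = 3) -> set[int]:
    hits: set[int] = set()
    for _ in range(max_rounds):
        hits |= keyword_search(query)
        if hits:
            break
        query = refine(query)  # no results: widen the query and retry
    return hits
```

Here `agentic_search("dog")` finds nothing on the first round, widens to "dog OR canine OR puppy", and then matches documents 1 and 3 — the same "dog OR canine" trick, minus the actual model.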
For most cases, though, sticking with BM25 is likely to be "good enough" and a whole lot cheaper to build and run.
Anthropic found embeddings + BM25 (keyword search) gave the best results. (Well, after contextual summarization, and fusion, and reranking, and shoving the whole thing into an LLM...)
But sadly they didn't say how BM25 did on its own, which is the really interesting part to me.
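For reference, the "fusion" step in that kind of hybrid pipeline is usually reciprocal rank fusion, which is simple enough to sketch in a few lines (k=60 is the conventional smoothing constant; the doc IDs below are made up):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists: each document scores
    1/(k + rank) per list it appears in, and scores are summed."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Merge a BM25 ranking with an embedding ranking:
bm25_hits = ["d1", "d2", "d3"]
vector_hits = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# → ['d1', 'd3', 'd2', 'd4']
```

Documents that appear near the top of both lists win, without needing to calibrate BM25 scores against cosine similarities.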
In my own (small scale) tests with embeddings, I found that I'd be looking right at the page that contained the literal words in my query and embeddings would fail to find it... Ctrl+F wins again!
Embeddings just aren't the most interesting thing here if you're running a frontier foundation model.
Unless I've misunderstood your post and you're already doing some form of this in your pipeline, you should see a dramatic improvement in performance once you implement it.
Why?
- developer oriented (easy to read Python and uses pydantic-ai)
- benchmarks available
- docling with advanced citations (on branch)
- supports deep research agent
- genuinely open source, from a long-term committed developer (not fly-by-night)
But I'm lazy and assumed that someone has already built such a thing. I'm just not aware of this "Wikipedia-RAG-in-a-box".
Back in 2023 when I compared semantic search to lexical search (tantivy; BM25), I found the search results to be marginally different.
Even if semantic search has slightly more recall, does the problem of context warrant this multi-component, homebrew search engine approach?
By what important measure does it outperform a lexical search engine? Is the engineering time worth it?
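For anyone weighing that trade-off, it helps to see how little machinery BM25 itself actually is. Here's a compact stdlib-only version of the Okapi scoring formula (with the usual k1=1.5, b=0.75 defaults); engines like tantivy add tokenization, stemming, and a fast inverted index on top of this:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25: rewards term frequency (saturating via k1) and
    term rarity (idf), normalized for document length (via b)."""
    tokenized = [doc.lower().split() for doc in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    N = len(tokenized)
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            n = sum(1 for d in tokenized if term in d)  # document frequency
            idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
            f = tf[term]  # term frequency in this document
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

With `docs = ["the dog barks", "the cat sleeps", "dog dog dog"]` and query `"dog"`, the third document scores highest and the second scores zero — there's no fuzziness here at all, which is exactly the gap the agentic query-rewriting argument is meant to fill.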
It's very dependent on the use case, imo.
Similarly, I used sqlite-vec, and was very happy with it. (If I were already using Postgres I'd have gone with that, but this was more of a CLI tool.)
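For anyone who wants the same shape without installing the extension, here's a stdlib-only sketch of what it's doing: embeddings stored as JSON in plain SQLite, scored by brute-force cosine similarity in Python. sqlite-vec replaces this full scan with an indexed virtual table; the 3-dimensional vectors here are made up for illustration.

```python
import json
import math
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")

# Toy 3-dim embeddings; a real pipeline would call an embedding model.
rows = [
    (1, "dogs and other canines", [0.9, 0.1, 0.0]),
    (2, "feline behaviour", [0.1, 0.9, 0.0]),
]
db.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [(i, t, json.dumps(v)) for i, t, v in rows],
)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(query_vec: list[float], k: int = 1) -> list[tuple[int, str]]:
    """Brute-force scan: fine for small corpora; this scan is what
    a proper vector index makes fast at scale."""
    scored = [
        (cosine(query_vec, json.loads(e)), i, t)
        for i, t, e in db.execute("SELECT id, text, embedding FROM docs")
    ]
    return [(i, t) for _, i, t in sorted(scored, reverse=True)[:k]]
```

For a small personal corpus the brute-force scan is often fast enough that the choice of vector store barely matters.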
If the author is here, did you try any of those models? How would you compare the ones you did use?
Even starting with "just" the documents and the vector DB locally is a huge first step, and much more doable than going with a local LLM at the same time. I don't know anyone, or any org, that has the resources to run their own LLM at scale.
So for sure any medium-sized company could afford to run their own LLMs, even at scale, if they want to make the investment. The question is how much they value their confidential data. (I would not trust any of the big AI companies.) And you don't usually need cutting-edge reasoning and coding abilities to process basic information.
I just put this example together today: https://gist.github.com/davidmezzetti/d2854ed82f2d0665ec7efd...