Show HN: SeekStorm – open-source sub-millisecond search in Rust
245 points
24 days ago
| 17 comments
| github.com
| HN
throwaway888abc
23 days ago
[-]
Impressive, bookmarked, upvoted.

Appreciate the demo: https://deephn.org/?q=apple+silicon

reply
0bit
23 days ago
[-]
Counter-example: https://deephn.org/?q=embeddings Contrast with https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

It is fast, but nowhere close to accurate or useful for this specific example. I could not find a way to force the plural form; neither quotes nor plus worked.

reply
d3Xt3r
22 days ago
[-]
Seems to be buggy. According to the SeekStorm GitHub, it's supposed to support boolean operators, right? But they don't seem to work.

Eg: https://deephn.org/?q=Linux+OR+KDE

reply
treefarmer
23 days ago
[-]
Is there distributed server support? I see it on the list of new features with (currently PoC) next to it, but is the code for the PoC available anywhere?

Also, would there be any potential issues if the index was mounted on shared storage between multiple instances?

reply
wolfgarbe
23 days ago
[-]
The code for the distributed search cluster is not yet stable enough to be published, but it will be released as open-source as well.

As for shared storage, do you mean something like NAS, or rather Amazon S3? Cloud-native support for object storage and separating storage and compute are on our roadmap. The challenges will be maintaining latency and the need for more sophisticated caching.

reply
jszymborski
22 days ago
[-]
S3 support would be absolutely killer.
reply
justmarc
23 days ago
[-]
I really like your approach. Impressed by your care for performance and your fast pace of adding what appears to be pretty complex stuff, while making sure it stays performant.

Keep it up!

Bookmarked.

reply
infamouscow
23 days ago
[-]
I'm not sure it's a good idea to use mmap for this.

https://db.cs.cmu.edu/mmap-cidr2022/

reply
wolfgarbe
23 days ago
[-]
In SeekStorm you can choose per index whether to use Mmap or let SeekStorm fully control RAM access. The latter has a slight query performance advantage, at the cost of a higher index load time. https://docs.rs/seekstorm/latest/seekstorm/index/enum.Access...
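The trade-off between the two access modes can be illustrated with a simplified loader. This is a toy stand-in, not the SeekStorm API: the RAM variant pays for reading the whole file up front, while the on-demand variant (roughly what the OS page cache does for a memory map) defers I/O until a region is actually touched.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

/// Toy stand-in for the two access modes (NOT the SeekStorm API):
/// `Ram` copies the whole index file up front; `Demand` seeks per read,
/// which approximates what the OS page cache does for a memory map.
enum Access {
    Ram(Vec<u8>),
    Demand(File),
}

impl Access {
    fn read_at(&mut self, offset: u64, len: usize) -> Vec<u8> {
        match self {
            // Already resident: a plain slice copy, no syscall.
            Access::Ram(buf) => buf[offset as usize..offset as usize + len].to_vec(),
            // Lazy: one seek + read per access, I/O cost paid at query time.
            Access::Demand(f) => {
                let mut out = vec![0u8; len];
                f.seek(SeekFrom::Start(offset)).unwrap();
                f.read_exact(&mut out).unwrap();
                out
            }
        }
    }
}

fn main() {
    let path = std::env::temp_dir().join("toy_index.bin");
    File::create(&path).unwrap().write_all(b"hello postings").unwrap();

    // Ram: slow to load, fast to read. Demand: instant "load", I/O per read.
    let mut ram = Access::Ram(std::fs::read(&path).unwrap());
    let mut lazy = Access::Demand(File::open(&path).unwrap());
    assert_eq!(ram.read_at(6, 8), lazy.read_at(6, 8));
    println!("{}", String::from_utf8(ram.read_at(6, 8)).unwrap());
}
```

Both variants return the same bytes; only where the load time is paid differs.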
reply
nextaccountic
19 days ago
[-]
Does seekstorm use io_uring? Could io_uring lower load time here?

Or at least lazy loading of index in RAM (emulating what mmap would do anyway)

reply
wolfgarbe
12 days ago
[-]
SeekStorm does not currently use io_uring, but it is on our roadmap. The challenge is cross-platform compatibility: Linux (io_uring) and Windows (IoRing) use different implementations, and other OSes don't support it at all. There is no abstraction layer over those implementations in Rust, so we are on our own.

It would increase concurrent read and write speed (index loading, searching) by removing the need to lock seek and read/write.

But I would expect that the mmap implementations do already use io_uring / IoRing.

Yes, lazy loading would be possible, but pure RAM access does not offer enough benefits to justify the effort to replicate much of the memory mapping.

reply
remram
23 days ago
[-]
What is the story for a multi-language corpus? Do I have to do my own stop-word pruning, tokenizing, lemmatizing, etc.? This is usually the case with full-text search solutions, and it is a pain.
reply
jazzyjackson
23 days ago
[-]
Re: stemming and lemmatization, I just want to plug the most impressive NLP stack I ever used, ChatScript. It's really for building dialog trees, where it walks down a branch of conversation using what are effectively switch statements, but with really rich conceptual pattern matching and capturing. Somewhere in the middle of the stack it does excellent abstraction from word input to general concepts (in WordNet), performing all the spell correction (according to your dictionary), stemming, lemmatization, and disambiguation.

I've had it in mind for a while to build a fuzzy search tool based on parsing each phrase into concepts, parsing the search query into concepts, and finding nearest match based on that. It's a C library and very fast.

https://github.com/ChatScript/ChatScript

Looks like it hasn't been committed to in some time, I'll have to check out their blog and see what's up. I guess with the advent of LLMs, dialog trees are passé.

reply
kreyenborgi
23 days ago
[-]
Their company home page, http://brilligunderstanding.com/ wow..
reply
wolfgarbe
23 days ago
[-]
We started by making the core search technology faster. Then we added a Unicode character folding/normalization tokenizer (diacritics, accents, umlauts, bold, italic, full-width chars...). Last week we added a tokenizer that supports Chinese word segmentation. Currently, we are working on a multi-language tokenizer that segments Chinese, Japanese, and Korean without switching the tokenizer.
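To make the folding/normalization idea concrete, here is a deliberately tiny sketch with a hardcoded mapping for a few accented and full-width characters. A real tokenizer (SeekStorm's included) works from full Unicode normalization tables, not a hand-written match.

```rust
/// Toy character folding: map a few accented and full-width characters to
/// lowercase ASCII. A real tokenizer uses full Unicode normalization tables.
fn fold(c: char) -> char {
    match c {
        'ä' | 'á' | 'à' | 'â' => 'a',
        'ö' | 'ó' | 'ò' | 'ô' => 'o',
        'ü' | 'ú' | 'ù' | 'û' => 'u',
        // Full-width Ａ-Ｚ (U+FF21..U+FF3A) down to ASCII lowercase.
        'Ａ'..='Ｚ' => char::from_u32(c as u32 - 0xFF21 + 'a' as u32).unwrap(),
        _ => c.to_ascii_lowercase(),
    }
}

/// Split on non-alphanumeric characters, then fold each token's characters.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.chars().map(fold).collect())
        .collect()
}

fn main() {
    // "Ｒust" starts with a full-width R; "Müller" carries an umlaut.
    println!("{:?}", tokenize("Müller Ｒust"));
}
```

This is also where the per-language configurability discussed below matters: folding ä to a is wrong for Finnish, so the mapping would need to be switchable per index or per language.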
reply
ronjakoi
23 days ago
[-]
I hope the folding and normalization is configurable by language. I really hate it when some search decides that a and ä are the same letter. In Finnish they really aren't; "saari" is an island, "sääri" is the lower leg or shin.
reply
wolfgarbe
23 days ago
[-]
Currently, you can choose between tokenizers with or without folding. But configurability per language or full customizability of the folding logic by the user is a good idea.
reply
tlofreso
23 days ago
[-]
Demo = impressed.

How's SeekStorm's prowess in mid-cap enterprise? How hairy is the ingest pipeline for sources like: decade old sharepoint sites, PDFs with partial text layers, excel, email.msg files, etc...

reply
wolfgarbe
23 days ago
[-]
Yes, integration into complex legacy systems is always challenging. As a small startup, we are concentrating on core search technology to make search faster and to make the most of available server infrastructure. As SeekStorm is open-source, system integrators can take it from there.
reply
fiedzia
23 days ago
[-]
Same as any other full-text search solution - it's your job to integrate it.
reply
m348e912
23 days ago
[-]
>Demo = impressed.

How did you demo? Did you spin up your own instance and index the wikipedia corpus like the docs suggest? I'd like to just give it a whirl on an already running instance.

Never mind, found that someone posted a link already.

reply
jazzyjackson
23 days ago
[-]
On that topic, can anybody chime in on state of the art PDF OCR? Even if that's a multimodal LLM, I've used ChatGPT to extract tabular data from images but need something I can self host for proprietary data.
reply
CharlieDigital
23 days ago
[-]
Azure Document Intelligence (especially with the layout model[0]) is really good. It has both JSON and MD output modes and does a pretty solid job identifying headers, sections, tables, etc.

What's interesting is that they have a self-deployable container model[1] that only phones home for billing so you can self-host the runtime and model.

[0] https://learn.microsoft.com/en-us/azure/ai-services/document...

[1] https://learn.microsoft.com/en-us/azure/ai-services/document...

reply
jazzyjackson
23 days ago
[-]
Peculiar, Thanks!
reply
faizshah
23 days ago
[-]
This is really impressive. I would suggest benchmarking it against Vespa as well; I have gotten better perf results from Vespa than from Lucene/Solr/ES.

I'll give it a try this weekend as well.

reply
Leoko
24 days ago
[-]
Sub-millisecond latency sounds impressive, but isn't network latency going to overshadow these gains in most real-world scenarios?
reply
pornel
23 days ago
[-]
When search is cheap and quick, it's possible to improve search by postprocessing search results and running more queries when necessary.

I use Tantivy, and add refinements like: if the top result is objectively a low-quality one, it's usually a query with a typo finding a document with the same typo, so I run the query again with fuzzy spelling. If all the top results have the same tag (that isn't in the query), then I mix in results from another search with the most common tag excluded. If the query is a word that has multiple meanings, I can ensure that each meaning is represented in the top results.
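The typo-fallback refinement described above can be sketched generically. This is not the Tantivy (or SeekStorm) API; the backend is abstracted as a closure, and the score threshold is a made-up illustrative value.

```rust
/// A hit is a (document id, score) pair.
type Hit = (String, f32);

/// Run an exact query first; if the best hit scores below `min_score`,
/// assume a typo (in the query or in the matching document) and retry
/// with fuzzy spelling enabled. `search(query, fuzzy)` is the backend.
fn search_with_fallback<F>(search: F, query: &str, min_score: f32) -> Vec<Hit>
where
    F: Fn(&str, bool) -> Vec<Hit>,
{
    let exact = search(query, false);
    let good = exact.first().map_or(false, |h| h.1 >= min_score);
    if good { exact } else { search(query, true) }
}

fn main() {
    // Toy backend: the exact query only matches a typo document poorly,
    // while the fuzzy retry finds a well-matching one.
    let backend = |_q: &str, fuzzy: bool| -> Vec<Hit> {
        if fuzzy {
            vec![("doc-good".to_string(), 0.9)]
        } else {
            vec![("doc-typo".to_string(), 0.1)]
        }
    };
    let hits = search_with_fallback(backend, "embeddings", 0.5);
    println!("{}", hits[0].0); // the fuzzy retry wins
}
```

The same shape works for the other refinements mentioned (tag-exclusion rerun, per-meaning quotas): run a cheap extra query, then merge or replace.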

reply
wolfgarbe
24 days ago
[-]
It depends on the application.

When using SeekStorm as a server, keeping the latency per query low increases the throughput and the number of parallel queries a server can handle on given hardware. An efficient search server can reduce the required investment in server hardware.

In other cases, only the local search performance matters, e.g., for data mining or RAG.

Also, it's not only about averages but also about tail latencies. While network latencies dominate the average search time, that is not the case for tail latencies, which in turn heavily influence user satisfaction and revenue in online shopping.
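The mean-versus-tail distinction is easy to demonstrate numerically. A minimal sketch with made-up latency samples, using nearest-rank percentiles:

```rust
/// Nearest-rank percentile on a sorted copy of the samples.
fn percentile(samples: &[f64], p: f64) -> f64 {
    let mut s = samples.to_vec();
    s.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * s.len() as f64).ceil() as usize;
    s[rank.saturating_sub(1).min(s.len() - 1)]
}

fn main() {
    // 95 fast queries at 1 ms, 5 slow outliers at 200 ms:
    let mut latencies = vec![1.0_f64; 95];
    latencies.extend([200.0; 5]);

    let mean = latencies.iter().sum::<f64>() / latencies.len() as f64;
    // The mean looks harmless (~11 ms) while 1 in 100 users waits 200 ms.
    println!("mean = {:.2} ms, p99 = {:.2} ms", mean, percentile(&latencies, 99.0));
}
```

A modest network latency added on top shifts both numbers equally, so it dominates the mean but not the p99, which is the point made above.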

reply
intelVISA
23 days ago
[-]
A typical server is serving more than one request at a time, hopefully.
reply
llIIllIIllIIl
23 days ago
[-]
How is it different from Meilisearch[1]? I’m running search for my small multi tenant SaaS and self hosted Meilisearch gives me grief like any relatively new tech, so I’m shopping for new solutions.

1: https://www.meilisearch.com/

reply
J_Shelby_J
22 days ago
[-]
Well, off the bat, this can be embedded directly into your Rust project without the need for a standalone server.
reply
wiradikusuma
23 days ago
[-]
Could you share more about your experience with Meilisearch?
reply
llIIllIIllIIl
23 days ago
[-]
Tl;dr: 4/5 stars for hobbit software SaaS.

—————

Full version: I run it on a dedicated 2 vCPU / 2 GB machine on DigitalOcean. Every tenant has an index, and I have about 30k searches per week across all tenants. Each tenant has from 1 to 150k documents in their index.

Sentry catches a MeilisearchTimeoutException a couple of times every day, with the message that Meilisearch could not finish adding a document to the index. I don't care too much about that, because a background worker is responsible for updating the index, so the task gets rescheduled. I like to keep my Sentry clean, so it's more an inconvenience than an issue.

Meilisearch setup is very straightforward: they provide client libraries for almost all languages (maybe even esoteric and marginal ones, idk; I only need Python), have pretty decent documentation covering the basics, and don't really require operations at my scale. I really liked the feature of issuing limited-access tokens to set a precondition; that's how I limit searches for a particular user on a tenant to see only their data.

reply
dantodor
23 days ago
[-]
Interesting approach, would love to see a comparison with Typesense
reply
ghita_
23 days ago
[-]
Very impressive results. I'm curious how you benchmarked against bm25 in terms of accuracy? I couldn't find metrics around that, just one search example. I think there are use cases where latency is king, but when it comes to vector search / hybrid search accuracy is probably more important.
reply
wolfgarbe
22 days ago
[-]
For the latency benchmarks we used vanilla BM25 (SimilarityType::Bm25f for a single field) for comparability, so there are no differences in terms of accuracy.

For SimilarityType::Bm25fProximity which takes into account the proximity between query term matches within the document, we have so far only anecdotal evidence that it returns significantly more relevant results for many queries.

Systematic relevancy benchmarks like BeIR, MS MARCO are planned.
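For reference, the vanilla BM25 term score used for the comparability baseline can be sketched as follows. This is a generic textbook formulation with the usual default parameters (k1 = 1.2, b = 0.75), not SeekStorm's actual implementation:

```rust
/// Vanilla BM25 score contribution of one query term for one document.
/// tf: term frequency in the document; doc_len / avg_doc_len: length
/// normalization; n_docs / df: corpus size and document frequency (for IDF).
fn bm25_term(tf: f64, doc_len: f64, avg_doc_len: f64, n_docs: f64, df: f64) -> f64 {
    let k1 = 1.2; // term-frequency saturation
    let b = 0.75; // strength of length normalization
    let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
    idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
}

fn main() {
    // A rare term (df = 2 of 1000 docs) in an average-length document:
    let score = bm25_term(3.0, 100.0, 100.0, 1000.0, 2.0);
    println!("{:.3}", score);
}
```

The proximity-aware variant would additionally boost documents where the query terms occur close together; vanilla BM25 is blind to term positions, which is why the two can rank results differently.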

reply
ghita_
15 days ago
[-]
Got it. I think the anecdotal evidence is what intrigued me a little bit; looking forward to seeing the systematic relevancy benchmarks.
reply
littlestymaar
23 days ago
[-]
I don't know how fair the benchmark is, but beating Tantivy by that margin is impressive to say the least.

Any plan to make it run on WASM? I wanted to add this feature to Tantivy a few years ago but they weren't interested, and I had to fall back to a JavaScript search engine that was much slower.

reply
fulmicoton
23 days ago
[-]
Developer of tantivy chiming in! (I hope that's ok) Database performance is a space where there are a lot of lies and bullshit, so you are 100% right to be suspicious.

I don't know SeekStorm's team and I did not dig much into the details, but my impression so far is that their benchmark's results are fair. At least I see no reason not to trust them.

reply
PSeitz
23 days ago
[-]
Also we are working on some performance improvements based on the benchmark comparison, as they highlighted some areas we can improve in tantivy.
reply
wolfgarbe
23 days ago
[-]
The benchmark should be fairly fair, as it was developed by Tantivy themselves (and Jason Wolfe). So, the choice of corpus and queries was theirs. But, of course, your mileage may vary. It is always best to benchmark it on your machine with your data and your queries.

Yes, WASM and Python bindings are on our roadmap.

reply
Thaxll
23 days ago
[-]
It feels like everyone re-implements the same application; searching text in language x.y.z has been done a million times, and search speed is not a problem. So what differentiates this solution from the dozen+ mature ones?

The speed looks great but isn't everything else already fast enough?

reply
wolfgarbe
23 days ago
[-]
It's not just about speed. Speed reflects efficiency. Efficiency is needed to serve more queries in parallel and to search within exponentially growing data, with less expensive hardware, fewer servers, and less energy consumed. Therefore, the pursuit of efficiency never gets outdated and has no limit.
reply
sdesol
23 days ago
[-]
In addition to what you said, faster searches can also enable different search options. For example, if you can execute five similar searches in the time it would take to execute one, you now have the option to ask, "Can I leverage five similar searches to produce better results?" If the answer is yes, you can provide better answers and still keep the same user experience.

Where I really think faster searches will come into play is with AI. There is nothing energy-efficient about how LLMs work, and I really think enterprises will focus on using LLMs to generate as many Q&A pairs as possible during off-peak energy hours, combined with a hybrid search that can bridge semantic (vector) and text search. I think for enterprises the risk of hallucinations (even with RAG) will be too great, and they will fall back to traditional search, but with a better user experience.

Based on the README, it looks like vector search is not supported or planned, but it would be interesting to see if SeekStorm can do this more efficiently than Lucene/OpenSearch and others. I only dabbled in the search space, so I don't know how complex this would be, but I think SeekStorm can become a killer search solution if it can support both.

Edit: My bad, it looks like vector search is PoC.
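The "five similar searches in parallel" idea sketches out naturally with threads. This is illustrative only (the backend and fusion rule are made up): fan out query variants, then merge hit lists keeping the best score per document.

```rust
use std::collections::HashMap;
use std::thread;

/// Run several query variants in parallel threads and merge their hit
/// lists, keeping the best score per document (one possible fusion rule).
fn fan_out(
    variants: Vec<String>,
    search: fn(&str) -> Vec<(String, f32)>,
) -> Vec<(String, f32)> {
    let handles: Vec<_> = variants
        .into_iter()
        .map(|q| thread::spawn(move || search(&q)))
        .collect();

    let mut best: HashMap<String, f32> = HashMap::new();
    for h in handles {
        for (doc, score) in h.join().unwrap() {
            let e = best.entry(doc).or_insert(f32::MIN);
            if score > *e {
                *e = score;
            }
        }
    }
    let mut merged: Vec<_> = best.into_iter().collect();
    merged.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    merged
}

/// Toy backend: one hit per query, scored by query length.
fn toy_search(q: &str) -> Vec<(String, f32)> {
    vec![(format!("doc-for-{q}"), q.len() as f32)]
}

fn main() {
    let merged = fan_out(vec!["rust".into(), "rustlang".into()], toy_search);
    println!("{:?}", merged);
}
```

With a sub-millisecond backend, the wall-clock cost of this fan-out is roughly one query, which is exactly the "same user experience, better results" trade described above.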

reply
jamil7
23 days ago
[-]
Software is currently extremely inefficient, driven by years of increasingly powerful cheap hardware. Once that starts to slow it makes sense that we start squeezing efficiency out of software again. We’ve also seen in the last 20 years the rise of languages that make writing performant, higher-level software a lot easier.

We’re also at a point where cloud compute is consuming a significant amount of energy globally.

reply
emmanueloga_
23 days ago
[-]
The documentation seems a bit sparse. Also, I couldn't find binaries so I'm guessing building from source is required at the moment?

I'm curious about the binary size of it all. Could this be compiled with WASM and run on static pages?

reply
wolfgarbe
16 days ago
[-]
>> The documentation seems a bit sparse.

We just released new OpenAPI-based documentation for the SeekStorm server REST API: https://seekstorm.apidocumentation.com

For the library we have the standard Rust docs: https://docs.rs/seekstorm/latest/seekstorm/

reply
wolfgarbe
23 days ago
[-]
The SeekStorm library is 9 MB, and the SeekStorm server executable is 8 MB, depending on the features selected in Cargo.

You add the library to your project via 'cargo add seekstorm'; you have to compile your project anyway.

As for the server, we may add binaries for the main OS in the future.

WASM and Python bindings are on our roadmap.

reply
bosky101
21 days ago
[-]
Why isn't there an HTTP interface? That would increase adoption by far.
reply
wolfgarbe
16 days ago
[-]
SeekStorm comes with an http interface.

The SeekStorm server features a REST API via HTTP: https://seekstorm.apidocumentation.com

It also comes with an embedded Web UI: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#bu...

Or did you mean a Web based interface to create and manage indices, define index schemas, add documents etc?

reply
athompsondog
23 days ago
[-]
I wonder how burnt sushi feels about this
reply
distracted_boy
23 days ago
[-]
How does this compare to PostgreSQL?
reply
wolfgarbe
23 days ago
[-]
PostgreSQL is an SQL database that also offers full-text search (FTS); with extensions like pg_search it also supports BM25 scoring, which is essential for lexical search. SeekStorm is centered on full-text search only; it doesn't offer SQL.

Performance-wise it would indeed be interesting to run a benchmark. The third-party open-source benchmark we are currently using (search_benchmark_game) does not yet support PostgreSQL. So yes, that comparison is still pending.

reply
yellow_lead
23 days ago
[-]
When I tried to use FTS in Postgres, I got terrible performance, but maybe I was doing something wrong. I'm using Meili now.
reply
anonzzzies
23 days ago
[-]
Same here, this would easily beat it as far as I have seen, but maybe I did something wrong.
reply
philippemnoel
23 days ago
[-]
ParadeDB (paradedb.com) is similar to this Show HN, but baked into Postgres to solve the very problem you are describing.
reply