VectorChord: Store 400k Vectors for $1 in PostgreSQL
156 points | 21 days ago | 12 comments | blog.pgvecto.rs
dmezzetti
20 days ago
This looks like an interesting project.

Though it's worth noting that the license is AGPL. So if the idea is for this to take over from pgvecto.rs, that's an important data point for those building SaaS products.

That would make pgvector the only permissively licensed option, given that it uses the same license as Postgres.

gaocegege
15 days ago
Hi, we've now adopted a dual-license model (AGPLv3 and ELv2).
whakim
20 days ago
Could you talk about how updates are handled? My understanding is that IVF can struggle if you're doing a lot of inserts/updates after index creation, since the data needs to be incrementally re-clustered (or the entire index rebuilt) to ensure the clusters continue to reflect the shape of your data.
VoVAllen
20 days ago
We don't perform any reclustering; as you said, users would need to rebuild the index if they want to recluster. However, based on our observations, speed remains acceptable even with significant data growth. We ran a simple experiment on the GIST dataset with nlist=1, and top-10 retrieval took less than twice as long as with nlist=4096. This is because only the quantized vectors (at 32x compression) need to be inserted into the posting lists, and only the quantized-vector distances require extra computation, which accounts for a small share of query time. Most of the time is spent re-ranking with full-precision vectors. Say the breakdown is roughly 20% quantized-vector computation and 80% full-precision computation: even if the quantized-vector time triples, overall query time increases by only about 40%.
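A quick back-of-the-envelope check of that claim (the 20/80 split is the assumption from above):

    # hypothetical 20/80 split between quantized scanning and full-precision re-ranking
    quantized, rerank = 0.2, 0.8
    slowdown = 3  # quantized scan takes 3x longer after heavy data growth
    print(quantized * slowdown + rerank)  # 1.4, i.e. ~40% longer queries overall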

If the data distribution shifts, the optimal solution is to rebuild the index. We believe HNSW also struggles with distribution shift to some extent. Without a rebuild, though, our observations suggest users are more likely to see slightly longer query times than a significant loss in recall.

marcyb5st
21 days ago
Awesome work! But aren't the comparisons missing ScaNN [1, 2]? I think it's the overall SOTA [3] for vector indexing at the moment.

[1] https://github.com/google-research/google-research/tree/mast...

[2] Also available on something like AlloyDB on GCP: https://cloud.google.com/alloydb/docs/ai/store-index-query-v...

[3] https://ann-benchmarks.com/glove-100-angular_10_angular.html

Disclaimer: I work for Google, but nowhere near databases.

VoVAllen
21 days ago
I'm the project lead for VectorChord. I tested ScaNN on AlloyDB Omni but struggled to achieve reasonable recall on the GIST 1M dataset, with results peaking at only around 0.8. The limited documentation makes it hard to understand the underlying causes of this performance.

Additionally, I couldn't find any performance benchmarks for ScaNN integrated with PostgreSQL, whether compared against pgvector or against standalone ScaNN. The publicly available metrics focus exclusively on query-only indexing outside the database.

On our side, we've implemented the fastscan kernel for bit-vector scanning, which is considered one of ScaNN's key advantages.
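For a sense of why bit-vector scanning is cheap, here is a minimal sketch of the idea (the actual fastscan kernel uses SIMD lookup tables and is far more involved):

    import numpy as np

    # Binarize by sign and pack to bits: 768 floats -> 96 bytes (32x smaller)
    def pack_signs(x):
        return np.packbits(x > 0, axis=-1)

    docs = pack_signs(np.random.randn(100_000, 768))
    query = pack_signs(np.random.randn(768))

    # Hamming distance via XOR + popcount; the best candidates then go
    # to a full-precision re-ranking pass
    dists = np.unpackbits(docs ^ query, axis=-1).sum(axis=1)
    candidates = np.argsort(dists)[:100]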

marcyb5st
21 days ago
Thanks for the explanation!

Really appreciate it and it makes perfect sense.

7qW24A
21 days ago
The “external index build” idea seems pretty interesting. How does it work with updates to the underlying data (e.g., new embeddings being added)? For that matter, I guess, how do incremental updates to pgvector’s HNSW indexes work?
VoVAllen
21 days ago
IVF indexing can be split into two phases: computing the centroids (KMeans) and assigning each point to a centroid to form the inverted lists. The most time-consuming part is the KMeans stage, which can be greatly accelerated with a GPU: 1M 960-dim vectors can be clustered in under 10s. We run the KMeans phase externally and the assignment phase inside Postgres. The KMeans result depends only on the data distribution, not on any specific rows, so we can sample the data, and inserting or deleting vectors won't affect the result significantly. For an update, it's just a matter of assigning the new vector to a specific cluster and appending it to the corresponding list, which is very light compared to an HNSW insert.
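A minimal sketch of those two phases, with scikit-learn standing in for the external GPU KMeans (names and sizes here are illustrative, not VectorChord's actual code):

    import numpy as np
    from sklearn.cluster import KMeans

    vectors = np.random.randn(1_000_000, 960).astype(np.float32)

    # Phase 1 (external): fit centroids on a sample; only the
    # distribution matters, so a subset is enough
    sample = vectors[np.random.choice(len(vectors), 100_000, replace=False)]
    centroids = KMeans(n_clusters=4096, n_init=1).fit(sample)

    # Phase 2 (in the database): assign every vector to its nearest
    # centroid to build the inverted lists
    lists = centroids.predict(vectors)

    # A later insert is just one more nearest-centroid lookup + append
    new_vec = np.random.randn(1, 960).astype(np.float32)
    target_list = centroids.predict(new_vec)[0]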
estebarb
20 days ago
A comparison with pgvectorscale, which uses binary quantization and StreamingDiskANN, would have been nice.
gaocegege
20 days ago
We will look into this. We've tried it before, but it took too long to build the index.
_mmarshall
20 days ago
The cost to store a static set of 400k 768-dimension vectors is also $1 a month on DataStax's AstraDB. However, for that $1, AstraDB replicates the data 3x instead of storing it on a single machine.

Here is a link to the cost calculator. Note that the calculator includes the cost of ingestion, whereas the article only mentions storage costs, not ingestion costs: https://www.datastax.com/pricing/vector-search?cloudProvider...

Disclaimer: I work on vectorsearch/AstraDB at DataStax.

VoVAllen
20 days ago
It's hard to compare costs under the serverless pricing model, as writes and reads are billed separately. Going by the pricing page, DataStax charges $4,000 to write 100M 768-dim vectors, and 10M queries cost $300, which works out to only about 4 QPS over a month. For comparison, VectorChord can achieve 100 QPS on a $250 instance.
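A quick sanity check on that QPS figure, assuming the 10M queries are spread over a 30-day month:

    # 10M queries over a 30-day month
    print(10_000_000 / (30 * 24 * 3600))  # ~3.9 queries per second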
xieydd
20 days ago
The pricing estimator is fake, the bill is real : ).
tarasglek
21 days ago
I am still waiting for a good pattern for using multi-vector embeddings like ColBERT and ColPali in Postgres. I get that it's fun to optimize single-vector stuff, but multi-vector is that happy middle ground between a single vector and a reranker, and it seems to be validated only in specialized, exotic search DBs like Vespa.
jonathan-adly
20 days ago
I'd like to throw our project into the ring as one that solves this problem: https://github.com/tjmlabs/ColiVara

1. Uses half-vecs, so you cut everything down by half with no recall loss.

2. Uses token pooling with hierarchical clustering at a factor of 3, so you further cut things down by 2/3 with <1% loss (see the sketch below).

3. Everything is on Postgres and pgvector, so you can do all the usual Postgres things and shrink the corpus with document-metadata filtering.

4. We have a 5,000+ page corpus in production with <3 seconds latency.

5. We benchmark against the Vidore leaderboard and are very near SOTA.
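For the curious, a minimal sketch of what hierarchical token pooling does; this illustrates the technique, not ColiVara's actual code:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def pool_tokens(embs, pool_factor=3):
        # cluster a document's token embeddings, then mean-pool each
        # cluster, shrinking the multi-vector by ~pool_factor
        n_clusters = max(1, len(embs) // pool_factor)
        labels = fcluster(linkage(embs, method="ward"),
                          t=n_clusters, criterion="maxclust")
        pooled = np.stack([embs[labels == c].mean(axis=0)
                           for c in np.unique(labels)])
        # re-normalize so MaxSim scoring can keep using cosine similarity
        return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

    doc_tokens = np.random.randn(300, 128).astype(np.float32)  # e.g. ColBERT token vectors
    print(pool_tokens(doc_tokens).shape)  # (~100, 128)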

You can read about half-vecs here: https://jkatz05.com/post/postgres/pgvector-scalar-binary-qua...

Hierarchical token pooling: https://www.answer.ai/posts/colbert-pooling.html

And how we implemented them here: https://blog.colivara.com/

__jl__
20 days ago
I really like the idea of ColPali and products building on it, but I am still unsure which applications it makes the most sense for. We mostly deal with reports that are 80-90% text and 10-20% figures and tables. Does a vision-first approach make sense in this context? My sense is that text-based embeddings are better in mostly-text contexts. Layout, for example, is pretty much irrelevant to us but plays into vision-based approaches. What is your sense of this?
jonathan-adly
20 days ago
So - the synthetic QA datasets within Vidore are exactly like that: 90% text, 10% charts/tables. OCR + BM25 is at ~90% NDCG@5, which is pretty decent. ColPali/ours is at ~98%.

It is a small upgrade, but one nonetheless. The complexity and cost of multi-vectors *might* not make this worth it; it really depends on how accuracy-critical the task is.

For example, one of our customers runs this over FDA monographs, which are 95%+ text and maybe 5% tables. For them the misses were extremely painful, even though there weren't many of them in a text-based pipeline. So the migration made sense for them.

tarasglek
19 days ago
Wow, I didn't expect an answer with a product and source. This looks really good, gonna give you guys a try.
VoVAllen
21 days ago
There's no easy way that I know of to index ColBERT multi-vectors scalably. Vespa seems to rely heavily on binary quantization, which can cost a lot in recall. And for most cases, using ColBERT as a reranker is good enough, as in the pgvector example you posted.
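A minimal sketch of ColBERT-style MaxSim reranking over a candidate set, assuming unit-normalized token embeddings:

    import numpy as np

    def maxsim(query_tokens, doc_tokens):
        # late interaction: each query token takes its best-matching
        # doc token; the per-token maxima are summed into one score
        sims = query_tokens @ doc_tokens.T  # (n_query, n_doc) cosine sims
        return float(sims.max(axis=1).sum())

    def rerank(query_tokens, candidates):
        # candidates: list of (doc_id, doc_token_matrix) pairs from a
        # cheap single-vector first stage
        return sorted(candidates,
                      key=lambda c: maxsim(query_tokens, c[1]),
                      reverse=True)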
tarasglek
21 days ago
Seems like doing a proper relational 1:N chunk:multiple-vectors foreign key, binarization, and a clever join or multistage CTE would get us pretty close to useful.

I am OK with it being less efficient, as the dev UX will be amazing. Vespa ops (even in their cloud) are a complete nightmare compared to Postgres.

tarasglek
21 days ago
Looks like there is recent work on a pgvector example for this: https://github.com/pgvector/pgvector-python/blob/master/exam...
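The shape of that example is roughly a multistage CTE like the sketch below; the doc_embeddings table and column names are made up here, so see the linked file for the real code:

    # MaxSim in SQL: best match per (document, query token), summed per document
    sql = """
    WITH query_tokens AS (
        SELECT row_number() OVER () AS qno, qv
        FROM unnest(%(q)s::vector[]) AS qv
    ),
    best AS (
        SELECT d.document_id, q.qno,
               MAX(1 - (d.embedding <=> q.qv)) AS sim
        FROM doc_embeddings d CROSS JOIN query_tokens q
        GROUP BY d.document_id, q.qno
    )
    SELECT document_id, SUM(sim) AS score
    FROM best
    GROUP BY document_id
    ORDER BY score DESC
    LIMIT 10
    """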
gaocegege
21 days ago
Hey everyone! We've developed a new PostgreSQL extension that can store 400k vectors for just $1. Check it out!
rkuzsma
20 days ago
Would you be willing to speculate on how VectorChord's ingestion and query performance might compare to Elasticsearch/OpenSearch for dense vector and sparse vector search use cases, particularly when dealing with larger full text data sets (>5M records)?
VoVAllen
20 days ago
In the LAION-5M benchmark, we compared our performance against Elasticsearch and OpenSearch. Comparing ingestion performance is harder due to differences in architecture. Both Elasticsearch and OpenSearch, like most vector databases, use the concept of shards: each shard is a separate vector index, and queries aggregate results across shards. Larger shards lead to faster queries but come with higher resource requirements and slower updates.

It's also worth noting that Elasticsearch has implemented RaBitQ support for HNSW, so it's difficult to compare without running actual benchmarks. However, Elasticsearch typically requires at least double, if not triple, the memory size of the vector dataset to remain stable. In contrast, PostgreSQL can run stably with far fewer resources; for example, 32GB of memory is sufficient to manage 100 million vectors efficiently.

From my perspective, it would be faster at query time than Elasticsearch due to the extensive optimizations, and much, much faster for updates (inserts and deletes) because it uses IVF instead of HNSW.

curl-up
21 days ago
Does this mean you won't support pgvecto.rs anymore?
VoVAllen
21 days ago
I'm the project lead for both projects. We're still in the process of supporting all the functionality from pgvecto.rs in VectorChord (int8, vectors with more than 2000 dimensions, etc.). We'll provide migration docs for pgvecto.rs users. Users will have a better experience with VectorChord thanks to tighter integration with the Postgres storage system. We will stop supporting pgvecto.rs early next year, once everything in VectorChord is ready.
nextworddev
21 days ago
What dimension vectors are we talking about here?
jesperwe
21 days ago
The first paragraph of TFA mentions "768-dimensional vectors".
jasonkester
21 days ago
In five pages of text, we never get to learn what a Vector is (in this context), why we’d want to store one in pgsql, or why it costs so much to store them compared to anything else you’d store there.

For an example of how you can communicate with domain experts while still giving everyone else some clue as to what the hell you're talking about, check out the link to the product that this thing claims to be a successor to:

https://pgvecto.rs/

That starts off by telling us what it is and what it does.

lja
21 days ago
That's because this product isn't for you then. My team has been evaluating vector databases for years and everything on the VectorChord page resonated with me. We run one of the world's largest vector databases and we'll likely benchmark vectorchord to see if it lives up to its promises here.
gaocegege
21 days ago
Hi, we’re here to help if you need assistance (via GitHub issue, Discord, or email). Could you let us know the scale of your vectors—are they 1B or 10B?
lja
21 days ago
10B, feel free to email me luke@ the domain in my profile.
thelittleone
20 days ago
Do you have a blog, perchance? Or any recommended reading on preprocessing / data-chunking strategies to improve results?
VoVAllen
20 days ago
I recently came across a project that looks promising: WordLlama (https://github.com/dleemiller/WordLlama?tab=readme-ov-file#s...). It appears to be well-suited for semantic chunking, though I haven't had a chance to try it out yet.
redskyluan
20 days ago
Maybe check this: https://zilliz.com/pricing

You can easily store 1B vectors in Zilliz Serverless, and the cost is incredibly cheap.

xieydd
20 days ago
You can't compare serverless with a dedicated instance; that's unfair. A large number of writes and queries will eat up your wallet.
redskyluan
20 days ago
That's true. We now offer a new dedicated cost-optimized CU priced at $0.318/hour ($228.96/month), which means you can store 20M 768-dimension vectors for about $229 per month.

lja
20 days ago
Zilliz gave us a near-six-figure monthly quote for our database.
Raed667
21 days ago
And no hosting provider starts by teaching you what HTML is.
curl-up
21 days ago
So they should start every one of their posts, on that same site, with a summary of what is available on the homepage?
jasonkester
20 days ago
Wow. It never occurred to me that this might be anything but the landing page of a product.

The title here, the presentation on the page itself: everything screams "landing page". I had to go back on a desktop browser to see the word "blog" in the URL bar, and mentally shift those graphics and little islands of text around until I could view it from that lens. If it's really just a sub-product of the main product that they're talking about, then yeah, it makes more sense in that context.

But my answer to your question would still be "Yes". Absolutely. If you're a product, the job of your blog is to convince people coming off the street that they need your thing, even if they didn't realize it yet.

Step one of that process is to not bounce them back to the street without any idea what they're looking at.
