VectorChord: Store 400k Vectors for $1 in PostgreSQL
156 points | 21 days ago | 12 comments | blog.pgvecto.rs
dmezzetti
20 days ago
This looks like an interesting project.

Though it's worth noting that the license is AGPL. So if the idea is for this to take over from pgvecto.rs, that's an important data point for those building SaaS products.

That would make pgvector the only permissively licensed option, given that it uses the same license as Postgres.

gaocegege
15 days ago
Hi, we've now adopted a dual-license model (AGPLv3 and ELv2).
whakim
20 days ago
Could you talk about how updates are handled? My understanding is that IVF can struggle if you're doing a lot of inserts/updates after index creation, since the data needs to be incrementally re-clustered (or the entire index rebuilt) to ensure the clusters continue to reflect the shape of your data.
VoVAllen
20 days ago
We don't perform any reclustering; as you said, users would need to rebuild the index if they want to recluster. However, based on our observations, speed remains acceptable even with significant data growth. We ran a simple experiment on the GIST dataset with nlist=1, and top-10 retrieval took less than twice as long as with nlist=4096. This is because only the quantized vectors (at 32x compression) need to be inserted into the posting lists, and only the quantized-vector distances require extra computation, which accounts for a small share of query time. Most of the time is spent re-ranking with full-precision vectors. Say the breakdown is roughly 20% quantized-vector computation and 80% full-precision computation: even if the quantized-vector time triples, overall query time increases by only about 40%.
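A quick back-of-the-envelope check of that claim (the 20/80 split is the assumption from above):

    # hypothetical 20/80 split between quantized scanning and full-precision re-ranking
    quantized, rerank = 0.2, 0.8
    slowdown = 3  # quantized scan takes 3x longer after heavy data growth
    print(quantized * slowdown + rerank)  # 1.4, i.e. ~40% longer queries overall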

If the data distribution shifts, the optimal solution is to rebuild the index. We believe HNSW also struggles with distribution shift to some extent. Without a rebuild, though, our observations suggest users are more likely to see slightly longer query times than a significant loss in recall.

marcyb5st
21 days ago
Awesome work! But aren't the comparisons missing ScaNN [1, 2]? I think it's the overall SOTA [3] for vector indexing at the moment.

[1] https://github.com/google-research/google-research/tree/mast...

[2] Also available on something like AlloyDB on GCP: https://cloud.google.com/alloydb/docs/ai/store-index-query-v...

[3] https://ann-benchmarks.com/glove-100-angular_10_angular.html

Disclaimer: I work for Google, but nowhere near databases.

VoVAllen
21 days ago
I'm the project lead for VectorChord. I tested ScaNN on AlloyDB Omni but struggled to achieve reasonable recall on the GIST 1M dataset, with results peaking at only around 0.8. The limited documentation makes it hard to understand the underlying causes of this performance.

Additionally, I couldn't find any performance benchmarks for ScaNN integrated with PostgreSQL, whether compared against pgvector or against standalone ScaNN. The publicly available metrics focus exclusively on query-only indexing outside the database.

On our side, we've implemented the fastscan kernel for bit-vector scanning, which is considered one of ScaNN's key advantages.
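For a sense of why bit-vector scanning is cheap, here is a minimal sketch of the idea (the actual fastscan kernel uses SIMD lookup tables and is far more involved):

    import numpy as np

    # Binarize by sign and pack to bits: 768 floats -> 96 bytes (32x smaller)
    def pack_signs(x):
        return np.packbits(x > 0, axis=-1)

    docs = pack_signs(np.random.randn(100_000, 768))
    query = pack_signs(np.random.randn(768))

    # Hamming distance via XOR + popcount; the best candidates then go
    # to a full-precision re-ranking pass
    dists = np.unpackbits(docs ^ query, axis=-1).sum(axis=1)
    candidates = np.argsort(dists)[:100]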

marcyb5st
21 days ago
Thanks for the explanation!

Really appreciate it and it makes perfect sense.

7qW24A
21 days ago
The “external index build” idea seems pretty interesting. How does it work with updates to the underlying data (e.g., new embeddings being added)? For that matter, I guess, how do incremental updates to pgvector’s HNSW indexes work?
VoVAllen
21 days ago
IVF indexing can be split into two phases: computing the centroids (KMeans) and assigning each point to a centroid to form the inverted lists. The most time-consuming part is the KMeans stage, which can be greatly accelerated with a GPU: 1M 960-dim vectors can be clustered in under 10s. We run the KMeans phase externally and the assignment phase inside Postgres. The KMeans result depends only on the data distribution, not on any specific rows, so we can sample the data, and inserting or deleting vectors won't affect the result significantly. For an update, it's just a matter of assigning the new vector to a specific cluster and appending it to the corresponding list, which is very light compared to an HNSW insert.
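A minimal sketch of those two phases, with scikit-learn standing in for the external GPU KMeans (names and sizes here are illustrative, not VectorChord's actual code):

    import numpy as np
    from sklearn.cluster import KMeans

    vectors = np.random.randn(1_000_000, 960).astype(np.float32)

    # Phase 1 (external): fit centroids on a sample; only the
    # distribution matters, so a subset is enough
    sample = vectors[np.random.choice(len(vectors), 100_000, replace=False)]
    centroids = KMeans(n_clusters=4096, n_init=1).fit(sample)

    # Phase 2 (in the database): assign every vector to its nearest
    # centroid to build the inverted lists
    lists = centroids.predict(vectors)

    # A later insert is just one more nearest-centroid lookup + append
    new_vec = np.random.randn(1, 960).astype(np.float32)
    target_list = centroids.predict(new_vec)[0]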
estebarb
20 days ago
A comparison with pgvectorscale, which uses binary quantization and StreamingDiskANN, would have been nice.
gaocegege
20 days ago
We will look into this. We've tried it before, but it took too long to build the index.
_mmarshall
20 days ago
The cost to store a static set of 400k 768-dimension vectors is also $1 a month on DataStax's AstraDB. However, for that $1, AstraDB replicates the data 3x instead of storing it on a single machine.

Here is a link to the cost calculator. Note that the calculator includes the cost of ingestion, whereas the article only mentions storage costs, not ingestion costs: https://www.datastax.com/pricing/vector-search?cloudProvider...

Disclaimer: I work on vectorsearch/AstraDB at DataStax.

VoVAllen
20 days ago
It's hard to compare costs under the serverless pricing model, as writes and reads are billed separately. Going by the pricing page, DataStax charges $4,000 to write 100M 768-dim vectors, and 10M queries cost $300, which works out to only about 4 QPS over a month. For comparison, VectorChord can achieve 100 QPS on a $250 instance.
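A quick sanity check on that QPS figure, assuming the 10M queries are spread over a 30-day month:

    # 10M queries over a 30-day month
    print(10_000_000 / (30 * 24 * 3600))  # ~3.9 queries per second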
xieydd
20 days ago
The pricing estimator is fake, the bill is real : ).
tarasglek
21 days ago
I am still waiting for a good pattern for using multi-vector embeddings like ColBERT and ColPali in Postgres. I get that it's fun to optimize single-vector stuff, but multi-vector is that happy middle ground between a single vector and a reranker, and it seems to be validated only in specialized, exotic search DBs like Vespa.
jonathan-adly
20 days ago
I'd like to throw our project into the ring as one that solves this problem: https://github.com/tjmlabs/ColiVara

1. Uses half-vecs, so you cut everything down by half with no recall loss.

2. Uses token pooling with hierarchical clustering at a factor of 3, so you further cut things down by 2/3 with <1% loss (see the sketch below).

3. Everything is on Postgres and pgvector, so you can do all the usual Postgres things and shrink the corpus with document-metadata filtering.

4. We have a 5,000+ page corpus in production with <3 seconds latency.

5. We benchmark against the Vidore leaderboard and are very near SOTA.
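For the curious, a minimal sketch of what hierarchical token pooling does; this illustrates the technique, not ColiVara's actual code:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def pool_tokens(embs, pool_factor=3):
        # cluster a document's token embeddings, then mean-pool each
        # cluster, shrinking the multi-vector by ~pool_factor
        n_clusters = max(1, len(embs) // pool_factor)
        labels = fcluster(linkage(embs, method="ward"),
                          t=n_clusters, criterion="maxclust")
        pooled = np.stack([embs[labels == c].mean(axis=0)
                           for c in np.unique(labels)])
        # re-normalize so MaxSim scoring can keep using cosine similarity
        return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

    doc_tokens = np.random.randn(300, 128).astype(np.float32)  # e.g. ColBERT token vectors
    print(pool_tokens(doc_tokens).shape)  # (~100, 128)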

You can read about half-vecs here: https://jkatz05.com/post/postgres/pgvector-scalar-binary-qua...

Hierarchical token pooling: https://www.answer.ai/posts/colbert-pooling.html

And how we implemented them here: https://blog.colivara.com/

__jl__
20 days ago
I really like the idea of ColPali and products building on it, but I am still unsure which applications it makes the most sense for. We mostly deal with reports that are 80-90% text and 10-20% figures and tables. Does a vision-first approach make sense in this context? My sense is that text-based embeddings are better in mostly-text contexts. Layout, for example, is pretty much irrelevant to us but plays into vision-based approaches. What is your sense of this?
jonathan-adly
20 days ago
So - the synthetic QA datasets within Vidore are exactly like that: 90% text, 10% charts/tables. OCR + BM25 is at ~90% NDCG@5, which is pretty decent. ColPali/ours is at ~98%.

It is a small upgrade, but one nonetheless. The complexity and cost of multi-vectors *might* not make this worth it; it really depends on how accuracy-critical the task is.

For example, one of our customers runs this over FDA monographs, which are 95%+ text and maybe 5% tables. For them the misses were extremely painful, even though there weren't many of them in a text-based pipeline. So the migration made sense for them.

tarasglek
19 days ago
Wow, I didn't expect an answer with a product and source. This looks really good, gonna give you guys a try.
VoVAllen
21 days ago
There's no easy way that I know of to index ColBERT multi-vectors scalably. Vespa seems to rely heavily on binary quantization, which can cost a lot in recall. And for most cases, using ColBERT as a reranker is good enough, as in the pgvector example you posted.
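A minimal sketch of ColBERT-style MaxSim reranking over a candidate set, assuming unit-normalized token embeddings:

    import numpy as np

    def maxsim(query_tokens, doc_tokens):
        # late interaction: each query token takes its best-matching
        # doc token; the per-token maxima are summed into one score
        sims = query_tokens @ doc_tokens.T  # (n_query, n_doc) cosine sims
        return float(sims.max(axis=1).sum())

    def rerank(query_tokens, candidates):
        # candidates: list of (doc_id, doc_token_matrix) pairs from a
        # cheap single-vector first stage
        return sorted(candidates,
                      key=lambda c: maxsim(query_tokens, c[1]),
                      reverse=True)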
tarasglek
21 days ago
Seems like doing a proper relational 1:N chunk:multiple-vectors foreign key, binarization, and a clever join or multistage CTE would get us pretty close to useful.

I am OK with it being less efficient, as the dev UX will be amazing. Vespa ops (even in their cloud) are a complete nightmare compared to Postgres.

tarasglek
21 days ago
Looks like there is recent work on a pgvector example for this: https://github.com/pgvector/pgvector-python/blob/master/exam...
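The shape of that example is roughly a multistage CTE like the sketch below; the doc_embeddings table and column names are made up here, so see the linked file for the real code:

    # MaxSim in SQL: best match per (document, query token), summed per document
    sql = """
    WITH query_tokens AS (
        SELECT row_number() OVER () AS qno, qv
        FROM unnest(%(q)s::vector[]) AS qv
    ),
    best AS (
        SELECT d.document_id, q.qno,
               MAX(1 - (d.embedding <=> q.qv)) AS sim
        FROM doc_embeddings d CROSS JOIN query_tokens q
        GROUP BY d.document_id, q.qno
    )
    SELECT document_id, SUM(sim) AS score
    FROM best
    GROUP BY document_id
    ORDER BY score DESC
    LIMIT 10
    """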
gaocegege
21 days ago
Hey everyone! We've developed a new PostgreSQL extension that can store 400k vectors for just $1. Check it out!
rkuzsma
20 days ago
Would you be willing to speculate on how VectorChord's ingestion and query performance might compare to Elasticsearch/OpenSearch for dense vector and sparse vector search use cases, particularly when dealing with larger full text data sets (>5M records)?
VoVAllen
20 days ago
In the LAION-5M benchmark, we compared our performance against Elasticsearch and OpenSearch. Comparing ingestion performance is harder due to differences in architecture. Both Elasticsearch and OpenSearch, like most vector databases, use the concept of shards: each shard is a separate vector index, and queries aggregate results across shards. Larger shards lead to faster queries but come with higher resource requirements and slower updates.

It's also worth noting that Elasticsearch has implemented RaBitQ support for HNSW, so it's difficult to compare without running actual benchmarks. However, Elasticsearch typically requires at least double, if not triple, the memory size of the vector dataset to remain stable. In contrast, PostgreSQL can run stably with far fewer resources; for example, 32GB of memory is sufficient to manage 100 million vectors efficiently.

From my perspective, it would be faster at query time than Elasticsearch due to the extensive optimizations, and much, much faster for updates (inserts and deletes) because it uses IVF instead of HNSW.

curl-up
21 days ago
Does this mean you won't support pgvecto.rs anymore?
VoVAllen
21 days ago
I'm the project lead for both projects. We're still in the process of supporting all the functionality from pgvecto.rs in VectorChord (int8, vectors with more than 2000 dimensions, etc.). We'll provide migration docs for pgvecto.rs users. Users will have a better experience with VectorChord thanks to tighter integration with the Postgres storage system. We will stop supporting pgvecto.rs early next year, once everything in VectorChord is ready.
nextworddev
21 days ago
What dimension vectors are we talking about here?
jesperwe
21 days ago
The first paragraph of TFA mentions "768-dimensional vectors".
jasonkester
21 days ago
In five pages of text, we never get to learn what a Vector is (in this context), why we’d want to store one in pgsql, or why it costs so much to store them compared to anything else you’d store there.

For an example of how you can communicate with domain experts while still giving everyone else some clue as to what the hell you're talking about, check out the link to the product that this thing claims to be a successor to:

https://pgvecto.rs/

That starts off by telling us what it is and what it does.

lja
21 days ago
That's because this product isn't for you then. My team has been evaluating vector databases for years and everything on the VectorChord page resonated with me. We run one of the world's largest vector databases and we'll likely benchmark vectorchord to see if it lives up to its promises here.
gaocegege
21 days ago
Hi, we’re here to help if you need assistance (via GitHub issue, Discord, or email). Could you let us know the scale of your vectors—are they 1B or 10B?
lja
21 days ago
10B, feel free to email me luke@ the domain in my profile.
thelittleone
20 days ago
Do you have a blog, perchance? Or any recommended reading on preprocessing / data-chunking strategies to improve results?
VoVAllen
20 days ago
I recently came across a project that looks promising: WordLlama (https://github.com/dleemiller/WordLlama?tab=readme-ov-file#s...). It appears to be well-suited for semantic chunking, though I haven't had a chance to try it out yet.
redskyluan
20 days ago
Maybe check this: https://zilliz.com/pricing

You can easily store 1B vectors in Zilliz Serverless, and the cost is incredibly cheap.

xieydd
20 days ago
You can't compare serverless with a dedicated instance; that's unfair. A large number of writes and queries will eat up your wallet.
redskyluan
20 days ago
That's true. We now offer a new dedicated cost-optimized CU priced at $0.318/hour ($228.96/month), which means you can store 20M 768-dimension vectors for about $229 per month.

lja
20 days ago
Zilliz gave us a near-six-figure monthly quote for our database.
Raed667
21 days ago
And no hosting provider starts by teaching you what HTML is.
curl-up
21 days ago
So they should start every one of their posts, on that same site, with a summary of what is available on the homepage?
jasonkester
20 days ago
Wow. It never occurred to me that this might be anything but the landing page of a product.

The title here, the presentation on the page itself: everything screams "landing page". I had to go back on a desktop browser to see the word "blog" in the URL bar, and mentally shift those graphics and little islands of text around until I could view it from that lens. If it's really just a sub-product of the main product that they're talking about, then yeah, it makes more sense in that context.

But my answer to your question would still be "Yes". Absolutely. If you're a product, the job of your blog is to convince people coming off the street that they need your thing, even if they didn't realize it yet.

Step one of that process is to not bounce them back to the street without any idea what they're looking at.
