FilterHN

Show HN: Pg_sorted_heap–Physically sorted PostgreSQL with builtin vector search

9 points

by skuznetsov37

21 hours ago

| past

| 1 comment

| github.com

| HN

▲

skuznetsov37

21 hours ago

[-]

  Author here. pg_sorted_heap is a PostgreSQL table AM extension that does two things:

  1. Keeps data physically sorted by primary key with per-page zone maps. At 100M rows a point query reads 1 buffer page (0.045ms) vs 8 for btree (0.5ms) vs 520K for seq scan (1.2s).

  2. Built-in IVF-PQ vector search — no pgvector dependency.

  The vector search part is where it gets interesting. The physical clustering by PK prefix IS the inverted file index. You set partition_id (IVF cluster assignment) as the leading PK column, sorted_heap
  clusters rows by it, and the zone map skips irrelevant partitions at the I/O level. No separate index structure, no 800 MB HNSW graph.

  Numbers (103K × 2880-dim vectors, 1 Gi k8s pod):

    pgvector HNSW:  97% R@1, 14ms, 806 MB index, max 2,000 dims
    IVF-PQ:         97% R@1, 22ms,  27 MB index, max 32,000 dims

  Two vector types: svec (float32, 16K dims) and hsvec (float16, 32K dims). We tested float16 vs float32 on the same dataset — no measurable recall difference. PQ quantization is the bottleneck, not storage
  precision.

  The 2,000-dim limit matters: models like Nomic embed v2 output 2880 dims, and pgvector simply can't build HNSW or IVFFlat indexes for them.

  PostgreSQL 17 + 18, Apache 2.0. Full crash recovery, online compaction, TOAST, pg_upgrade all tested.

  Vector search docs: https://skuznetsov.github.io/pg_sorted_heap/vector-search

  Happy to answer questions about the architecture, benchmarks, or trade-offs vs pgvector.