Show HN: 1B Embeddings
3 points | 4 hours ago | 2 comments
We built a vector search engine based on Quantized Tensor Train (QTT) decomposition. Instead of approximate nearest neighbor (ANN) indices like HNSW or IVF, we factorize the entire dataset into a compressed tensor format and serve exact cosine similarity queries directly from the compressed representation. The headline: 1 billion vectors on a single H100, 38ms query, 100% recall, 66 GB serving.

Recall improves with scale at fp16: 96% at 400M → 98% at 500M → 99% at 600M → 100% at 1B. This is the opposite of ANN indices, where recall degrades with scale. More data helps the decomposition converge.

Every number below is measured, not projected. Full benchmark suite across 4 GPUs at 3 precision tiers. H100 80GB, 384-dim embeddings, rank=32.

fp16 (Scale tier) — H100 80GB:

  100M:  5.87ms p50,  6.6 GB serving, 100% recall, 46.5x compression
  500M: 20.54ms p50, 33.0 GB serving,  98% recall, 46.5x compression
    1B: 38.51ms p50, 66.0 GB serving, 100% recall, 46.5x compression
fp32 (Production tier) — H100 80GB:

  100M: 18.96ms p50, 13.2 GB serving, 100% recall, 23.3x compression
  300M: 46.29ms p50, 39.6 GB serving, 100% recall, 23.3x compression
  500M: 76.53ms p50, 66.0 GB serving, 100% recall, 23.3x compression

fp64 (Exact tier) — H100 80GB:

   10M:  2.68ms p50,  2.6 GB serving, 100% recall, 11.6x compression
  100M: 20.40ms p50, 26.4 GB serving, 100% recall, 11.6x compression
  200M: 40.40ms p50, 51.6 GB serving, 100% recall, 11.6x compression
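The tier ratios are consistent with simple bytes-per-entry arithmetic. A back-of-envelope check, under my assumption that the baseline is raw fp64 vectors at D=384 and that only the coefficient precision changes per tier:

```python
r = 32          # rank used throughout the post
D = 384         # embedding dimension in the H100 tables
raw_bytes = 8   # assumption: baseline is raw fp64 vectors (3072 B each)

# Bytes per entry are r * coeff_bytes, independent of D, so the
# compression ratio scales linearly with embedding dimension.
ratios = {tier: (D * raw_bytes) / (r * cb)
          for tier, cb in [("fp16", 2), ("fp32", 4), ("fp64", 8)]}
print(ratios)  # {'fp16': 48.0, 'fp32': 24.0, 'fp64': 12.0}
```

These land close to the reported 46.5x / 23.3x / 11.6x; the small gap (~66 vs. 64 bytes per entry at fp16) would be V_T plus index overhead.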

Hardware portability — same codebase, different GPUs:

  P4000   8 GB: fp16  50M,  26ms p50, 100% recall, $0.07/hr
  A100   40 GB: fp16 200M, 3.1ms p50,  98% recall, $0.70/hr
  H100   80 GB: fp16 500M, 3.1ms p50,  98% recall, $2.09/hr
  B200  192 GB: fp16 500M, 1.4ms p50,  99% recall, est. ~$5/hr

Recall is hardware-invariant: same math, same results, from the P4000 through the H100. The 2B run on B200 is in progress.

How it works: the dataset X (N×D) is factored as X ≈ Z · V_T, where Z is (N×r) and V_T is (r×D), with r=32. A query is a single GEMV: scores = Z · (V_T · q). Bytes per entry: 2r bytes at fp16 = 64 bytes, regardless of embedding dimension. A 1536-dim OpenAI ada-002 embedding compresses 23.6x at fp32 with zero recall loss.
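The query path described above (scores = Z · (V_T · q)) can be sketched in a few lines. Illustrative NumPy only, not the actual engine; the array names Z, V_T, q follow the post, and the sizes are toy values:

```python
import numpy as np

# Toy sizes: N vectors, D dims, rank r (the post uses r=32)
N, D, r = 100_000, 384, 32
rng = np.random.default_rng(0)

Z = rng.standard_normal((N, r)).astype(np.float16)    # per-vector coefficients
V_T = rng.standard_normal((r, D)).astype(np.float16)  # shared rank-r basis

def search(q, k=10):
    """Exact scoring from the compressed form: scores = Z @ (V_T @ q)."""
    qr = V_T @ q.astype(np.float16)         # project query to rank space: (r,)
    scores = Z @ qr                         # one GEMV over all N entries: (N,)
    return np.argpartition(-scores, k)[:k]  # top-k indices, unordered

q = rng.standard_normal(D).astype(np.float16)
top10 = search(q)
print(top10.shape)  # (10,)
```

The key property is that the per-query cost is r·D + N·r multiply-adds over the compressed coefficients, never a scan of the raw N·D matrix.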

Compression is dimension-independent:

  384-dim  MiniLM:     11.6x, 100% recall
  768-dim  E5-large:   11.8x, 100% recall
  1024-dim Cohere v3:  15.8x, 100% recall
  1536-dim ada-002:    23.6x, 100% recall
Operational details (H100, 10M vectors):

  QPS: 317 single client, 183 at 100 concurrent
  Cold start: 8.88s from snapshot to first query
  24h soak: 2.9M queries, 8.6M inserts, zero data corruption
  Insert-under-query: 885 inserts/s concurrent with 101 QPS
  All artifacts (JSON + logs) available

The build uses streaming randomized SVD: peak VRAM equals serving size, not dataset size. The 2B run on B200 uses streaming coefficient regeneration, so the 512 GB coefficient matrix is never fully allocated in RAM.

brad@holonomx.com
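A two-pass streaming build with peak memory bounded by the serving size (not the dataset) can be sketched as follows. This is my own outline under stated assumptions, not the author's pipeline: since D is small (384), pass 1 accumulates the D×D Gram matrix chunk by chunk, and pass 2 projects each chunk onto the top-r eigenvectors:

```python
import numpy as np

def streaming_lowrank(chunks, D, r=32):
    """Two-pass streaming rank-r factorization X ~ Z @ V_T.

    Pass 1: accumulate the D x D Gram matrix (tiny for D=384).
    Pass 2: project each chunk onto the top-r principal directions.
    Peak memory: one chunk plus D*D, never the full N x D dataset.
    """
    G = np.zeros((D, D), dtype=np.float64)
    for X in chunks():                        # pass 1 over row blocks of X
        G += X.T @ X
    w, U = np.linalg.eigh(G)                  # eigenvalues ascending
    V = U[:, ::-1][:, :r]                     # top-r directions, (D, r)
    Z_blocks = [X @ V for X in chunks()]      # pass 2: coefficients
    return np.vstack(Z_blocks), V.T           # Z (N, r), V_T (r, D)

# Toy usage: 10k x 384 random rows served in 4 chunks
rng = np.random.default_rng(0)
data = rng.standard_normal((10_000, 384))

def chunks():
    for i in range(0, 10_000, 2_500):
        yield data[i:i + 2_500]

Z, V_T = streaming_lowrank(chunks, D=384, r=32)
print(Z.shape, V_T.shape)  # (10000, 32) (32, 384)
```

In a real build, Z_blocks would be written to disk as they are produced rather than stacked in RAM, which is what makes the 2B coefficient matrix avoidable.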
INVARIAN
3 hours ago
B200 2B fp16 — complete.

  Metric            Value
  N                 2,000,000,000
  GPU               NVIDIA B200 (191.5 GB)
  Build             1794.2s (Pass 1: 637s, Pass 2: 1157s)
  Serving           132.00 GB (Z=[2B, 32] fp16)
  Compression       46.5× fp64, 23.3× fp32
  Query p50         60.89 ms
  Query p99         62.58 ms
  R@10 mean         98.0%
  R@10 min          90.0%
  VRAM serving      131 GB
  VRAM query peak   142 GB
  CPU RAM post      9 GB
  Total wall        2693.6s (~45 min)

INVARIAN
3 hours ago
2.5B: Incoming