Recall improves with scale at fp16: 96% at 400M → 98% at 500M → 99% at 600M → 100% at 1B. This is the opposite of ANN indices, where recall degrades with scale. More data helps the decomposition converge.
Every number below is measured, not projected: a full benchmark suite across four GPUs at three precision tiers. Unless stated otherwise: H100 80GB, 384-dim embeddings, rank r=32.
fp16 (Scale tier) — H100 80GB:
100M: 5.87ms p50, 6.6 GB serving, 100% recall, 46.5x compression
500M: 20.54ms p50, 33.0 GB serving, 98% recall, 46.5x compression
1B: 38.51ms p50, 66.0 GB serving, 100% recall, 46.5x compression
fp32 (Production tier) — H100 80GB:
100M: 18.96ms p50, 13.2 GB serving, 100% recall, 23.3x compression
300M: 46.29ms p50, 39.6 GB serving, 100% recall, 23.3x compression
500M: 76.53ms p50, 66.0 GB serving, 100% recall, 23.3x compression
fp64 (Exact tier) — H100 80GB:
10M: 2.68ms p50, 2.6 GB serving, 100% recall, 11.6x compression
100M: 20.40ms p50, 26.4 GB serving, 100% recall, 11.6x compression
200M: 40.40ms p50, 51.6 GB serving, 100% recall, 11.6x compression

Hardware portability — same codebase, different GPUs:
P4000 8 GB: fp16 50M, 26ms p50, 100% recall, $0.07/hr
A100 40 GB: fp16 200M, 3.1ms p50, 98% recall, $0.70/hr
H100 80 GB: fp16 500M, 3.1ms p50, 98% recall, $2.09/hr
B200 192 GB: fp16 500M, 1.4ms p50, 99% recall, est. ~$5/hr

Recall is hardware-invariant: same math, same results, P4000 through H100. The 2B run on B200 is in progress.

How it works: the dataset X (N×D) is factored as X ≈ Z · V_T, where Z is (N×r) and V_T is (r×D), with r=32. A query is two GEMVs: project once with q_r = V_T · q (tiny), then score everything with scores = Z · q_r, so the dominant cost is a single (N×r) GEMV. Bytes per entry: 2r bytes at fp16 = 64 bytes, regardless of embedding dimension. A 1536-dim OpenAI ada-002 embedding compresses 23.6x at fp32 with zero recall loss.
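The factorization and the two-step query can be sketched in NumPy. This is a minimal illustration, not the production code: a dense truncated SVD stands in for the streaming randomized SVD used at scale, and all names and sizes here are illustrative.

```python
import numpy as np

# Illustrative sizes: N entries, D-dim embeddings, rank r.
N, D, r = 10_000, 384, 32
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D)).astype(np.float32)

# Low-rank factorization X ~= Z @ V_T (dense SVD here for clarity only).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = (U[:, :r] * s[:r]).astype(np.float16)   # (N, r) coefficients: 2r = 64 B/entry
V_T = Vt[:r].astype(np.float32)             # (r, D) shared basis

def query(q):
    # Two GEMVs: project the query to r dims, then score all N entries.
    q_r = V_T @ q                            # (r,) — tiny
    return Z.astype(np.float32) @ q_r        # (N,) — the dominant (N x r) GEMV

q = rng.standard_normal(D).astype(np.float32)
scores = query(q)
top10 = np.argsort(scores)[::-1][:10]       # indices of the top-10 matches
```

On GPU the large GEMV would run directly in fp16 rather than upcasting Z per query; the upcast here just keeps the NumPy sketch simple.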
Per-entry footprint is dimension-independent, so the compression ratio grows with embedding dimension:
384-dim MiniLM: 11.6x, 100% recall
768-dim E5-large: 11.8x, 100% recall
1024-dim Cohere v3: 15.8x, 100% recall
1536-dim ada-002: 23.6x, 100% recall
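The arithmetic behind this can be sketched as back-of-envelope code. The constants are assumptions (fp64 originals, fp16 coefficients, fp32 shared basis), and it ignores index metadata, so it lands near, not exactly on, the measured ratios above.

```python
# Serving cost per entry is r coefficients — independent of embedding dim D.
def bytes_per_entry(r, coeff_bytes=2):
    return r * coeff_bytes

# Idealized ratio: dense embeddings vs. Z plus the shared (r x D) basis.
def compression_ratio(N, D, r, orig_bytes=8, coeff_bytes=2, basis_bytes=4):
    raw = N * D * orig_bytes
    packed = N * r * coeff_bytes + D * r * basis_bytes
    return raw / packed

r = 32
for D in (384, 768, 1024, 1536):
    assert bytes_per_entry(r) == 64          # constant across dimensions
print(round(compression_ratio(N=1_000_000_000, D=384, r=32), 1))  # prints 48.0
```

The idealized 48x for 384-dim fp16 sits just above the measured 46.5x, consistent with some per-index overhead in the real system.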
Operational details (H100, 10M vectors):
QPS: 317 single client, 183 at 100 concurrent
Cold start: 8.88s from snapshot to first query
24h soak: 2.9M queries, 8.6M inserts, zero data corruption
Insert-under-query: 885 inserts/s concurrent with 101 QPS
All artifacts (JSON + logs) available
Build uses streaming randomized SVD — peak VRAM equals serving size, not dataset size. The 2B run on B200 uses streaming coefficient regeneration so the 512 GB coefficient matrix is never fully allocated in RAM.
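A two-pass streaming build can be sketched as follows. This is an assumption-laden stand-in, not the post's pipeline: the Gram-accumulation trick, oversampling, and chunking scheme are illustrative choices; the point is that pass 1 touches each chunk once to find the basis, and pass 2 regenerates coefficients chunk by chunk, so peak memory tracks the output, not the dataset.

```python
import numpy as np

def streaming_lowrank(chunks, D, r, oversample=8, seed=0):
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((D, r + oversample)).astype(np.float32)

    # Pass 1: accumulate the small (D x (r+p)) sketch G = X.T @ X @ Omega
    # chunk by chunk; memory is independent of N.
    G = np.zeros((D, r + oversample), dtype=np.float32)
    for X in chunks():                       # each chunk is (n_i, D)
        G += X.T @ (X @ Omega)
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    V_T = U[:, :r].T                         # (r, D) shared basis

    # Pass 2: regenerate coefficients per chunk; only Z is materialized.
    Z = np.concatenate([(X @ V_T.T).astype(np.float16) for X in chunks()])
    return Z, V_T

# Tiny demo: data with true rank 5 <= r reconstructs almost exactly
# (residual error is dominated by fp16 quantization of Z).
rng = np.random.default_rng(1)
data = (rng.standard_normal((200, 5)) @ rng.standard_normal((5, 64))).astype(np.float32)
Z, V_T = streaming_lowrank(lambda: (data[i:i + 50] for i in range(0, 200, 50)), D=64, r=8)
err = np.linalg.norm(data - Z.astype(np.float32) @ V_T) / np.linalg.norm(data)
```

The `chunks` callable is invoked once per pass, which is what makes "streaming coefficient regeneration" possible: the coefficient matrix never needs to exist in full anywhere except the output buffer.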
brad@holonomx.com

2B on B200:
N: 2,000,000,000
GPU: NVIDIA B200 (191.5 GB)
Build: 1794.2s (Pass 1: 637s, Pass 2: 1157s)
Serving: 132.00 GB (Z = [2B, 32] fp16)
Compression: 46.5× vs fp64, 23.3× vs fp32
Query p50: 60.89 ms
Query p99: 62.58 ms
R@10 mean: 98.0%
R@10 min: 90.0%
VRAM serving: 131 GB
VRAM query peak: 142 GB
CPU RAM post-build: 9 GB
Total wall: 2693.6s (~45 min)
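For reference, an R@10 figure like the one above can be measured as the overlap between the factored top-10 and the exact brute-force top-10, per query. A minimal sketch (function name and sizes are illustrative):

```python
import numpy as np

def recall_at_k(X, Z, V_T, queries, k=10):
    # Per-query overlap between factored top-k and exact top-k.
    recalls = []
    for q in queries:
        exact = set(np.argsort(X @ q)[-k:])           # ground-truth top-k
        approx = set(np.argsort(Z @ (V_T @ q))[-k:])  # factored top-k
        recalls.append(len(exact & approx) / k)
    return float(np.mean(recalls)), float(np.min(recalls))

# Sanity check: when X is exactly Z @ V_T, recall is 1.0 by construction.
rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 8))
V_T = rng.standard_normal((8, 32))
X = Z @ V_T
mean_r, min_r = recall_at_k(X, Z, V_T, rng.standard_normal((20, 32)))
```

The mean/min split reported above falls out naturally from the per-query list: mean is the headline number, min is the worst single query.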