Constant 14ms attention: 512→524K tokens (24.5x faster than FlashAttention)
1 point | 2 hours ago | 1 comment | github.com
luxiedge | 2 hours ago
I've developed an attention mechanism called the Waller Operator that maintains a constant ~14 ms latency regardless of sequence length, compared to FlashAttention's O(N²) compute scaling.

Benchmarks on NVIDIA H100 with Mistral-7B:

• Constant latency: 14.168-14.305 ms across 512 → 524,288 tokens (0.96% variance)
• 24.5x faster than FlashAttention v2.8.3 at 32K tokens
• O(N log N) memory complexity vs O(N²)
• Zero throughput degradation (FlashAttention shows 76% loss from 4K → 32K)
• Successfully executes at 524K tokens (FlashAttention OOMs beyond 32K)
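
If you want to sanity-check the FlashAttention baseline independently, a minimal CUDA-event timing sweep along these lines works. This is a sketch, not the exact harness from the repo: it assumes the flash_attn package is installed and uses illustrative Mistral-7B-like dims (32 query heads, head dim 128, GQA ignored).

    import torch
    from flash_attn import flash_attn_func

    def time_attention(seqlen, nheads=32, headdim=128, iters=20):
        # Random fp16 Q/K/V in flash_attn's (batch, seqlen, nheads, headdim) layout.
        q = torch.randn(1, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
        k, v = torch.randn_like(q), torch.randn_like(q)
        # Warm up so one-time setup cost doesn't skew the measurement.
        for _ in range(3):
            flash_attn_func(q, k, v, causal=True)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            flash_attn_func(q, k, v, causal=True)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # ms per call

    for n in (512, 4096, 32768):
        print(f"{n:>6} tokens: {time_attention(n):.2f} ms")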

Full benchmark data: https://github.com/RegularJoe-CEO/vllm/blob/waller-operator-...

FlashAttention baseline for comparison: https://github.com/vllm-project/vllm/pull/33860

The kernel achieves ~492-496 TFLOPS consistently across all sequence lengths.
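
For context on how a TFLOPS figure relates to latency: it's just a FLOP count divided by time. Below is a minimal conversion sketch using the common dense-attention convention (4 * N^2 * heads * head_dim, halved for causal); which FLOP count you plug in for a sub-quadratic kernel is a convention choice, so treat this helper as illustrative rather than the repo's benchmark script.

    def effective_tflops(seqlen, latency_ms, nheads=32, headdim=128, causal=True):
        # Dense-attention convention: QK^T and PV matmuls, halved under a causal mask.
        flops = 4 * seqlen**2 * nheads * headdim
        if causal:
            flops //= 2
        return flops / (latency_ms * 1e-3) / 1e12

    # e.g. dense-convention TFLOPS at 32K tokens and ~14.2 ms per call
    print(f"{effective_tflops(32_768, 14.2):.0f} TFLOPS")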

Looking for feedback on the approach and additional validation suggestions.
