Benchmarks on NVIDIA H100 with Mistral-7B:
• Constant latency: 14.168-14.305 ms across 512 → 524,288 tokens (0.96% variance; see the timing sketch below)
• 24.5x faster than FlashAttention v2.8.3 at 32K tokens
• O(N log N) memory complexity vs O(N²)
• Zero throughput degradation (FlashAttention shows a 76% loss from 4K → 32K)
• Runs successfully at 524K tokens (FlashAttention OOMs beyond 32K)
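For anyone who wants to sanity-check the latency numbers locally, here is a minimal timing sketch (not the harness from the branch). It assumes PyTorch with CUDA, Mistral-7B-like head geometry, and uses SDPA purely as a stand-in callable for the kernel under test:

```python
# Minimal latency-sweep sketch (assumptions: PyTorch with CUDA available;
# SDPA is only a stand-in for the kernel under test).
import torch
import torch.nn.functional as F

def sweep_latency(attn_fn, seq_lens, n_heads=32, head_dim=128,
                  dtype=torch.float16, warmup=5, iters=20):
    """Return {seq_len: mean latency in ms} for the given attention callable."""
    out = {}
    for n in seq_lens:
        q = torch.randn(1, n_heads, n, head_dim, device="cuda", dtype=dtype)
        k, v = torch.randn_like(q), torch.randn_like(q)
        for _ in range(warmup):                      # warm up kernels / allocator
            attn_fn(q, k, v)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            attn_fn(q, k, v)
        end.record()
        torch.cuda.synchronize()
        out[n] = start.elapsed_time(end) / iters     # elapsed_time() is in ms
    return out

if __name__ == "__main__":
    # Swap F.scaled_dot_product_attention for the actual kernel being benchmarked.
    print(sweep_latency(F.scaled_dot_product_attention, [512, 4096, 32768]))
```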
Full benchmark data: https://github.com/RegularJoe-CEO/vllm/blob/waller-operator-...
FlashAttention baseline for comparison: https://github.com/vllm-project/vllm/pull/33860
The kernel sustains ~492-496 TFLOPS across all tested sequence lengths.
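For context on how a TFLOPS figure maps to a measured latency, here is the conversion as a small sketch; the FLOP count used in the example is a placeholder, since the kernel's own FLOP accounting isn't specified here:

```python
# Latency-to-throughput conversion. The FLOP count passed in is a placeholder;
# the real accounting depends on the kernel's algorithm.
def achieved_tflops(total_flops: float, latency_ms: float) -> float:
    """Sustained throughput in TFLOPS for `total_flops` FLOPs done in `latency_ms` ms."""
    return total_flops / (latency_ms * 1e-3) / 1e12

# Placeholder example: ~7.0e12 FLOPs completed in 14.2 ms comes out to roughly 493 TFLOPS.
print(f"{achieved_tflops(7.0e12, 14.2):.0f} TFLOPS")
```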
Looking for feedback on the approach and suggestions for additional validation.