
Show HN: A Zero-Copy 1.58-bit LLM Engine hitting 117 Tokens/s on single CPU core

2 points

3 hours ago

| 0 comments

The Project: I am building R3-Engine, a from-scratch local AI inference engine for Microsoft's bitnet-b1.58-2B-4T. It is written in 100% safe Rust, cross-compiles natively to Wasm SIMD128, and performs zero heap allocations in the execution loop.

The Physics: By mapping a 64-byte-aligned .r3 file directly from NVMe so the weights are read in place through the CPU's L3 cache (zero-copy), and using AVX-512 VPOPCNTDQ for branchless math, the engine reaches a throughput of 117 tokens/second on a single core of a Ryzen 9950X3D.
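For readers unfamiliar with the popcount trick, here is a minimal scalar sketch of a branchless ternary dot product. It assumes both operands are ternary and stored as two bit-planes per 64 elements. The `TernaryChunk` layout and the function names are hypothetical and not taken from the R3-Engine codebase. `u64::count_ones` stands in for what VPOPCNTDQ does across a whole 512-bit register.

```rust
/// 64 ternary values stored as two bit-planes.
/// Hypothetical layout for illustration; the real .r3 format may differ.
#[derive(Clone, Copy)]
struct TernaryChunk {
    plus: u64,  // bit i set => element i is +1
    minus: u64, // bit i set => element i is -1
}

/// Pack up to 64 values from {-1, 0, +1} into one chunk.
fn pack(values: &[i8]) -> TernaryChunk {
    assert!(values.len() <= 64);
    let mut chunk = TernaryChunk { plus: 0, minus: 0 };
    for (i, &v) in values.iter().enumerate() {
        match v {
            1 => chunk.plus |= 1u64 << i,
            -1 => chunk.minus |= 1u64 << i,
            0 => {}
            _ => panic!("value is not ternary"),
        }
    }
    chunk
}

/// Dot product of two ternary vectors with no per-element branches.
/// Lanes where the signs agree contribute +1, lanes where they differ
/// contribute -1, and any lane with a zero contributes nothing.
fn ternary_dot(w: &[TernaryChunk], x: &[TernaryChunk]) -> i32 {
    w.iter()
        .zip(x)
        .map(|(w, x)| {
            let agree = (w.plus & x.plus) | (w.minus & x.minus);
            let differ = (w.plus & x.minus) | (w.minus & x.plus);
            agree.count_ones() as i32 - differ.count_ones() as i32
        })
        .sum()
}
```

The result is a raw integer accumulator. It carries no scale information, so it has to be multiplied by the weight and activation scales before it means anything in float space.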

The Problem: The AI is mute (outputting <unk>). The matrix multiplication pipeline is mathematically complete, but the output is stuck at token ID 0 (<unk>). The issue lies in the transition between the quantized weights and the float-based non-linear activations.

Where I need expert input:

    Weight Tying in BitNet: Microsoft's 2B model ties Embeddings with the LM Head. I am cloning the embedding matrix for the output projection, but I suspect a scaling factor is missing.

    RMSNorm & SiLU in 1.58-bit: How should the raw integer accumulators (from the VPOPCNTDQ loop) be scaled before entering the SiLU activation and the subsequent layer?
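To make the scaling question concrete, here is a rough sketch of the usual rescaling step. It assumes absmean weight quantization (W ≈ w_scale · W_q with W_q in {-1, 0, +1}) and per-token absmax int8 activation quantization (x ≈ x_q · x_absmax / 127). Those two schemes, and every function name below, are assumptions for illustration and not confirmed details of R3-Engine or the checkpoint. It is also worth checking the model's published config to see which activation function it actually specifies; SiLU is used here only because it is the one named above.

```rust
/// Convert a raw integer accumulator back to real units.
/// Assumes W ≈ w_scale * W_q and x ≈ x_q * x_absmax / 127 (see lead-in).
fn dequantize(acc: i32, w_scale: f32, x_absmax: f32) -> f32 {
    acc as f32 * w_scale * (x_absmax / 127.0)
}

/// SiLU (x * sigmoid(x)), evaluated in float after dequantization.
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// In-place RMSNorm: x[i] <- x[i] / sqrt(mean(x^2) + eps) * gain[i].
fn rms_norm(x: &mut [f32], gain: &[f32], eps: f32) {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for (v, g) in x.iter_mut().zip(gain) {
        *v *= inv_rms * g;
    }
}

/// Absmax int8 quantization into a caller-provided buffer (no allocation).
/// Returns the absmax so the next matmul's accumulators can be rescaled.
fn quantize_activations(x: &[f32], out: &mut [i8]) -> f32 {
    let absmax = x.iter().fold(0.0f32, |m, v| m.max(v.abs())).max(1e-5);
    let scale = 127.0 / absmax;
    for (o, v) in out.iter_mut().zip(x) {
        *o = (v * scale).round().clamp(-128.0, 127.0) as i8;
    }
    absmax
}
```

Under these assumptions the order per linear layer would be: normalize in float, quantize the activations, run the integer matmul, call `dequantize` on each accumulator, then apply the non-linearity in float. Feeding unscaled accumulators into the activation or the next norm is one plausible way to end up with saturated or degenerate values.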

GitHub Repo: https://github.com/r3-engine/r3-engine

If you know the physics of LLM Logit Sampling or ternary activation math, I would love your eyes on the codebase.
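On the weight-tying question, here is a small sketch of what a tied LM head computes, assuming the embedding table is available in floating point and that `hidden` has already passed through the final norm. Both are assumptions to verify against the checkpoint, and the function names are hypothetical.

```rust
/// Logits from a tied LM head: logits[t] = dot(embedding[t], hidden).
/// `embedding` is row-major with shape [vocab, dim]; `logits` has length vocab.
/// The same rows used for input lookup are reused for the output projection.
fn tied_logits(embedding: &[f32], dim: usize, hidden: &[f32], logits: &mut [f32]) {
    assert_eq!(hidden.len(), dim);
    assert_eq!(embedding.len(), logits.len() * dim);
    for (row, out) in embedding.chunks_exact(dim).zip(logits.iter_mut()) {
        *out = row.iter().zip(hidden).map(|(e, h)| e * h).sum::<f32>();
    }
}

/// Greedy sampling. Returns None when every logit is NaN, which makes a
/// broken logit vector visible instead of silently yielding token 0.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        if best.map_or(true, |(_, b)| v > b) {
            best = Some((i, v));
        }
    }
    best.map(|(i, _)| i)
}
```

A constant token ID 0 is also what a simple argmax returns for an all-zero or all-equal logit vector, so dumping the logits' min, max, and NaN count before sampling may help separate a scaling bug from a sampling bug.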
