The Physics: By mapping a 64-byte aligned .r3 file directly from NVMe to CPU L3 Cache (Zero-Copy) and using AVX-512 VPOPCNTDQ for branchless math, the Ryzen 9950X3D achieves 117 Tokens/Second latency.
The Problem: The AI is mute (Outputting <unk>*)* The matrix multiplication pipeline is mathematically complete, but the output is stuck at Token ID 0 (<unk>). The issue lies in the transition between the quantized weights and the float-based non-linear activations.
Where I need expert input:
Weight Tying in BitNet: Microsoft's 2B model ties Embeddings with the LM Head. I am cloning the embedding matrix for the output projection, but I suspect a scaling factor is missing.
RMSNorm & SiLU in 1.58-bit: How should the raw integer accumulators (from the VPOPCNTDQ loop) be scaled before entering the SiLU activation and the subsequent layer?
GitHub Repo: https://github.com/r3-engine/r3-engineIf you know the physics of LLM Logit Sampling or ternary activation math, I would love your eyes on the codebase.