For what it's worth, there's a bunch of open-source NPU work in progress too. There's a layer, "Teflon", for Gallium3D, shared by most of these drivers, that TensorFlow Lite can use as a delegate. Then there are hardware drivers for Rockchip (via the Rocket driver) and for Vivante (via Etnaviv). It'd be extra interesting to see how (or whether) they've dealt with the system constraints here (the small scratchpad size).

https://www.phoronix.com/news/Gallium3D-Teflon-Merged
https://www.phoronix.com/news/Rockchip-NPU-Linux-Mesa
https://www.phoronix.com/news/Two-NPU-Accel-Drivers-2026
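To make the Teflon piece concrete: it plugs into TensorFlow Lite through the external-delegate mechanism, so ops the driver can't handle fall back to the CPU kernels during graph partitioning. A rough sketch below; the library path and model file are placeholders, not something from the articles:

```python
# Minimal sketch: loading Mesa's Teflon build as a TensorFlow Lite
# external delegate. The .so path is an assumption -- it depends on
# where your distro installs Mesa's TFLite delegate library.
import numpy as np
import tflite_runtime.interpreter as tflite

delegate = tflite.load_delegate("/usr/lib/libteflon.so")  # assumed path
interpreter = tflite.Interpreter(
    model_path="model.tflite",  # any quantized CNN model
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()

# Run one dummy inference; unsupported ops run on the CPU automatically.
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
```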
> *The main reason I stuck with the closed-source `rknn` stack for this specific project was operator support for Transformers. Teflon is getting good at standard CNN ops (fused ReLU, convolutions, etc.), but the SigLIP vision encoder relies on large transposes and unbounded GELU activations that currently fall off the 'happy path' in the open stack.*
> *To your point on the system constraints (small scratchpad): I suspect the current open-source drivers would hit the exact same 32KB SRAM wall I found. The hardware simply refuses to tile large matrices automatically. My 'Nano-Tiling' fix was a software-level patch; porting that logic into the Mesa driver itself would probably be the 'Holy Grail' fix here.*
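To illustrate what a software-side tiling pass looks like: this is just a generic numpy sketch of the idea, not the actual 'Nano-Tiling' patch; the 32KB figure comes from the comment above.

```python
import math
import numpy as np

SRAM_BYTES = 32 * 1024  # scratchpad budget from the parent comment

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Illustrative software tiling: break A @ B into sub-matmuls whose
    working set (A tile + B tile + C tile) fits in the scratchpad,
    since the hardware won't tile large matrices automatically."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    # Square tiles: three t*t tiles must fit inside SRAM_BYTES.
    t = int(math.sqrt(SRAM_BYTES / (3 * a.itemsize)))
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, t):
        for j in range(0, n, t):
            for p in range(0, k, t):
                # On real hardware each tile would be DMA'd into SRAM,
                # multiplied on the NPU, and accumulated back out.
                out[i:i+t, j:j+t] += a[i:i+t, p:p+t] @ b[p:p+t, j:j+t]
    return out

# Quick check against the untiled result (loose tolerance: float32
# summation order differs between the two versions).
a = np.random.rand(300, 200).astype(np.float32)
b = np.random.rand(200, 250).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-2)
```

The same loop structure is what a Mesa driver would have to emit as DMA transfers plus per-tile matmul commands, which is presumably why pushing it down into the driver is the harder (but cleaner) fix.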