I recently bought a Snapdragon X Elite Copilot+ laptop and realized my integrated Adreno GPU was basically a paperweight for local AI. Standard tools like LM Studio and the massive PyTorch ecosystem didn't support it, forcing everything onto the CPU. I didn't want to wait for the ecosystem to catch up, so I built a from-scratch inference engine to bypass it entirely.
It’s written purely in Rust and WGSL. No CUDA, no Python, no heavy frameworks. Just raw compute shaders dispatching the Transformer forward pass, which makes it portable: it runs on Windows, macOS, and Linux via Vulkan, Metal, or DX12. Currently I'm getting ~33 tok/s on the Snapdragon Adreno (~25 tok/s with fp16) and 66+ tok/s (fp16/fp32) on an RTX 3090, running TinyLlama.
The build process: I actually had a dual motivation here. Beyond solving my hardware gap, I wanted a stress test for my own LLM orchestration tools. A Transformer engine requires exact math, strict buffer layouts (those WebGPU vec3 alignment traps are real), and standalone compute shaders; there is zero room for AI hallucination. I spent the time developing and validating a strict architectural blueprint up front. Then, using highly specific prompts, strict behavior guidance, and my custom MCP tools to feed the AI the exact WGSL specs, I scaffolded that predefined human architecture into working code in under 16 hours.
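For anyone who hasn't hit the vec3 alignment trap: WGSL gives `vec3<f32>`/`vec3<u32>` a 16-byte alignment, so the CPU-side struct you upload must reproduce the GPU layout exactly or fields land at the wrong offsets. Here's a minimal illustration in Rust (the struct names are made up for the example, not from my engine):

```rust
use std::mem::size_of;

// WGSL aligns vec3<u32> to 16 bytes, so a struct like
//   struct Params { dims: vec3<u32>, scale: f32 }
// is 16 bytes on the GPU: the f32 lands in the 4-byte slot after the vec3.
// The matching Rust-side #[repr(C)] struct happens to line up here, because
// `scale` fills the vec3's trailing pad slot.
#[repr(C)]
#[derive(Clone, Copy)]
struct Params {
    dims: [u32; 3], // mirrors vec3<u32>
    scale: f32,     // occupies the vec3's padding slot
}

// A layout that DOES need manual padding: two vec3s back to back.
// Without _pad0, Rust would place `b` at offset 12, but WGSL puts it at 16.
#[repr(C)]
#[derive(Clone, Copy)]
struct TwoVecs {
    a: [f32; 3],
    _pad0: f32, // push `b` to offset 16 to match WGSL
    b: [f32; 3],
    _pad1: f32, // round the struct up to a 16-byte multiple
}

fn main() {
    assert_eq!(size_of::<Params>(), 16);
    assert_eq!(size_of::<TwoVecs>(), 32);
    println!("Params: {} bytes, TwoVecs: {} bytes",
             size_of::<Params>(), size_of::<TwoVecs>());
}
```

The size asserts are the cheap insurance I'd recommend: they catch a layout drift at startup instead of producing silently garbage activations.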
It is very much alpha software. It's decode-only, single-sequence, and currently uses CPU-side sampling. I’d love to hear your thoughts, especially from anyone with deep WGSL/WebGPU experience regarding buffer layouts or optimizing the INT8 GEMM paths (I know I need to move to a tiled implementation to get around the VRAM bandwidth bottleneck).
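To make the tiling plan concrete, here's the idea sketched as a CPU reference in Rust. This is illustrative only, not my engine's actual kernel: each tile block stands in for what a WGSL workgroup would stage in `var<workgroup>` memory, reusing each loaded element TILE times instead of refetching it from global memory.

```rust
// CPU reference of workgroup tiling for int8 GEMM. Each TILE x TILE block of
// A and B is conceptually loaded once and reused for TILE^2 partial products,
// cutting memory traffic by roughly a factor of TILE versus the naive kernel.
const TILE: usize = 4; // a real WGSL kernel would likely use 8 or 16

/// C[m x n] = A[m x k] * B[k x n], int8 inputs accumulated in i32.
fn gemm_i8_tiled(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for i0 in (0..m).step_by(TILE) {
        for j0 in (0..n).step_by(TILE) {
            for p0 in (0..k).step_by(TILE) {
                // In WGSL, this is where the A and B tiles would sit in
                // var<workgroup> arrays behind a workgroupBarrier().
                for i in i0..(i0 + TILE).min(m) {
                    for j in j0..(j0 + TILE).min(n) {
                        let mut acc = 0i32;
                        for p in p0..(p0 + TILE).min(k) {
                            acc += a[i * k + p] as i32 * b[p * n + j] as i32;
                        }
                        c[i * n + j] += acc;
                    }
                }
            }
        }
    }
    c
}

fn main() {
    // Tiny smoke test against a hand-computed 2x2 product.
    let a: Vec<i8> = vec![1, 2, 3, 4]; // [[1, 2], [3, 4]]
    let b: Vec<i8> = vec![5, 6, 7, 8]; // [[5, 6], [7, 8]]
    let c = gemm_i8_tiled(&a, &b, 2, 2, 2);
    assert_eq!(c, vec![19, 22, 43, 50]);
    println!("{:?}", c);
}
```

On the GPU the payoff depends on keeping the tiles resident in workgroup memory and sizing them to the Adreno's wave/LDS limits, which is exactly the part I'd love input on.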
Happy to answer any questions about the architecture or the build process!