Show HN: Local LLM on a Pi 4 controlling hardware via tool calling
3 points | 1 hour ago | 2 comments | github.com
stfurkan
1 hour ago
Hi HN,

I spent the weekend experimenting to see if I could get a proper LLM running locally on an old Raspberry Pi 4 (4GB), and more importantly, if I could get it to interact with the physical world.

I ended up using PrismML's new Bonsai models. Because they are genuinely 1-bit (trained from scratch at 1-bit precision, not post-training quantized down from a full-precision model), they actually fit: the 4B-parameter model is ~570 MB and the 1.7B is ~240 MB.

I loaded them through llama.cpp's router mode. I get around 2 tok/s on the 4B model for better reasoning, and 4-5 tok/s on the 1.7B when I just need speed. I tried Gemma 4 E2B first, but it was just too slow on 4GB of RAM.

The fun part: I wired up a cheap TM1637 4-digit display to the GPIO pins. Since Bonsai supports native tool calling, I wrote a small Python proxy that injects an update_display function into requests. When the model decides to use the tool, the proxy catches the streaming call, extracts the text, and drives the display. You can tell it to "show 1453" and it physically lights up.

It’s definitely just a weekend project with rough edges (7-segment displays can't render W or M, self-signed certs, etc.). The code and setup scripts are all in the repo.
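The W/M limitation falls out of the display geometry: seven segments have no diagonals. A minimal encoder sketch, using the common 7-segment bit layout (bit 0 = segment A through bit 6 = segment G); the mapping table is mine, not from the repo:

```python
# Common 7-segment encoding. W and M have no entry on purpose: seven
# straight segments can't draw their diagonal strokes.
SEGMENTS = {
    "0": 0x3F, "1": 0x06, "2": 0x5B, "3": 0x4F, "4": 0x66,
    "5": 0x6D, "6": 0x7D, "7": 0x07, "8": 0x7F, "9": 0x6F,
    "A": 0x77, "b": 0x7C, "C": 0x39, "d": 0x5E, "E": 0x79, "F": 0x71,
    " ": 0x00, "-": 0x40,
}

def encode(text: str, width: int = 4) -> list[int]:
    """Encode up to `width` characters for a 4-digit module; characters
    the display can't render become blanks rather than garbage."""
    return [SEGMENTS.get(ch, 0x00) for ch in text[:width]]
```

So "show 1453" becomes four segment bytes clocked out over the TM1637's two-wire protocol, while a "W" just blanks its digit.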

I’m thinking about adding servos or sensors next. Would love to hear your thoughts or see if anyone else is building edge AI hardware projects!

trailheadsec
1 hour ago
What’s the quality of the model output at this RAM / model selection? Local models fascinate me; I run Ollama on an M1 Max MacBook Pro with 64GB of RAM, but I am a little bit inexperienced with the ins and outs. Thank you for sharing!
stfurkan
56 minutes ago
I specifically chose PrismML's 1-bit models because their tiny size allows them to actually fit on smaller hardware like the Pi. The 1.7B model is great for basic tasks and tool triggers, while the 4B model seems reasonable for some daily tasks, though it's much slower on this setup. If you try these models on your M1 Max, I assume they'll run incredibly fast. I previously tried them on a VPS and the inference speed was really good for my experiment.