The only thing I think actually solves this is local inference. I remember browsing r/LocalLLaMA years ago and thinking: this is the future. Local models are finally good enough. I was playing with the Bonsai 8B 1-bit quant a few weeks back and I think we're almost there. I built friendAI to see if there's market demand for local inference. Everything runs on your phone.
What's actually on-device:
- Bonsai-8B (1-bit quantized Qwen3-8B, ~1.3GB) via MLX for speed
- Gemma 4 E2B (~4.5GB, GGUF) via llama.cpp for vision
- A unified client that routes between them (rough sketch below)
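The routing is simpler than it sounds. Something like this (a Swift sketch; `LocalLLM`, `UnifiedClient`, and the method names are made up for illustration, not the actual friendAI code). The only routing signal is modality: anything with an image attachment goes to the llama.cpp vision model, everything else takes the faster MLX text path.

```swift
// Illustrative sketch of the routing layer -- type and method names
// are hypothetical, not the actual friendAI implementation.
protocol LocalLLM {
    func generate(prompt: String, images: [Data]) async throws -> String
}

struct UnifiedClient {
    let textModel: LocalLLM   // Bonsai-8B via MLX
    let visionModel: LocalLLM // Gemma E2B via llama.cpp

    func respond(to prompt: String, images: [Data] = []) async throws -> String {
        // Route on modality: image attachments need the vision-capable
        // model; plain text takes the faster quantized text model.
        let model = images.isEmpty ? textModel : visionModel
        return try await model.generate(prompt: prompt, images: images)
    }
}
```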
A few things I'm reasonably proud of solving in about a week:
- Turns out the hardest part was managing background model downloads that survive crashes, network drops, and reboots. You can start chatting before the download finishes. (First sketch below.)
- Runtime thread auto-tuning that benchmarks your actual device at startup instead of guessing with a static heuristic. (Second sketch.)
- Local memory without a vector DB: TF-IDF-style ranking with recency decay, no embedding model needed. (Third sketch.)
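On the downloads: the rough shape, assuming iOS's background `URLSession` (identifiers, paths, and delegate wiring here are illustrative, not the shipping code). A background-configured session hands the transfer to the OS, so it keeps going if the app crashes or gets killed, and resume data covers mid-file network drops.

```swift
import Foundation

// Minimal sketch of crash-resilient model downloads via a background
// URLSession. Identifier and destination path are placeholders.
final class ModelDownloader: NSObject, URLSessionDownloadDelegate {
    private lazy var session: URLSession = {
        // A background configuration hands the transfer to the OS, so it
        // continues even if the app crashes or is terminated.
        let config = URLSessionConfiguration.background(withIdentifier: "model-downloads")
        config.isDiscretionary = false
        return URLSession(configuration: config, delegate: self, delegateQueue: nil)
    }()

    func start(_ url: URL) {
        session.downloadTask(with: url).resume()
    }

    func urlSession(_ session: URLSession, downloadTask: URLSessionDownloadTask,
                    didFinishDownloadingTo location: URL) {
        // The temp file must be moved before this callback returns.
        let name = downloadTask.originalRequest?.url?.lastPathComponent ?? "model.bin"
        let dest = FileManager.default.urls(for: .applicationSupportDirectory,
                                            in: .userDomainMask)[0]
            .appendingPathComponent(name)
        try? FileManager.default.moveItem(at: location, to: dest)
    }

    func urlSession(_ session: URLSession, task: URLSessionTask,
                    didCompleteWithError error: Error?) {
        // On a network drop, URLSession hands back resume data so the
        // transfer picks up mid-file instead of restarting from zero.
        if let data = (error as NSError?)?
            .userInfo[NSURLSessionDownloadTaskResumeData] as? Data {
            session.downloadTask(withResumeData: data).resume()
        }
    }
}
```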
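On thread tuning, a sketch of the general idea (the workload and candidate counts are made up): run a fixed amount of work at a few thread counts once at launch and keep the fastest, rather than hardcoding something like core count minus two.

```swift
import Foundation

// Illustrative startup auto-tuning: time a fixed total workload split
// across several thread counts and keep the fastest configuration.
func tuneThreadCount(candidates: [Int] = [2, 4, 6, 8]) -> Int {
    func benchmark(threads: Int) -> Double {
        let workPerThread = 4_000_000 / threads // fixed total work, divided up
        let group = DispatchGroup()
        let start = DispatchTime.now()
        for _ in 0..<threads {
            DispatchQueue.global().async(group: group) {
                // Stand-in for a slice of real inference work.
                var acc = 0.0
                for i in 0..<workPerThread { acc += sin(Double(i)) }
                _ = acc // keep the optimizer from dropping the loop
            }
        }
        group.wait()
        return Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
    }
    // Warm up once so the first candidate isn't penalized, then measure.
    _ = benchmark(threads: candidates[0])
    let timings = candidates.map { (count: $0, time: benchmark(threads: $0)) }
    return timings.min(by: { $0.time < $1.time })!.count
}
```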
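And the memory ranking, roughly (hypothetical types; the half-life is a placeholder): score each stored memory by IDF-weighted term overlap with the query, then multiply by an exponential decay on age. No embeddings, so nothing extra to download or keep in RAM.

```swift
import Foundation

// Sketch of vector-DB-free retrieval: TF-IDF-style term overlap with an
// exponential recency decay. Types and weights are illustrative.
struct Memory { let text: String; let createdAt: Date }

func retrieve(query: String, from memories: [Memory],
              halfLifeDays: Double = 30, topK: Int = 5) -> [Memory] {
    func tokens(_ s: String) -> [String] {
        s.lowercased().split { !$0.isLetter && !$0.isNumber }.map(String.init)
    }
    // Document frequency per term, for the IDF weighting.
    let docs = memories.map { Set(tokens($0.text)) }
    var df = [String: Int]()
    for doc in docs { for t in doc { df[t, default: 0] += 1 } }
    let n = Double(memories.count)

    let queryTerms = Set(tokens(query))
    let scored = memories.enumerated().map { (i, m) -> (Memory, Double) in
        // Relevance: sum of IDF over query terms present in this memory,
        // so rare shared terms count for more than common ones.
        let relevance = queryTerms
            .filter { docs[i].contains($0) }
            .reduce(0.0) { $0 + log(n / Double(df[$1] ?? 1)) }
        // Recency: exponential decay with a configurable half-life.
        let ageDays = Date().timeIntervalSince(m.createdAt) / 86_400
        return (m, relevance * pow(0.5, ageDays / halfLifeDays))
    }
    return scored.sorted { $0.1 > $1.1 }.prefix(topK).map { $0.0 }
}
```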
Happy to go deep on any of it. www.friendai.pro