Ichigo: Local real-time voice AI
202 points | 3 days ago | 10 comments | github.com | HN
There was an announcement on LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1g38e9s/ichigol...

There were several links:

- Blog for details: https://homebrew.ltd/blog/llama-learns-to-talk

- Code: https://github.com/homebrewltd/ichigo

- Run locally: https://github.com/homebrewltd/ichigo-demo/tree/docker

- Demo on a single 3090: https://ichigo.homebrew.ltd/

emreckartal
1 day ago
Emre here from Homebrew Research. It's great to see Ichigo on HN!

A quick intro: we're a company building local AI tools and training open-source models.

Ichigo is our training method that teaches LLMs to understand human speech and talk back with low latency, thanks to FishSpeech integration. It's open data and open weights, and it's weight-initialized from Llama 3.1, extending that model's reasoning ability.

Plus, we're the creators and lead maintainers of https://jan.ai/, a local AI assistant and alternative to ChatGPT, and https://cortex.so/, a local AI toolkit (soft launch coming soon).

Everything we build and train is done out in the open - we share our progress on:

https://x.com/homebrewltd and https://discord.gg/hTmEwgyrEg

You can check out all our products on our simple website: https://homebrew.ltd/

gnuly
1 day ago
Any plans to share progress on open channels like matrix.org or even IRC?
thruflo
1 day ago
Great stuff. Voice AI is worth running locally not just for privacy and access to personal data, but also because of the low-latency requirement. If there's a delay in conversation caused by a network call, it just feels weird, like an old satellite phone call.
cassepipe
3 days ago
Finally I can use one of the random facts that have been entering my brain for decades, even though I can't remember where my keys are.

If I remember correctly, "ichigo" means strawberry in Japanese. You are welcome.

SapporoChris
1 day ago
Sorry, you're wrong. It means 1 5. Just kidding: it is strawberry, but it can also be read as one and five. However, it is not fifteen.
TheCraiggers
19 hours ago
> it can also be read as one and five. However, it is not fifteen.

Can you help me wrap my brain around this? Does it mean six? I'm struggling to understand how a word can mean two numbers and how this would actually be used in a conversation.

Thanks. I'm curious, but searching for this just returns anime.

BugsJustFindMe
19 hours ago
> I'm struggling to understand how a word can mean two numbers

Ichi is the word for 1. Go is the word for 5.

TheCraiggers
19 hours ago
/smacks forehead.

Can't believe I fell for that.

gardenmud
7 hours ago
I mean, it wasn't really a trick.

It's exactly like someone telling a non-English speaker "onefive can be read as 'one five', but it's not 'fifteen'". I don't read a prank in that statement.

d3w3y
1 day ago
There are strawberries all over the readme, so I reckon you're right.
mmastrac
1 day ago
Is this a continuation of the meme that GPT can't identify the number of "R"s in "strawberry"?
TheDong
1 day ago
> How many 'r's are in the word 'ichigo'?

GPT 4o: The word "ichigo," which is the Romanized spelling (romaji) of いちご, contains one "r." It appears in the letter "r" in "chi," as the "ch" sound in romaji represents a combination of the "r" sound from "r" and "t" sound from "i."

Thank you chatgpt. I'm glad we've burned down a bunch of forests for this.

You can, though, consistently get the right answer with a prompt of:

> Write python code, and run it, to count the number of 'r' characters in いちご.

For numeric stuff, telling the thing to just write Python code makes it significantly better at getting right answers.
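
The code it ends up writing is basically a one-liner; a minimal sketch of it:

    # Count occurrences of the character 'r' in いちご.
    word = "いちご"
    print(word.count("r"))      # 0 - there is no 'r' in いちご
    print("ichigo".count("r"))  # also 0 for the romaji spelling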

BugsJustFindMe
19 hours ago
Without any special prompt change, I get

There are no “r”s in the word “ichigo.”

Maybe your instructions are bad.

dev-jayson
1 day ago
I think you might be on to something.
greydius
15 hours ago
I think it's a bit of wordplay. 苺 (strawberry) and 一語 (one word) are both read "ichigo".
AtlasBarfed
1 day ago
Getsuga tenshou!!
dumb1224
20 hours ago
Haha, I was looking for that!

Bankai 卍解

beretguy
1 day ago
Tatakae!
adammarples
18 hours ago
From the book Tomorrow, and Tomorrow, and Tomorrow?
zarmin
1 day ago
Your keys are in the fridge with the remote control.
tmshapland
1 day ago
This is a really cool project! What have people built with it? I'd love to learn what local apps people are building on top of it.
emreckartal
1 day ago
Thanks! We've received feedback on use cases like live translation, safe and untrackable educational tools for kids, and language-learning apps. There are so many possibilities, and we hope to see people building amazing products on top of Ichigo.
itake
1 day ago
I just tried the demo website for live translation. The AI always responded in English, either ignoring my request to respond only in French or Lao, or prefacing the translation with English ("I can translate that to French. The translation is: ...").

I'm trying to use ChatGPT for AI translation, but the other big problem I run into is TTS and STT for languages outside the top 40 (e.g. Lao). Facebook has a TTS library, but unfortunately it isn't licensed for commercial use.

emreckartal
1 day ago
Oh, I see. We've limited the demo to English for simplicity. More languages are planned for future releases.
itake
8 hours ago
What's the limiting factor in supporting all of Llama's languages for STT or TTS?
famahar
1 day ago
Looks impressive. I'm guessing the demo isn't representative of the full possibilities of this? I tried to have a basic conversation in Japanese and it kept sticking to English. When it did eventually speak Japanese, the pronunciation was completely off. I'm really excited about the possibility of local language learning with near-realtime conversation practice. Will keep an eye on this.
mentalgear
1 day ago
Kudos to the team, this is truly impressive work! It's exciting to see how AI connects with the local-first movement, which is exploding in popularity. (The idea of local-first, where data processing and functionality are prioritized on users' own devices, aligns perfectly with emerging privacy concerns and the push for decentralization.)

Bringing AI into this space enhances the user experience while respecting users' autonomy over their data. It feels like a promising step toward a future where we can leverage the power of AI without compromising on privacy or control. Really looking forward to seeing how this evolves!

cchance
1 day ago
It's amazing to see cool projects like this really, REALLY based in open source and open training. Wow.
emreckartal
1 day ago
Thanks! It's all open research, source code, data, and weights.
frankensteins
1 day ago
Great initiative! Before adding more comments: I'm trying to set it up on my local Mac M3 machine and having a hard time installing the dependencies. Anyone else having the same issue?
emreckartal
1 day ago
Thanks! You can't run Ichigo on a Mac M3 just yet. It'll be possible to run it locally on a Mac once we integrate it with Jan.ai.
p0larboy
1 day ago
Tried the demo, but all I got was "I'm sorry, I can't quite catch that."
emreckartal
1 day ago
We're running the demo on a single 3090, so it may sometimes be a bit buggy. You can try running it locally here: https://github.com/homebrewltd/ichigo-demo/tree/docker

The documentation isn't very detailed yet, but we're planning to improve it and add support for various hardware.

lostmsu
1 day ago
Very cool, but a bit less practical than some alternatives, because it doesn't seem to transcribe the user's request.
emreckartal
1 day ago
Actually, it does. You can turn on the transcription feature from the bottom right corner and even type to Ichigo if you want. We didn’t show it in the launch video since we were focusing on the verbal interaction side of things.
emreckartal
1 day ago
Ah, I see now.

To clarify, while you can enable transcription to see what Ichigo says, Ichigo's design skips directly from audio to speech representations without creating a text transcription of the user’s input. This makes interactions faster but does mean that the user's spoken input isn't transcribed to text.

The flow we use is Speech → Encoder → Speech Representations → LLM → Text → TTS. By skipping the text step, we're able to speed things up and focus on the verbal experience.
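
In rough pseudocode, that flow looks like this (every function name below is an illustrative stand-in, not our actual API):

    # Structural sketch of the flow above; all names are illustrative
    # stand-ins, not Ichigo's actual API.

    def speech_encoder(waveform: bytes) -> list[int]:
        ...  # Speech -> Encoder: quantize audio into discrete sound tokens

    def llm_generate(sound_tokens: list[int]) -> str:
        ...  # the LLM consumes sound tokens directly and emits reply text

    def text_to_speech(reply_text: str) -> bytes:
        ...  # TTS: synthesize audio for the reply

    def respond(waveform: bytes) -> bytes:
        sound_tokens = speech_encoder(waveform)
        reply_text = llm_generate(sound_tokens)  # this text is what transcription shows
        return text_to_speech(reply_text)

    # Note there is no user-speech -> text step anywhere in respond(),
    # which is why the request side has no transcript.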

Hope this clears things up!

lostmsu
11 hours ago
I understand that. The problem is that in many scenarios users would want to see transcripts of what they said alongside the model output. If I have a chat with a model about choosing a place to move to, I'd probably also want to review it later. And when I review it, I'll see: me: /audio recording/ AI: 200-300m. There's no easy way to see at a glance what the AI answer was about.
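
A client-side workaround could be a separate local STT pass purely for display, e.g. with the openai-whisper package. A sketch, assuming the client also saves each spoken request to a wav file (the filename here is hypothetical):

    # Transcribe the saved request audio locally, purely for the chat
    # history display; Ichigo itself never sees this text.
    import whisper

    model = whisper.load_model("base")             # small local model
    result = model.transcribe("request_0042.wav")  # hypothetical saved recording
    print(result["text"])                          # show next to the AI reply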