To run llama.cpp server: llama-server -m C:\orpheus-3b-0.1-ft-q4_k_m.gguf -c 8192 -ngl 28 --host 0.0.0.0 --port 1234 --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock
Like most other tokens, they have text reprs: '<custom_token_28631>' etc. You sample 7 of them (1 frame), parse out the IDs, pass them through the SNAC decoder, and you now have a frame of audio from a 'text' pipeline.
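In code, the per-frame step looks roughly like the sketch below. The -10 offset, the 4096-per-position stride, and the 1/2/4 layer interleave are assumptions taken from the orpheus-tts-local decoder, so double-check them against that repo before relying on the exact numbers.

```python
# Rough sketch: decode one 7-token Orpheus frame to audio with SNAC.
# Offsets and layer interleave are assumptions (see above), not gospel.
import re
import torch
from snac import SNAC  # pip install snac

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

def frame_to_audio(frame_tokens):
    """frame_tokens: list of 7 strings like '<custom_token_28631>'."""
    raw = [int(re.search(r"\d+", t).group()) for t in frame_tokens]
    # strip the per-position offset so every code lands in [0, 4096)
    codes = [n - 10 - (i % 7) * 4096 for i, n in enumerate(raw)]
    # redistribute the 7 codes across SNAC's three codebook layers (1 + 2 + 4)
    layers = [
        torch.tensor([codes[0]]).unsqueeze(0),
        torch.tensor([codes[1], codes[4]]).unsqueeze(0),
        torch.tensor([codes[2], codes[3], codes[5], codes[6]]).unsqueeze(0),
    ]
    with torch.no_grad():
        audio = snac_model.decode(layers)  # 24 kHz waveform for this frame
    return audio
```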
The neat thing about this design is you can throw the model into any existing text-text pipeline and it just works.
On an Nvidia 4090, it's producing:
prompt eval time = 17.93 ms / 24 tokens ( 0.75 ms per token, 1338.39 tokens per second)
eval time = 2382.95 ms / 421 tokens ( 5.66 ms per token, 176.67 tokens per second)
total time = 2400.89 ms / 445 tokens
*A correction to the llama.cpp server command above: there are 29 layers, so it should read "-ngl 29" to load all the layers onto the GPU. You can run `python gguf_orpheus.py --text "Hello, this is a test" --voice tara` and connect to the llama-server (a sketch of the underlying API call is below).
See https://github.com/isaiahbjork/orpheus-tts-local
See my GH issue for example output: https://github.com/isaiahbjork/orpheus-tts-local/issues/15
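For the curious, what the script does against the server is essentially a streaming completion request over llama.cpp's HTTP API. Here's a minimal sketch; the prompt formatting is a placeholder assumption, since gguf_orpheus.py builds the real one (voice name plus Orpheus's special tokens).

```python
# Minimal sketch of streaming Orpheus tokens out of llama-server's /completion
# endpoint. The prompt wrapper is a placeholder; gguf_orpheus.py builds the
# real one.
import json
import requests

LLAMA_SERVER = "http://localhost:1234/completion"  # matches --port 1234 above

def stream_orpheus_tokens(formatted_prompt):
    payload = {
        "prompt": formatted_prompt,  # assumed already formatted for Orpheus
        "n_predict": 2048,
        "temperature": 0.6,
        "stream": True,
    }
    with requests.post(LLAMA_SERVER, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            chunk = json.loads(line[len(b"data: "):])
            piece = chunk.get("content", "")
            if "<custom_token_" in piece:
                yield piece  # feed these into the SNAC decoding step above
            if chunk.get("stop"):
                break
```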
Having said that, I'm fully in favor of open source and am a big proponent of open source models like this. ElevenLabs in particular has the highest quality (I tested a lot of models for a tool I'm building [3]), but its pricing is also 400 times that of the rest. You easily pay multiple dollars per minute of text-to-speech generation. For people interested, the best audio quality I could get so far is [4]. Someone told me he wouldn't be able to tell that the voice was not real.
I was such a fan of CoquiTTS and so happy when they launched a commercially licensed offering. I didn't mind taking a small hit on quality if it enabled us to support them.
And then, the quality of the API outputs was lower than what the self-hosted open source Coqui model provided... I'm thinking this was one of the reasons usage was not at the level they hoped for, and they ended up folding.
The saddest part is they still didn't assign commercial rights to the open-source model, so I think Coqui is at a dead end now.
Crazy.
Though I still wish open source would do better than ElevenLabs, it's all just a dream.
Orpheus would be great to get wired up. I'm wondering how well their smallest model will run and whether it will be fast enough for real time.
It's the vocal equivalent of a triple-jointed arm, or a horizon that's different on the left and right side of a portrait.
Would any of the models run on something like a raspberry pi?
How about a smartphone?
That said, if you want something to use today on a Pi, you should check out Kokoro.
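A minimal sketch of what that looks like, assuming the kokoro-onnx Python package (pip install kokoro-onnx soundfile); the model/voice file names and the voice ID below are placeholders, so grab the actual files listed in that project's README.

```python
# Sketch: Kokoro TTS via ONNX Runtime, small enough to run on a Pi.
# File names and the voice ID are placeholders; see the kokoro-onnx README.
import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sample_rate = kokoro.create(
    "Hello from a Raspberry Pi.",
    voice="af_sarah",  # placeholder voice ID
    speed=1.0,
    lang="en-us",
)
sf.write("out.wav", samples, sample_rate)
```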
You can also point sherpa_onnx in your pubspec.yaml file to a local dir (after cloning the repo somewhere on your file system) or to a specific git commit hash, and don't forget to specify the path, because the package isn't at the root of the repo. Here's a link to the dir of the flutter package: https://github.com/k2-fsa/sherpa-onnx/tree/master/flutter.
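For example, the git-dependency form looks roughly like this; the sub-path is an assumption on my part, so check the linked flutter/ directory for where the Dart package actually lives:

```yaml
dependencies:
  sherpa_onnx:
    git:
      url: https://github.com/k2-fsa/sherpa-onnx.git
      ref: <commit-hash>         # pin a specific commit if you want
      path: flutter/sherpa_onnx  # assumed sub-path; the package isn't at the repo root
```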
Compatible model downloads can be found in their GitHub Releases, but tbh it's a bit of a strange setup IMO. Here's the page for TTS models, for example: https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-model...
That same "release" (page) gets updated from time to time with new models. Use a bookmark :p
-
@csukuangfj, thanks for sharing the hard work. Nice to see you here.
Two questions / thoughts:
1. I stumbled for a while looking for the license on your website before finding the Apache 2.0 mark on the Hugging Face model. That's big! Advertising that on your website and the GitHub repo would be nice. What's the business model, though?
2. Given the Llama 3 backbone, what's the lift to make this runnable in other languages and inference frameworks? (Specifically asking about MLX, but also llama.cpp, Ollama, etc.)
> the code in this repo is Apache 2 now added, the model weights are the same as the Llama license as they are a derivative work.
https://github.com/canopyai/Orpheus-TTS/issues/33#issuecomme...
However, it's not a very good reading of the script, in human terms. It feels even more forced and phony than the aforementioned influencers.
- in the prompt "SO serious" it pronounces each letter as "ess oh" instead of emphasizing the word "so"
- there's no breathing sounds or natural breathing based pauses
Choosing which words in a sentence to emphasize can completely change its meaning. This model doesn't appear to be able to do that.
Still, huge progress over where we were just a couple years ago.
For language models, I understand the thinking quality is different. But for TTS? Has anyone used small models in a production use case?