To run llama.cpp server: llama-server -m C:\orpheus-3b-0.1-ft-q4_k_m.gguf -c 8192 -ngl 28 --host 0.0.0.0 --port 1234 --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock
Like most other tokens, they have text reprs: '<custom_token_28631>' etc. You sample 7 of them (1 frame), parse out the IDs, pass them through the SNAC decoder, and you now have a frame of audio from a 'text' pipeline.
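In code, the per-frame step looks roughly like the sketch below. The -10 offset, the 4096-per-position stride, and the 1/2/4 layer interleave are assumptions taken from the orpheus-tts-local decoder, so double-check them against that repo before relying on the exact numbers.

```python
# Rough sketch: decode one 7-token Orpheus frame to audio with SNAC.
# Offsets and layer interleave are assumptions (see above), not gospel.
import re
import torch
from snac import SNAC  # pip install snac

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

def frame_to_audio(frame_tokens):
    """frame_tokens: list of 7 strings like '<custom_token_28631>'."""
    raw = [int(re.search(r"\d+", t).group()) for t in frame_tokens]
    # strip the per-position offset so every code lands in [0, 4096)
    codes = [n - 10 - (i % 7) * 4096 for i, n in enumerate(raw)]
    # redistribute the 7 codes across SNAC's three codebook layers (1 + 2 + 4)
    layers = [
        torch.tensor([codes[0]]).unsqueeze(0),
        torch.tensor([codes[1], codes[4]]).unsqueeze(0),
        torch.tensor([codes[2], codes[3], codes[5], codes[6]]).unsqueeze(0),
    ]
    with torch.no_grad():
        audio = snac_model.decode(layers)  # 24 kHz waveform for this frame
    return audio
```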
The neat thing about this design is you can throw the model into any existing text-text pipeline and it just works.
On an Nvidia 4090, it's producing:
prompt eval time = 17.93 ms / 24 tokens ( 0.75 ms per token, 1338.39 tokens per second)
eval time = 2382.95 ms / 421 tokens ( 5.66 ms per token, 176.67 tokens per second)
total time = 2400.89 ms / 445 tokens
*A correction to the llama.cpp server command above: there are 29 layers, so it should read "-ngl 29" to load all the layers onto the GPU. You can run `python gguf_orpheus.py --text "Hello, this is a test" --voice tara` and connect to the llama-server (a sketch of the underlying API call is below).
See https://github.com/isaiahbjork/orpheus-tts-local
See my GH issue for example output: https://github.com/isaiahbjork/orpheus-tts-local/issues/15
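For the curious, what the script does against the server is essentially a streaming completion request over llama.cpp's HTTP API. Here's a minimal sketch; the prompt formatting is a placeholder assumption, since gguf_orpheus.py builds the real one (voice name plus Orpheus's special tokens).

```python
# Minimal sketch of streaming Orpheus tokens out of llama-server's /completion
# endpoint. The prompt wrapper is a placeholder; gguf_orpheus.py builds the
# real one.
import json
import requests

LLAMA_SERVER = "http://localhost:1234/completion"  # matches --port 1234 above

def stream_orpheus_tokens(formatted_prompt):
    payload = {
        "prompt": formatted_prompt,  # assumed already formatted for Orpheus
        "n_predict": 2048,
        "temperature": 0.6,
        "stream": True,
    }
    with requests.post(LLAMA_SERVER, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            chunk = json.loads(line[len(b"data: "):])
            piece = chunk.get("content", "")
            if "<custom_token_" in piece:
                yield piece  # feed these into the SNAC decoding step above
            if chunk.get("stop"):
                break
```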
Having said that, I'm fully in favor of open source and am a big proponent of open source models like this. ElevenLabs in particular has the highest quality (I tested a lot of models for a tool I'm building [3]), but its pricing is also 400 times that of the rest. You easily pay multiple dollars per minute of text-to-speech generation. For people interested, the best audio quality I could get so far is [4]. Someone told me he wouldn't be able to tell that the voice was not real.
I was such a fan of CoquiTTS and so happy when they launched a commercially licensed offering. I didn't mind taking a small hit on quality if it enabled us to support them.
And then, the quality of the API outputs was lower than what the self-hosted open source Coqui model provided... I'm thinking this was one of the reasons usage was not at the level they hoped for, and they ended up folding.
The saddest part is they still didn't assign commercial rights to the open-source model, so I think Coqui is at a dead end now.
Crazy.
Though I still wish open source would do better than ElevenLabs, it's all just a dream.
Orpheus would be great to get wired up. I'm wondering how well their smallest model will run and whether it will be fast enough for real time.
It's the vocal equivalent of a triple-jointed arm, or a horizon that's different on the left and right side of a portrait.
Would any of the models run on something like a raspberry pi?
How about a smartphone?
That said, if you want something to use today on a Pi, you should check out Kokoro.
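A minimal sketch of what that looks like, assuming the kokoro-onnx Python package (pip install kokoro-onnx soundfile); the model/voice file names and the voice ID below are placeholders, so grab the actual files listed in that project's README.

```python
# Sketch: Kokoro TTS via ONNX Runtime, small enough to run on a Pi.
# File names and the voice ID are placeholders; see the kokoro-onnx README.
import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sample_rate = kokoro.create(
    "Hello from a Raspberry Pi.",
    voice="af_sarah",  # placeholder voice ID
    speed=1.0,
    lang="en-us",
)
sf.write("out.wav", samples, sample_rate)
```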
You can also point sherpa_onnx in your pubspec.yaml file to a local dir (after cloning the repo somewhere on your file system) or to a specific git commit hash, and don't forget to specify the path, because the package isn't at the root of the repo. Here's a link to the dir of the flutter package: https://github.com/k2-fsa/sherpa-onnx/tree/master/flutter.
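For example, the git-dependency form looks roughly like this; the sub-path is an assumption on my part, so check the linked flutter/ directory for where the Dart package actually lives:

```yaml
dependencies:
  sherpa_onnx:
    git:
      url: https://github.com/k2-fsa/sherpa-onnx.git
      ref: <commit-hash>         # pin a specific commit if you want
      path: flutter/sherpa_onnx  # assumed sub-path; the package isn't at the repo root
```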
Compatible model downloads can be found in their GitHub Releases, but tbh it's a bit of a strange setup IMO. Here's the page for TTS models, for example: https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-model...
That same "release" (page) gets updated from time to time with new models. Use a bookmark :p
-
@csukuangfj, thanks for sharing the hard work. Nice to see you here.
Two questions / thoughts:
1. I stumbled for a while looking for the license on your website before finding the Apache 2.0 mark on the Hugging Face model. That's big! Advertising that on your website and the GitHub repo would be nice. What's the business model, though?
2. Given the Llama 3 backbone, what's the lift to make this runnable in other languages and inference frameworks? (Specifically asking about MLX, but also llama.cpp, Ollama, etc.)
> the code in this repo is Apache 2 now added, the model weights are the same as the Llama license as they are a derivative work.
https://github.com/canopyai/Orpheus-TTS/issues/33#issuecomme...
However, it's not a very good reading of the script, in human terms. It feels even more forced and phony than the aforementioned influencers.
- in the prompt "SO serious" it pronounces each letter as "ess oh" instead of emphasizing the word "so"
- there's no breathing sounds or natural breathing based pauses
Choosing which words in a sentence to emphasize can completely change its meaning. This model doesn't appear to be able to do that.
Still, huge progress over where we were just a couple years ago.
For language models, I understand the thinking quality is different. But for TTS? Has anyone used small models in a production use case?